python - Memory-efficient way to keep paired lines (string match)
I have large text files that I need to sort through to remove unpaired lines. Paired lines are consecutive lines that have the same 32 characters at the beginning of each line. I have a script written in Python with a while loop that iterates through the lines, compares the first 32 characters of lines (i) and (i+1), and outputs the lines that are in pairs. However, this method is memory intensive and slow, since each file can be several gigabytes. Is there a faster and more efficient method you would recommend? For reference, I am working on a SLURM Linux server.
This uses two of my favorite Python modules, itertools and collections. Use itertools.groupby to walk the lines in the file, grouping them by common prefix. Use next() to pull the first element from the lines iterator, and a 0-length deque to consume the remainder of the lines iterator.
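    from itertools import groupby
    from collections import deque
    from operator import itemgetter

    # a 0-length deque discards everything it is fed, so its extend()
    # method is a fast way to exhaust an iterator
    consume = deque(maxlen=0).extend

    # the key for each line is its first 32 characters
    prefix_slice = slice(0, 32)

    with open('bigfile.txt') as infile:
        for _, lines in groupby(infile, key=itemgetter(prefix_slice)):
            # print the first line of each group of matching prefixes
            print(next(lines).rstrip('\n'))
            # have to consume the iterator over the remaining lines before
            # advancing to the next groupby key
            consume(lines)

This will hold only a single line in memory at a time, plus the 32-character prefix used for comparison with the following lines. (Note that it collapses any run of consecutive lines that share a common 32-character prefix into one line, not just pairs, and that a line with no partner is still printed once.)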
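Since the question asks to drop unpaired lines rather than collapse duplicates, here is a minimal sketch of a variant on the same groupby idea that emits a group only when it contains at least two lines. It is not part of the original answer. The names keep_paired and filter_file are made up for illustration, the file paths are placeholders, and keeping runs longer than two lines whole is an assumption about what "paired" should mean.

```python
from itertools import groupby
from operator import itemgetter

# Assumption: the pairing key is the first 32 characters, as in the question.
prefix_key = itemgetter(slice(0, 32))

def keep_paired(lines, key=prefix_key):
    """Yield only lines that sit in a run of two or more lines sharing a key."""
    for _, group in groupby(lines, key=key):
        first = next(group)
        second = next(group, None)
        if second is None:
            continue  # run of length one: unpaired, so drop it
        yield first
        yield second
        # Assumption: runs longer than two are kept whole.
        for line in group:
            yield line

def filter_file(src, dst):
    """Stream src to dst, keeping paired lines only (paths are placeholders)."""
    with open(src) as infile, open(dst, "w") as outfile:
        outfile.writelines(keep_paired(infile))
```

Calling filter_file('bigfile.txt', 'paired.txt') would stream the input once. At any moment it holds only the current line, one look-ahead line, and the group key, so memory use should not grow with file size.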