python - Memory efficient way to keep paired lines (string match)
I have large text files that I need to sort through and remove unpaired lines from. Paired lines are consecutive lines that have the same 32 characters at the beginning of each line. I have a script written in Python with a while loop that iterates through the lines, compares the first 32 characters of lines i and (i+1), and outputs the lines that come in pairs. However, this method is memory intensive and slow, since each file can be several gigabytes. Is there a faster and more efficient method you would recommend? For reference, I am working on a SLURM Linux server.
This uses two of my favorite Python modules, itertools and collections. Use itertools.groupby to walk the lines in the file, grouping them by common prefix. Use next() to pull the first element from the lines iterator, and a 0-length deque to consume the remainder of the lines iterator.
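    from itertools import groupby
    from collections import deque
    from operator import itemgetter

    # A deque with maxlen=0 discards everything fed to it, so extend()
    # exhausts an iterator without storing any of its items.
    consume = deque(maxlen=0).extend

    # The key for each line is its first 32 characters.
    prefix_slice = slice(0, 32)

    with open('bigfile.txt') as infile:
        for _, lines in groupby(infile, key=itemgetter(prefix_slice)):
            # Print only the first line of each group of consecutive
            # lines that share the same prefix.
            print(next(lines).rstrip('\n'))
            # Consume the remaining lines in this group before
            # advancing to the next groupby key.
            consume(lines)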
This will hold only a single line in memory at a time, plus the 32-character prefix for comparison with the following lines. (It will also collapse any number of consecutive lines that share a common 32-character prefix, not just pairs.)
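Note that the snippet above prints the first line of every group, including groups that contain only one line. If the goal is to drop unpaired lines and keep the paired ones, as the question describes, a variant along the following lines might work. This is only a sketch: the names keep_paired_lines and filter_file are hypothetical, and it assumes that a run of three or more lines sharing a prefix should be kept in full.

```python
from itertools import groupby


def keep_paired_lines(lines, prefix_len=32):
    """Yield only lines from runs of 2+ consecutive lines sharing a prefix.

    Assumption: runs longer than two lines are kept whole.
    """
    for _, group in groupby(lines, key=lambda line: line[:prefix_len]):
        first = next(group)
        # next() with a default returns None when the group has one line.
        second = next(group, None)
        if second is None:
            continue  # unpaired line: skip it
        yield first
        yield second
        # Pass through any further lines in the same run.
        for extra in group:
            yield extra


def filter_file(src_path, dst_path, prefix_len=32):
    """Stream src_path to dst_path, writing only the paired lines."""
    with open(src_path) as infile, open(dst_path, 'w') as outfile:
        outfile.writelines(keep_paired_lines(infile, prefix_len))
```

A call such as filter_file('bigfile.txt', 'paired.txt') (both file names are placeholders) would stream the input once, holding at most a couple of lines in memory at any time.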