python - Memory-efficient way to keep paired lines (string match)


I have large text files that I need to sort through, removing the unpaired lines. Paired lines are consecutive lines that have the same 32 characters at the beginning of each line. I have a script written in Python with a while loop that iterates through the lines, compares the first 32 characters of lines (i) and (i+1), and outputs the lines that are in pairs. However, this method is memory intensive and slow, since each file can be several gigabytes. Is there a faster and more efficient method you would recommend? For reference, I am working on a SLURM Linux server.

This uses two of my favorite Python modules, itertools and collections. Use itertools.groupby to walk the lines in the file, grouping them by their common prefix. Use next() to pull the first element from each group's lines iterator, and a 0-length deque to consume the remainder of that iterator.

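    from collections import deque
    from itertools import groupby
    from operator import itemgetter

    # a 0-length deque discards everything fed to extend(),
    # which makes it a cheap way to exhaust an iterator
    consume = deque(maxlen=0).extend

    # the first 32 characters of each line form the grouping key
    prefix_slice = slice(0, 32)

    with open('bigfile.txt') as infile:
        for _, lines in groupby(infile, key=itemgetter(prefix_slice)):
            # print only the first line of each group
            print(next(lines).rstrip('\n'))
            # have to consume the iterator on the remaining lines before
            # advancing to the next groupby key
            consume(lines)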

This will only hold a single line in memory at a time, plus the 32-character prefix for comparison with the following lines. (It will collapse all consecutive lines that share a common 32-character prefix, not just pairs.)
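The snippet above prints one line per group, so a line with no partner still appears in the output. If the goal is to drop unpaired lines and keep both members of each pair, a variant along the following lines might work. It is only a sketch: the names `paired_lines`, `filter_file`, and `PREFIX_LEN` are made up for illustration, and it assumes that a run of three or more lines sharing a prefix should be kept whole.

```python
from itertools import groupby

PREFIX_LEN = 32  # prefix length taken from the question


def paired_lines(lines, prefix_len=PREFIX_LEN):
    """Yield only lines that share their prefix with an adjacent line."""
    for _, group in groupby(lines, key=lambda line: line[:prefix_len]):
        first = next(group)
        second = next(group, None)
        if second is None:
            # singleton group: this line has no partner, so skip it
            continue
        yield first
        yield second
        # assumption: longer runs with the same prefix are kept in full
        for line in group:
            yield line


def filter_file(src_path, dst_path):
    """Stream src_path to dst_path, dropping unpaired lines."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(paired_lines(src))
```

Because `paired_lines` is a generator fed by the file iterator, only a couple of lines should be held in memory at any moment, regardless of file size. A call such as `filter_file('bigfile.txt', 'paired.txt')` would write the result to a second file (both file names are placeholders).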

