regex - Python: find n words before and after a certain word
Let's say I have a text file that I should read, like:
... department of called (DOS) , more texts , more text...
While reading the text file I want to find an acronym, here:
DOS
To find the acronym I wrote:
    import re
    import numpy

    # Open file?
    test_string = " lot of text read file ... department of called (DOS) , more texts , more text..."

    # An acronym starts and ends with an uppercase letter.
    regex = r'\b[A-Z][a-zA-Z\.]*[A-Z]\b\.?'
    found = re.findall(regex, test_string)
    print(found)

And the output is:

    ['DOS']
What I want is:
1. While reading the file, find the acronym (here DOS).
2. Calculate the number of characters of the acronym found (here 3 characters, DOS).
3. Find twice that many words (here 2 x 3 = 6) before and after 'DOS'. Here that would be:
   3.1 pre = department of called
   3.2 acronym = DOS
   3.3 post = , more texts , more
4. Put these 3 (pre, acronym, post) in an array.
Any help is appreciated since I am new to Python.
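The four steps above can be sketched by splitting the text into words first and then slicing around each acronym. This is only a minimal sketch under assumptions that are not in the question: a "word" is taken to be a run of letters, apostrophes, or hyphens, the names `acronym_contexts`, `ACRONYM`, and `WORD` are hypothetical, the sample sentence is made up, and a plain list of lists stands in for the array.

```python
import re

# Assumed pattern: an acronym starts and ends with an uppercase letter.
ACRONYM = re.compile(r"\b[A-Z][a-zA-Z.]*[A-Z]\b\.?")
# Assumed definition of a "word": letters, apostrophes, or hyphens.
WORD = re.compile(r"[A-Za-z'-]+")


def acronym_contexts(text):
    """Return one [pre, acronym, post] list per acronym found in text."""
    # Remember where each word starts so it can be compared with the match.
    words = [(w.start(), w.group()) for w in WORD.finditer(text)]
    results = []
    for m in ACRONYM.finditer(text):
        n = 2 * len(m.group())  # twice the acronym length, counted in words
        before = [w for pos, w in words if pos < m.start()][-n:]
        after = [w for pos, w in words if pos >= m.end()][:n]
        results.append([" ".join(before), m.group(), " ".join(after)])
    return results


# Made-up sample text, not the text from the question.
sample = "we met staff of the world health organization (WHO) in geneva last week"
for pre, acronym, post in acronym_contexts(sample):
    print(pre, "|", acronym, "|", post)
```

Because the context is rebuilt from the word list, punctuation such as the parentheses around the acronym is dropped from `pre` and `post`. If fewer than n words exist on one side, that side simply contains what is available.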
I am not sure if the following is the best solution, but maybe it is enough for you.
    import re
    import numpy

    # Open file?
    test_string = " lot of text read file ... department of called (DOS) , more texts , more text..."

    # An acronym starts and ends with an uppercase letter.
    regex_acronym = r'\b[A-Z][a-zA-Z\.]*[A-Z]\b\.?'
    ra = re.compile(regex_acronym)
    for m in ra.finditer(test_string):
        print(m.start(), m.group(), m.span())
        n = len(m.group()) * 2  # number of words wanted on each side

    # Up to n words before the acronym, the acronym itself, up to n words after it.
    regex_pre_post = r"((?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,%d})(" % n
    regex_pre_post += regex_acronym
    regex_pre_post += ")((?:[^a-zA-Z'-]+[a-zA-Z'-]+){0,%d})" % n

    found = re.findall(regex_pre_post, test_string)
    print(found)

    found = found[0]  # For a single match, do this.
    pre = found[0]
    acro = found[1]
    post = found[2]

    print(pre, acro, post)
This will give you:

    69 DOS (69, 72)
    [('file ... department of called (', 'DOS', ') , more texts , more')]
    file ... department of called ( DOS ) , more texts , more
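The code above handles a single match and uses one value of n for the whole text. If the text holds several acronyms of different lengths, one possible extension is to compute n per acronym and apply the two word patterns to the text on either side of each match. The following is a hedged sketch, not part of the original answer: the function name `contexts_per_acronym` and the sample sentence are hypothetical, and the word patterns are the same ones used above.

```python
import re

# Same acronym pattern as above: starts and ends with an uppercase letter.
ACRONYM = re.compile(r"\b[A-Z][a-zA-Z.]*[A-Z]\b\.?")


def contexts_per_acronym(text):
    """Yield (pre, acronym, post) for every acronym, each with its own n."""
    for m in ACRONYM.finditer(text):
        n = 2 * len(m.group())
        # Up to n words, each followed by non-word characters,
        # ending exactly where the acronym starts.
        pre = re.search(r"(?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,%d}\Z" % n,
                        text[:m.start()])
        # Up to n words, each preceded by non-word characters,
        # starting exactly where the acronym ends.
        post = re.match(r"(?:[^a-zA-Z'-]+[a-zA-Z'-]+){0,%d}" % n,
                        text[m.end():])
        # Both patterns allow zero repetitions, so they always match.
        yield pre.group(), m.group(), post.group()


# Made-up sample text with two acronyms of different lengths.
sample = ("staff of the world health organization (WHO) met experts from "
          "the north atlantic treaty organization (NATO) in brussels")
triples = list(contexts_per_acronym(sample))
for pre, acro, post in triples:
    print(repr(pre), acro, repr(post))
```

As in the answer, `pre` and `post` keep the surrounding punctuation, so the parentheses around each acronym show up in the context strings. Call `.strip("() ,")` or similar on them if only the words are wanted.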