regex - Python: find n words before and after a certain word
Let's say I have a text file. It should read something like:

    ... department of called (DOS) , more texts , more text...

While reading the text file, I want to find an acronym, here DOS. To find the acronym I wrote:

    import re
    import numpy

    # open file?
    test_string = " lot of text read file ... department of called (DOS) , more texts , more text..."
    regex = r'\b[A-Z][a-zA-Z\.]*[A-Z]\b\.?'

    found = re.findall(regex, test_string)
    print(found)

and the output is:

    ['DOS']

What I want is:
- while reading the file, find the acronym (here DOS),
- calculate the number of characters of the acronym found (here 3 characters: DOS),
- find 2 times that many words (here 2x3=6) before and after 'DOS'. Here that would be:
  - 3.1 pre = department of called
  - 3.2 acronym = DOS
  - 3.3 post = , more texts , more
- put these 3 (pre, acronym, post) in an array.
Any help is appreciated, since I am new to Python.
I'm not sure this is the best solution, but maybe it's enough for you.
    import re
    import numpy

    # open file?
    test_string = " lot of text read file ... department of called (DOS) , more texts , more text..."
    regex_acronym = r'\b[A-Z][a-zA-Z\.]*[A-Z]\b\.?'

    ra = re.compile(regex_acronym)
    for m in ra.finditer(test_string):
        print(m.start(), m.group(), m.span())
        # number of words to keep on each side: twice the acronym length
        n = len(m.group()) * 2
        # up to n "word + separator" pairs, the acronym, then up to n "separator + word" pairs
        regex_pre_post = r"((?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,%d})(" % n
        regex_pre_post += regex_acronym
        regex_pre_post += ")((?:[^a-zA-Z'-]+[a-zA-Z'-]+){0,%d})" % n
        found = re.findall(regex_pre_post, test_string)
        print(found)

        found = found[0]  # only a single match in this example
        pre = found[0]
        acro = found[1]
        post = found[2]
        print(pre, acro, post)

This will give you:

    49 DOS (49, 52)
    [('text read file ... department of called (', 'DOS', ') , more texts , more text')]
    text read file ... department of called ( DOS ) , more texts , more text
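If the text contains more than one acronym, the same idea can be wrapped in a function that collects one (pre, acronym, post) tuple per match. The following is only a sketch under a few assumptions: acronym_contexts, contexts_from_file and the factor parameter are made-up names, the context is gathered by slicing the text around each match instead of building a second regex, and punctuation between words is dropped, so the result is not identical to the output above.

```python
import re

# Same acronym pattern as in the answer: starts and ends with an uppercase letter.
ACRONYM = re.compile(r"\b[A-Z][a-zA-Z\.]*[A-Z]\b\.?")
# A "word" uses the same character class as the pre/post regex in the answer.
WORD = re.compile(r"[a-zA-Z'-]+")


def acronym_contexts(text, factor=2):
    """Return a list of (pre, acronym, post) tuples, one per acronym in text.

    pre and post hold up to factor * len(acronym) words from each side,
    joined by single spaces (punctuation is not kept).
    """
    results = []
    for m in ACRONYM.finditer(text):
        n = factor * len(m.group())
        words_before = WORD.findall(text[:m.start()])
        words_after = WORD.findall(text[m.end():])
        # Guard n == 0: words_before[-0:] would return the whole list.
        before = words_before[-n:] if n else []
        after = words_after[:n]
        results.append((" ".join(before), m.group(), " ".join(after)))
    return results


def contexts_from_file(path):
    """Hypothetical helper: read the whole file, then extract the contexts."""
    with open(path) as fh:
        return acronym_contexts(fh.read())
```

For the sample string used above, acronym_contexts(test_string) returns [('text read file department of called', 'DOS', 'more texts more text')]. The file is read in one go on purpose: processing it line by line would cut off context that spans a line break.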