regex - Python: find n words before and after a certain word


Let's say I have a text file. I should read it, something like:

 ... department of called (DoS) , more texts , more text... 

and while reading the text file I find an acronym, here

DoS

So for finding the acronym I wrote:

    import re
    import numpy

    # open the file?
    test_string = " lot of text read file ... department of called (DoS) , more texts , more text..."
    # an acronym starts and ends with a capital letter; dots are allowed inside
    regex = r'\b[A-Z][a-zA-Z\.]*[A-Z]\b\.?'

    found = re.findall(regex, test_string)
    print(found)

and the output is:

    ['DoS']
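To see what this pattern accepts, here is a small check on a made-up sentence. It assumes the character classes are meant to be case-sensitive (`[A-Z]` and `[a-zA-Z\.]`); the helper name `find_acronyms` and the sample sentence are only for illustration.

```python
import re

# Assumed case-sensitive form of the pattern: a capital letter, then any
# letters or dots, then a capital letter, with an optional trailing dot.
ACRONYM_PATTERN = r'\b[A-Z][a-zA-Z\.]*[A-Z]\b\.?'


def find_acronyms(text):
    """Return every substring of text that the acronym pattern matches."""
    return re.findall(ACRONYM_PATTERN, text)


# Hypothetical sample sentence, not taken from the question.
print(find_acronyms("The U.S.A. and NASA fund the Department; a PhD helps."))
```

Ordinary capitalized words such as "The" or "Department" are skipped because they do not end in a capital letter, while "U.S.A.", "NASA" and "PhD" are picked up.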

What I want is:

  1. while reading the file, find an acronym (here DoS),
  2. calculate the number of characters of the found acronym (here 3 chars for DoS),
  3. find 2 times that number (here 2x3=6) of words before and after 'DoS'. Here that would be:

    3.1 pre=     department of called
    3.2 acronym= DoS
    3.3 post=    , more texts , more

  4. put these 3 (pre, acronym, post) in an array.

Any help is appreciated since I am new to Python.

I'm not sure if this is the best solution, but maybe it's enough for you.

    import re
    import numpy

    # open the file?
    test_string = " lot of text read file ... department of called (DoS) , more texts , more text..."
    regex_acronym = r'\b[A-Z][a-zA-Z\.]*[A-Z]\b\.?'

    ra = re.compile(regex_acronym)
    for m in ra.finditer(test_string):
        print(m.start(), m.group(), m.span())
        # take twice as many words as the acronym has characters
        n = len(m.group()) * 2
        # up to n words before, the acronym itself, up to n words after
        regex_pre_post = r"((?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,%d})(" % n
        regex_pre_post += regex_acronym
        regex_pre_post += ")((?:[^a-zA-Z'-]+[a-zA-Z'-]+){0,%d})" % n
        found = re.findall(regex_pre_post, test_string)
        print(found)

        found = found[0]  # only a single match in this example
        pre = found[0]
        acro = found[1]
        post = found[2]
        print(pre, acro, post)

With the test string above, this will give you:

    49 DoS (49, 52)
    [('text read file ... department of called (', 'DoS', ') , more texts , more text')]
    text read file ... department of called ( DoS ) , more texts , more text
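If the text contains more than one acronym, a variant that slices the text around each match may be easier to extend. The following is only a sketch, not the method used above: it assumes the case-sensitive acronym pattern, treats a "word" as a run of letters, apostrophes or hyphens, and drops punctuation from the context. The names `acronym_contexts`, `factor` and `sample` are made up for illustration.

```python
import re

# Same acronym pattern as above, assuming case-sensitive character classes.
ACRONYM = re.compile(r'\b[A-Z][a-zA-Z\.]*[A-Z]\b\.?')
# Assumption: a "word" is a run of letters, apostrophes or hyphens.
WORD = re.compile(r"[a-zA-Z'-]+")


def acronym_contexts(text, factor=2):
    """Return a list of [pre, acronym, post] for every acronym in text."""
    results = []
    for m in ACRONYM.finditer(text):
        n = factor * len(m.group())  # e.g. 2 x 3 = 6 words for 'DoS'
        before = WORD.findall(text[:m.start()])
        after = WORD.findall(text[m.end():])
        pre = before[-n:] if n > 0 else []  # last n words before the match
        post = after[:n]                    # first n words after the match
        results.append([" ".join(pre), m.group(), " ".join(post)])
    return results


sample = " lot of text read file ... department of called (DoS) , more texts , more text..."
print(acronym_contexts(sample))
```

Each entry of the returned list is one `[pre, acronym, post]` triple, which covers step 4 of the question. To run it on a file, read the file contents into a string first and pass that string in as `text`.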
