regex - Python: find n words before and after a certain word

Let's say I have a text file that reads like this:

 ... department of called (DOS) , more texts , more text...

While reading the text file, I want to find each acronym, here:

DOS

To find the acronym, I wrote:

    import re
    import numpy

    # open file?

    test_string = " lot of text read file ... department of called (DOS) , more texts , more text..."
    # an uppercase letter, then letters or dots, then an uppercase letter,
    # optionally followed by a single dot
    regex = r'\b[A-Z][a-zA-Z\.]*[A-Z]\b\.?'

    found = re.findall(regex, test_string)
    print(found)

and the output is:

    ['DOS']
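To see what this pattern accepts and rejects, a small check can help. This is only a sketch: it assumes the pattern is meant to be case-sensitive (uppercase first and last letter), and every sample string except the first is made up for illustration.

```python
import re

# Assumed case-sensitive form of the acronym pattern: an uppercase letter,
# then any letters or dots, then an uppercase letter, then an optional dot.
ACRONYM = re.compile(r'\b[A-Z][a-zA-Z\.]*[A-Z]\b\.?')

# Hypothetical sample strings (only the first comes from the question).
samples = [
    "department of called (DOS) , more texts",
    "the U.S.A. and the EU",
    "Department starts with a capital but is not an acronym",
    "a single capital A is too short",
]

# One list of matches per sample string.
results = [ACRONYM.findall(s) for s in samples]
for sample, matches in zip(samples, results):
    print(matches, "<-", sample)
```

A capitalized word such as "Department" is rejected because its last letter is lowercase, and a lone capital letter is rejected because the pattern needs at least two uppercase letters.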

What I want is:

1. While reading the file, find an acronym (here DOS).
2. Calculate the number of characters in the acronym found (here 3, for DOS).
3. Find twice that many words (here 2 x 3 = 6) before and after 'DOS'. Here that would be:
   3.1 pre = department of called
   3.2 acronym = DOS
   3.3 post = , more texts , more
4. Put these three (pre, acronym, post) in an array.

Any help is appreciated, since I am new to Python.

I'm not sure if this is the best solution, but maybe it's enough for you.

    import re
    import numpy

    # open file?

    test_string = " lot of text read file ... department of called (DOS) , more texts , more text..."
    regex_acronym = r'\b[A-Z][a-zA-Z\.]*[A-Z]\b\.?'

    ra = re.compile(regex_acronym)
    for m in ra.finditer(test_string):
        print(m.start(), m.group(), m.span())
        # number of context words on each side: twice the acronym length
        n = len(m.group()) * 2
        # up to n words before, the acronym itself, up to n words after
        regex_pre_post = r"((?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,%d})(" % n
        regex_pre_post += regex_acronym
        regex_pre_post += ")((?:[^a-zA-Z'-]+[a-zA-Z'-]+){0,%d})" % n
        found = re.findall(regex_pre_post, test_string)
        print(found)

        found = found[0]  # there is only a single match in this example
        pre = found[0]
        acro = found[1]
        post = found[2]
        print(pre, acro, post)

This will give you:

    49 DOS (49, 52)
    [('text read file ... department of called (', 'DOS', ') , more texts , more text')]
    text read file ... department of called ( DOS ) , more texts , more text
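One thing to watch: the loop above runs `re.findall` over the whole string on every iteration, so if the text contains several acronyms, `found[0]` is always the context of the first one. Below is a minimal sketch of an alternative that slices the text around each match instead and collects the results in a list, which also covers step 4 of the question. The function names `acronym_contexts` and `acronym_contexts_in_file` and the `path` parameter are hypothetical, and unlike the answer above this version drops the punctuation between words.

```python
import re

# Same acronym pattern as above; WORD reuses the answer's word character class.
ACRONYM = re.compile(r'\b[A-Z][a-zA-Z\.]*[A-Z]\b\.?')
WORD = re.compile(r"[a-zA-Z'-]+")


def acronym_contexts(text):
    """Return a list of (pre, acronym, post) tuples, one per acronym in text."""
    results = []
    for m in ACRONYM.finditer(text):
        n = 2 * len(m.group())  # context size: twice the acronym length
        # Take the words around this particular match only.
        pre_words = WORD.findall(text[:m.start()])[-n:]
        post_words = WORD.findall(text[m.end():])[:n]
        results.append((" ".join(pre_words), m.group(), " ".join(post_words)))
    return results


def acronym_contexts_in_file(path):
    """Hypothetical helper: read a whole text file and extract the contexts."""
    with open(path, encoding="utf-8") as fh:
        return acronym_contexts(fh.read())


test_string = " lot of text read file ... department of called (DOS) , more texts , more text..."
contexts = acronym_contexts(test_string)
print(contexts)
# [('text read file department of called', 'DOS', 'more texts more text')]
```

If a NumPy array is really needed, the list of tuples can be passed to `numpy.array(contexts)`, which gives a two-dimensional array of strings with one row per acronym.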
