Python re.findall does not find all matches
    import re

    content='<tr><td style="text-align:center;" height="30">12090043</td>'+\
            '<td style="text-align:left;">coursea</td>'+\
            '<td style="text-align:center;">3</td>'+\
            '<td style="text-align:left;">86</td><td>2013-summer</td></tr>'+\
            '<tr><td style="text-align:center;" height="30">10420844</td>'+\
            '<td style="text-align:left;">courseb</td>'+\
            '<td style="text-align:center;">4</td>'+\
            '<td style="text-align:left;">98</td><td>2013-autumn</td></tr>'
    pattern=re.compile('<tr>.*"30">(.*)</td>.*"text-align:left;">(.*)</td>.*"text-align:center;">(.*)</td>.*"text-align:left;">(.*)</td><td>(.*)</td></tr>')
    items=re.findall(pattern,content)
    print(items)
The output is:
[('10420844', 'courseb', '4', '98', '2013-autumn')]
But the expected result is:
[('12090043', 'coursea', '3', '86', '2013-summer'), ('10420844', 'courseb', '4', '98', '2013-autumn')]
Actually, the code returns only the last match if there are 2 or more matches. Can anyone tell me why this is happening? Sorry for the long code, and thanks in advance!
You can use BeautifulSoup, as below:
    >>> from bs4 import BeautifulSoup
    >>> content = """
    ... <tr>
    ... <td style="text-align:center;" height="30">12090043</td>
    ... <td style="text-align:left;">coursea</td>
    ... <td style="text-align:center;">3</td>
    ... <td style="text-align:left;">86</td><td>2013-summer</td>
    ... </tr>
    ...
    ... <tr>
    ... <td style="text-align:center;" height="30">10420844</td>
    ... <td style="text-align:left;">courseb</td>
    ... <td style="text-align:center;">4</td>
    ... <td style="text-align:left;">98</td><td>2013-autumn</td>
    ... </tr>
    ... """
    >>>
    >>> soup = BeautifulSoup(content, "html.parser")
    >>> [i.get_text(' ').split() for i in soup.find_all('tr')]
    [['12090043', 'coursea', '3', '86', '2013-summer'], ['10420844', 'courseb', '4', '98', '2013-autumn']]
Regex isn't the correct tool for parsing HTML. Don't try to debug this code; instead, drop it entirely and use an HTML parser, as in the example above (BeautifulSoup).
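For readers who still want to know why only the last row came back: `.*` is greedy, so the first `.*` after `<tr>` runs as far as it can and only backs up to the last `"30">` in the string, which leaves a single match covering both rows. The sketch below is only an illustration of that behavior, not a recommendation to keep the regex. It assumes the same two rows as in the question, and the names `greedy`, `lazy`, `greedy_items`, and `lazy_items` are made up for this example.

```python
import re

# The same two table rows as in the question, joined into one string.
content = (
    '<tr><td style="text-align:center;" height="30">12090043</td>'
    '<td style="text-align:left;">coursea</td>'
    '<td style="text-align:center;">3</td>'
    '<td style="text-align:left;">86</td><td>2013-summer</td></tr>'
    '<tr><td style="text-align:center;" height="30">10420844</td>'
    '<td style="text-align:left;">courseb</td>'
    '<td style="text-align:center;">4</td>'
    '<td style="text-align:left;">98</td><td>2013-autumn</td></tr>'
)

# Original pattern: every ".*" is greedy and matches as much as possible.
greedy = re.compile(
    r'<tr>.*"30">(.*)</td>.*"text-align:left;">(.*)</td>'
    r'.*"text-align:center;">(.*)</td>.*"text-align:left;">(.*)</td>'
    r'<td>(.*)</td></tr>'
)

# Same pattern with ".*?" (non-greedy): each piece stops at the first
# position where the rest of the pattern can still match.
lazy = re.compile(
    r'<tr>.*?"30">(.*?)</td>.*?"text-align:left;">(.*?)</td>'
    r'.*?"text-align:center;">(.*?)</td>.*?"text-align:left;">(.*?)</td>'
    r'<td>(.*?)</td></tr>'
)

greedy_items = greedy.findall(content)  # one match spanning both rows
lazy_items = lazy.findall(content)      # one match per row

print(greedy_items)
print(lazy_items)
```

The non-greedy version happens to work on this exact input, but it still depends on attribute order, quoting, and whitespace staying the same, which is why an HTML parser remains the more robust choice.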