python - HTML parsing with lxml - how to keep empty content in resulting list?
I am using lxml to parse an HTML file:
from lxml import html
tree = html.parse(myfile)
data = tree.xpath('//p/text()')
I have 300 <p>text</p> tags in the HTML file, but len(data) is 250 because some of them are empty (<p></p>). I want these included in data as either 'nan' or ''.
Any suggestions on how to do this?
//p/text() finds only the text nodes of p elements, so a p element with no text contributes nothing to the result. Instead, find the p elements themselves and call .text_content() on each one:
data = [p.text_content() for p in tree.xpath('//p')]
To demonstrate the difference:
>>> from lxml import html
>>>
>>> data = """
... <p>text1</p>
... <p></p>
... <p>text2</p>
... """
>>>
>>> tree = html.fromstring(data)
>>> data = tree.xpath('//p/text()')
>>> len(data)
2
>>>
>>> data = [p.text_content() for p in tree.xpath('//p')]
>>> len(data)
3
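Since the question also mentions 'nan' as a possible filler, here is a minimal sketch of how the same list comprehension could substitute a placeholder for empty paragraphs. The helper name paragraph_texts and its placeholder parameter are made up for illustration, and the .strip() call (which treats whitespace-only paragraphs as empty) is an added assumption, not something the question asked for.

```python
from lxml import html


def paragraph_texts(markup, placeholder="nan"):
    """Return one string per <p> element, in document order.

    Hypothetical helper: empty (or whitespace-only) paragraphs are
    replaced by `placeholder` instead of being dropped.
    """
    tree = html.fromstring(markup)
    # text_content() returns '' for an empty element, so `or` falls
    # through to the placeholder; strip() is an assumption that
    # whitespace-only paragraphs should count as empty too.
    return [p.text_content().strip() or placeholder for p in tree.xpath('//p')]


sample = """
<p>text1</p>
<p></p>
<p>text2</p>
"""

result = paragraph_texts(sample)
print(result)
```

Passing placeholder='' keeps the empty strings instead, which corresponds to the other option mentioned in the question.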