python - HTML parsing with lxml - how to keep empty content in the resulting list?


I am using lxml to parse an HTML file:

    from lxml import html

    tree = html.parse(myfile)
    data = tree.xpath('//p/text()')

I have 300 <p>text</p> tags in the HTML file, but len(data) is 250 because some of them are empty (<p></p>). I want these included in data as either 'nan' or ''.

Any suggestions on how to do this?

//p/text() selects the text nodes of p elements. An empty <p></p> has no text node, so it contributes nothing to the result.

Instead, find the p elements and call .text_content() on each one:

    data = [p.text_content() for p in tree.xpath('//p')]
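For the 'nan' placeholder mentioned in the question, here is a minimal sketch. The sample markup, the helper name paragraph_texts and its placeholder parameter are assumptions made for illustration, not part of lxml. It relies on .text_content() returning an empty string for an empty paragraph, which is falsy in Python.

```python
from lxml import html

# Hypothetical sample markup standing in for the asker's file.
SAMPLE = "<div><p>text1</p><p></p><p>text2</p></div>"


def paragraph_texts(root, placeholder=""):
    """Return one string per <p>, substituting `placeholder` for empty ones."""
    # text_content() gives '' for an empty <p>, so `or` swaps in the placeholder.
    return [p.text_content() or placeholder for p in root.xpath("//p")]


root = html.fromstring(SAMPLE)
texts = paragraph_texts(root)                         # empty paragraphs stay ''
nan_texts = paragraph_texts(root, placeholder="nan")  # empty paragraphs become 'nan'
```

Either way the list keeps one entry per p element, so its positions still line up with the paragraphs in the file.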

To demonstrate the difference:

    >>> from lxml import html
    >>>
    >>> data = """
    ... <p>text1</p>
    ... <p></p>
    ... <p>text2</p>
    ... """
    >>>
    >>> tree = html.fromstring(data)
    >>> data = tree.xpath('//p/text()')
    >>> len(data)
    2
    >>>
    >>> data = [p.text_content() for p in tree.xpath('//p')]
    >>> len(data)
    3
