python - HTML parsing with lxml - how to keep empty content in resulting list?

I am using lxml to parse an HTML file:

from lxml import html
tree = html.parse(myfile)
data = tree.xpath('//p/text()')

I have 300 <p>text</p> tags in the HTML file, but len(data) is 250 because some of them are empty <p></p> tags. I want these included in data as either 'nan' or ''.

Any suggestions on how to do this?

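//p/text() only finds the text nodes of p elements that have non-empty text. An empty <p></p> has no text node at all, so it contributes nothing to the result.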
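A quick way to see this, as a minimal sketch assuming lxml is installed (the sample markup and variable names are made up for illustration): evaluating text() relative to each p separately returns an empty list for the empty paragraph, and its .text attribute is None.

```python
from lxml import html

# Hypothetical sample markup: three paragraphs, the middle one empty.
doc = html.fromstring("<div><p>text1</p><p></p><p>text2</p></div>")

# Text nodes found under each <p>, evaluated one paragraph at a time.
per_paragraph = [p.xpath("text()") for p in doc.xpath("//p")]

# The .text attribute of each <p>; lxml gives None when there is no text.
raw_text = [p.text for p in doc.xpath("//p")]

print(per_paragraph)
print(raw_text)
```

The empty paragraph is still matched by //p; it is only the text() step that has nothing to select for it.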

Instead, find all the p elements and call .text_content() on each one:

data = [p.text_content() for p in tree.xpath('//p')]
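If you would rather have 'nan' than '' for the empty paragraphs, something along these lines might work. This is only a sketch: paragraph_texts and its placeholder parameter are hypothetical names, not part of lxml, and stripping whitespace is an assumption about what should count as "empty".

```python
from lxml import html


def paragraph_texts(root, placeholder=""):
    """Return one string per <p>, using `placeholder` for empty paragraphs.

    `paragraph_texts` is a hypothetical helper, not an lxml function.
    Whitespace-only paragraphs are treated as empty (an assumption).
    """
    return [p.text_content().strip() or placeholder for p in root.xpath("//p")]


# Hypothetical sample markup: the middle paragraph is empty.
sample = html.fromstring("<div><p>text1</p><p></p><p>text2</p></div>")

print(paragraph_texts(sample))          # empty paragraphs as ''
print(paragraph_texts(sample, "nan"))   # empty paragraphs as 'nan'
```

The result keeps one entry per p element, so its length matches the number of paragraphs in the file.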

To demonstrate the difference:

>>> from lxml import html
>>>
>>> data = """
... <p>text1</p>
... <p></p>
... <p>text2</p>
... """
>>>
>>> tree = html.fromstring(data)
>>> data = tree.xpath('//p/text()')
>>> len(data)
2
>>>
>>> data = [p.text_content() for p in tree.xpath('//p')]
>>> len(data)
3
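One thing to be aware of: the two approaches also differ when a paragraph contains nested tags. .text_content() includes the text of child elements, while //p/text() returns only the text nodes that sit directly under the p, possibly several per paragraph. A small sketch with made-up markup:

```python
from lxml import html

# Hypothetical sample markup: one paragraph with a nested <b>, one empty.
doc = html.fromstring("<div><p>Hello <b>world</b>!</p><p></p></div>")

# Only text nodes that are direct children of a <p>.
direct_text = doc.xpath("//p/text()")

# All text inside each <p>, including descendants; '' when empty.
full_text = [p.text_content() for p in doc.xpath("//p")]

print(direct_text)
print(full_text)
```

So if your paragraphs contain inline markup such as links or bold text, the list built with .text_content() has exactly one entry per paragraph, which is usually what you want here.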


Ben
