python - Why does Scrapy Selector fail to parse all the tags?


Hi there, and thanks for looking at my question.

The problem I have is that the Scrapy selector does not seem to parse the site's tags correctly.

    ipdb> pp re.findall("meta.*",response.body)
    ['meta name="verify-v1" content="c4vnwz0wndkra4axtdz9iegotdhnazsnf0rvwxat9em=">\r',
     'meta http-equiv="content-type" content="text/html" charset="utf-8" />\r',
     'meta http-equiv="x-ua-compatible" content="ie=edge" />\r',
     'meta name="wt.cg_n" content="part search" />\r',
     'meta name="wt.cg_s" content="part detail" />\r',
     'meta name="wt.ti" content="part detail" />\r',
     'meta name="wt.z_page_type" content="ps" />\r',
     'meta name="wt.z_page_sub_type" content="pd" />\r',
     'meta name="wt.z_page_id" content="pd" />\r',
     'meta name="wt.pn_sku" content=481-2&quot;x36yd-nd />\r',
     'meta name="wt.z_part_id" content=1819153 />\r',
     'meta name="wt.tx_e" content="v" />\r',
     'meta name="wt.tx_u" content="1" />\r',
     'meta name="wt.z_supplier_id" content=19 />\r',
     'meta itemprop="productid" content="sku:481-2&quot;x36yd-nd" />\r',
     'meta itemprop="name" content="481-2&quot;x36yd" />\r']
    ipdb> pp response.xpath("//meta")
    [<Selector xpath='//meta' data=u'<meta name="verify-v1" content="c4vnwz0w'>,
     <Selector xpath='//meta' data=u'<meta http-equiv="content-type" content='>,
     <Selector xpath='//meta' data=u'<meta http-equiv="x-ua-compatible" conte'>,
     <Selector xpath='//meta' data=u'<meta name="description" content=\'find 3'>]

I can't figure out why this is happening. Why aren't the other tags parsed if they exist on the site?

Thanks.

I've found that BeautifulSoup with the built-in html.parser handles this particular markup better:

    $ scrapy shell https://www.digikey.com/product-detail/en/481-2%22x36yd/481-2%22x36yd-nd/1819153
    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(response.body, "html.parser")
    >>>
    >>> from pprint import pprint
    >>> pprint([meta["content"] for meta in soup.find_all("meta")])
    [u'c4vnwz0wndkra4axtdz9iegotdhnazsnf0rvwxat9em=',
     u'text/html',
     u'ie=edge',
     u'find 3m 481-2"\x00x36yd (481-2"\x00x36yd-nd) @ digikey.  check stock , pricing, view product specifications, , order online.',
     u'481-2"\x00x36yd, 3m, tape',
     u'digi-key search engine',
     u'part search',
     u'part detail',
     u'part detail',
     u'ps',
     u'pd',
     u'pd',
     u'481-2"x36yd-nd',
     u'1819153',
     u'v',
     u'1',
     u'19',
     u'sku:481-2"x36yd-nd',
     u'481-2"x36yd']

What you can do in your Scrapy project is pass response.body through BeautifulSoup's HTML parser in a middleware, "fixing" the broken HTML with BeautifulSoup. This would not require any changes to the spider you have. Here is a sample middleware implementation:
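A minimal sketch of such a downloader middleware could look like the following. The class name `BeautifulSoupMiddleware` is hypothetical, and the sketch assumes `bs4` is installed; it relies only on Scrapy's `process_response` hook and on `response.replace()`. Re-encoding with the response's own encoding is an assumption made here so that the declared and actual encodings stay in sync.

```python
from bs4 import BeautifulSoup


class BeautifulSoupMiddleware(object):
    """Hypothetical downloader middleware: re-serializes HTML via BeautifulSoup."""

    def __init__(self, parser="html.parser"):
        # "html.parser" is the lenient built-in parser used in the shell session
        self.parser = parser

    def process_response(self, request, response, spider):
        # Plain (non-text) responses have no encoding attribute; pass them through
        encoding = getattr(response, "encoding", None)
        if encoding is None:
            return response

        # Parse the raw bytes leniently, then serialize the tree back to bytes
        soup = BeautifulSoup(response.body, self.parser, from_encoding=encoding)
        fixed_body = soup.encode(encoding)

        # Hand Scrapy a copy of the response carrying the cleaned-up markup
        return response.replace(body=fixed_body)
```

To try it, the class would be registered under `DOWNLOADER_MIDDLEWARES` in the project's `settings.py`, for example as `"myproject.middlewares.BeautifulSoupMiddleware": 543`, where both the module path and the priority are placeholders. Because the rewrite happens before the response reaches the spider, `response.xpath(...)` calls in the spider stay as they are.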

