python - Why does Scrapy Selector fail to parse all the tags?
Hi there, thanks for looking at this question.
The problem I have is that the Scrapy Selector does not seem to parse all the tags of the site correctly.
    ipdb> pp re.findall("meta.*",response.body)
    ['meta name="verify-v1" content="c4vnwz0wndkra4axtdz9iegotdhnazsnf0rvwxat9em=">\r',
     'meta http-equiv="content-type" content="text/html" charset="utf-8" />\r',
     'meta http-equiv="x-ua-compatible" content="ie=edge" />\r',
     'meta name="wt.cg_n" content="part search" />\r',
     'meta name="wt.cg_s" content="part detail" />\r',
     'meta name="wt.ti" content="part detail" />\r',
     'meta name="wt.z_page_type" content="ps" />\r',
     'meta name="wt.z_page_sub_type" content="pd" />\r',
     'meta name="wt.z_page_id" content="pd" />\r',
     'meta name="wt.pn_sku" content=481-2"x36yd-nd />\r',
     'meta name="wt.z_part_id" content=1819153 />\r',
     'meta name="wt.tx_e" content="v" />\r',
     'meta name="wt.tx_u" content="1" />\r',
     'meta name="wt.z_supplier_id" content=19 />\r',
     'meta itemprop="productid" content="sku:481-2"x36yd-nd" />\r',
     'meta itemprop="name" content="481-2"x36yd" />\r']
    ipdb> pp response.xpath("//meta")
    [<Selector xpath='//meta' data=u'<meta name="verify-v1" content="c4vnwz0w'>,
     <Selector xpath='//meta' data=u'<meta http-equiv="content-type" content='>,
     <Selector xpath='//meta' data=u'<meta http-equiv="x-ua-compatible" conte'>,
     <Selector xpath='//meta' data=u'<meta name="description" content=\'find 3'>]
    ipdb>

I can't figure out why this is happening. The regular expression finds 16 meta tags in response.body, but the selector returns only 4. Why are the other tags not parsed if they exist on the site?
Thanks.
I've found that BeautifulSoup with the built-in html.parser handles this particular markup better:
    $ scrapy shell https://www.digikey.com/product-detail/en/481-2%22x36yd/481-2%22x36yd-nd/1819153
    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(response.body, "html.parser")
    >>>
    >>> from pprint import pprint
    >>> pprint([meta["content"] for meta in soup.find_all("meta")])
    [u'c4vnwz0wndkra4axtdz9iegotdhnazsnf0rvwxat9em=',
     u'text/html',
     u'ie=edge',
     u'find 3m 481-2"\x00x36yd (481-2"\x00x36yd-nd) @ digikey. check stock , pricing, view product specifications, , order online.',
     u'481-2"\x00x36yd, 3m, tape',
     u'digi-key search engine',
     u'part search',
     u'part detail',
     u'part detail',
     u'ps',
     u'pd',
     u'pd',
     u'481-2"x36yd-nd',
     u'1819153',
     u'v',
     u'1',
     u'19',
     u'sku:481-2"x36yd-nd',
     u'481-2"x36yd']

What you can do in your Scrapy project is pass response.body through the BeautifulSoup HTML parser in a middleware, "fixing" the broken HTML with BeautifulSoup. This would not require any changes to the spider you have. Here is a sample middleware implementation:
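The sample itself is not reproduced here. As a rough sketch of the idea, and not the answerer's own code, a downloader middleware along these lines could re-serialize each HTML response through html.parser before the spider sees it. The class name BeautifulSoupRepairMiddleware is made up for illustration, and the sketch assumes a Scrapy version where HTML responses expose response.encoding and response.replace():

```python
from bs4 import BeautifulSoup


class BeautifulSoupRepairMiddleware:
    """Re-serialize HTML responses through BeautifulSoup's html.parser.

    Hypothetical class name; an illustrative sketch only.
    """

    def process_response(self, request, response, spider):
        # Skip anything the server did not label as HTML (JSON, images, ...).
        content_type = response.headers.get("Content-Type") or b""
        if b"html" not in content_type.lower():
            return response

        # Parse with the lenient built-in parser, reusing the encoding
        # Scrapy already detected for this response.
        soup = BeautifulSoup(
            response.body, "html.parser", from_encoding=response.encoding
        )

        # Serialize back to bytes in the same encoding, so the body and
        # the encoding carried over by replace() stay consistent.
        return response.replace(body=soup.encode(response.encoding))
```

To try it, the class would be registered under DOWNLOADER_MIDDLEWARES in settings.py, for example as "myproject.middlewares.BeautifulSoupRepairMiddleware": 543 (the module path is an assumption about your project layout). A priority below 590 should place it after the built-in HttpCompressionMiddleware on the response path, so the body it receives is already decompressed. Every HTML page is then parsed twice, which costs some CPU, and whether the selector recovers every tag afterwards still depends on the markup, so it is worth checking response.xpath("//meta") in scrapy shell with the middleware enabled.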