python - Scrapy: Stop previous parse function on condition
I have a specific situation in a scraper I am developing right now. The first function, parse_posts_pages, iterates through the pages of a specific forum thread and, for each page, calls the second function, parse_posts.
    def parse_posts_pages(self, response):
        thread_id = response.meta['thread_id']
        thread_link = response.meta['thread_link']
        thread_name = response.meta['thread_name']
        # Numbers found in the "postpagestats_above" element
        stats = response.xpath('//*[@id="postpagestats_above"]/text()').re(r'(\d+)')
        if len(stats) == 3:
            posts_per_page = int(stats[1])
            total_posts = int(stats[2])
            if posts_per_page > 0:
                post_mod = total_posts % posts_per_page
                pages = total_posts // posts_per_page  # integer division
                if post_mod > 0:
                    pages += 1
            else:
                pages = 1
            # Walk the pages from the last one down to the first
            for page in range(pages, 0, -1):
                cur_page = '' if page == 1 else '/page' + str(page)
                post_page_link = thread_link + cur_page
                yield scrapy.Request(post_page_link, self.parse_posts,
                                     meta={'thread_id': thread_id, 'thread_name': thread_name})

    def parse_posts(self, response):
        global maxpostidbythread, executefullspider
        thread_id = response.meta['thread_id']
        thread_name = response.meta['thread_name']
        for post in response.xpath('//*[@id="posts"]/li'):
            post_id = post.xpath('@id').re(r'(\d.*)')[0]
            if not executefullspider and post_id in maxpostidbythread:
                break  # <- this break also needs to cancel the loop in parse_posts_pages
            ...

In the second function there is an if condition. When that condition resolves to true, I need to break the current loop and also the loop in parse_posts_pages, because there is no need to continue the pagination.
Is there a way to stop the loop in the first function from the second function?
Just raise CloseSpider, as described in the manual:
How can I instruct a spider to stop itself?
Raise the CloseSpider exception from a callback.
    from scrapy.exceptions import CloseSpider

    def parse_page(self, response):
        # On Python 3, response.body is bytes; use response.text to compare against a str
        if 'Bandwidth exceeded' in response.body:
            raise CloseSpider('bandwidth_exceeded')

http://doc.scrapy.org/en/latest/faq.html#how-can-i-instruct-a-spider-to-stop-itself
http://doc.scrapy.org/en/latest/topics/exceptions.html#scrapy.exceptions.CloseSpider
Note that requests still in progress (HTTP request sent, response not yet received) will still be parsed. No new requests will be processed, though.
https://stackoverflow.com/a/23895143/5041915
Update: I found out something interesting about stopping the spider in the main function.
It may happen that a new valid worker does not have time to start, because raising the exception takes effect faster.
I suggest checking the condition in the callback function and raising the exception there as soon as possible.
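Applied to the question, the check could be sketched roughly as below. This is only an illustration under assumptions: the helper name stop_if_post_known, the reason string reached_known_post and the simplified parse_posts signature (plain post ids instead of a Scrapy response) are made up here and are not part of Scrapy or of the original spider. The fallback class exists only so the sketch runs without Scrapy installed.

```python
try:
    from scrapy.exceptions import CloseSpider
except ImportError:
    # Stand-in so this sketch can run without Scrapy installed.
    class CloseSpider(Exception):
        def __init__(self, reason="cancelled"):
            super().__init__()
            self.reason = reason


def stop_if_post_known(post_id, known_post_ids, full_run):
    """Raise CloseSpider when an incremental run reaches an already-known post."""
    if not full_run and post_id in known_post_ids:
        # Hypothetical reason string, chosen for this example only.
        raise CloseSpider("reached_known_post")


def parse_posts(post_ids, known_post_ids, full_run=False):
    """Simplified stand-in for the callback: yield post ids until a known one appears."""
    for post_id in post_ids:
        # Check first, so the exception is raised as early as possible.
        stop_if_post_known(post_id, known_post_ids, full_run)
        yield post_id
```

In a real spider the same check would sit at the top of the loop over response.xpath('//*[@id="posts"]/li') in parse_posts, replacing the break. Responses that are already in flight may still reach the callback, so the condition should stay cheap to evaluate.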