python - Scrapy: Stop previous parse function on condition

I have a specific situation with a scraper I am developing right now. The first function, parse_posts_pages, iterates through the pages of a specific forum thread and, for each page, calls the second function, parse_posts.

def parse_posts_pages(self, response):
    thread_id = response.meta['thread_id']
    thread_link = response.meta['thread_link']
    thread_name = response.meta['thread_name']
    pages = 1  # default when the stats line is missing or malformed
    # The stats line is expected to contain exactly three numbers.
    stats = response.xpath('//*[@id="postpagestats_above"]/text()').re(r'(\d+)')
    if len(stats) == 3:
        posts_per_page = int(stats[1])
        total_posts = int(stats[2])
        if posts_per_page > 0:
            pages = total_posts // posts_per_page
            if total_posts % posts_per_page > 0:
                pages += 1

    # Walk from the last page down to the first.
    for page in range(pages, 0, -1):
        cur_page = '' if page == 1 else '/page' + str(page)
        post_page_link = thread_link + cur_page
        # yield (not return), so that every page is requested
        yield scrapy.Request(post_page_link, self.parse_posts, meta={'thread_id': thread_id, 'thread_name': thread_name})


def parse_posts(self, response):
    global maxpostidbythread, executefullspider
    thread_id = response.meta['thread_id']
    thread_name = response.meta['thread_name']
    for post in response.xpath('//*[@id="posts"]/li'):
        post_id = post.xpath('@id').re(r'(\d.*)')[0]
        if not executefullspider and post_id in maxpostidbythread:
            break  # <- this break also needs to cancel the loop in parse_posts_pages
        ...
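To make the pagination arithmetic above easier to check, here is a minimal sketch that pulls it out into plain functions. The names count_pages and page_links and the example URL are hypothetical and not part of the original spider; the logic is only meant to mirror the page count and link building in parse_posts_pages.

```python
def count_pages(total_posts, posts_per_page):
    """Pages needed for total_posts; falls back to 1 if posts_per_page is not positive."""
    if posts_per_page <= 0:
        return 1
    pages, remainder = divmod(total_posts, posts_per_page)
    # A partially filled last page still counts as a page.
    return pages + 1 if remainder else pages


def page_links(thread_link, pages):
    """Page URLs from the last page down to the first (page 1 has no suffix)."""
    return [
        thread_link if page == 1 else thread_link + '/page' + str(page)
        for page in range(pages, 0, -1)
    ]


# Hypothetical thread URL, used only for illustration.
print(count_pages(45, 20))  # 3
print(page_links('http://forum.example/t/42', 3))
# ['http://forum.example/t/42/page3', 'http://forum.example/t/42/page2', 'http://forum.example/t/42']
```

With 45 posts at 20 per page, the newest posts sit on page 3, which is why the loop starts from the last page.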

In the second function there is an if condition. When that condition resolves to true, I need to break the current loop and also the loop in parse_posts_pages, since there is no need to continue the pagination.

Is there a way to stop the loop in the first function from the second function?

Just raise CloseSpider, as described in the manual:

How can I instruct a spider to stop itself?

Raise the CloseSpider exception from a callback.

from scrapy.exceptions import CloseSpider

def parse_page(self, response):
    if 'Bandwidth exceeded' in response.body:
        raise CloseSpider('bandwidth_exceeded')

http://doc.scrapy.org/en/latest/faq.html#how-can-i-instruct-a-spider-to-stop-itself
http://doc.scrapy.org/en/latest/topics/exceptions.html#scrapy.exceptions.CloseSpider

Note that requests still in progress (HTTP request sent, response not yet received) will still be parsed. No new requests will be processed, though.

https://stackoverflow.com/a/23895143/5041915

Update: I found out something interesting that happens if you stop the spider in the main function.

It may happen that a new valid worker does not have time to start, because raising the exception works faster.

I suggest checking the condition in the callback function and raising the exception there if possible.
