Scrapy: Crawl Information from Another Site

source:https://github.com/martin-varbanov96/scraper/blob/master/logo_scrape/logo_scrape/spiders/test_spider.py

I want to scrape data from sites x1, ..., xn, and for every site x I scrape 10 inner pages.

I parse my sites one by one:

for el in get_data():
    yield scrapy.Request(el, self.parse)

Then, for each site x, I set some attributes on the spider:

    self.site_id += 1
    self.link_id = response.url
    self.status = -9999
    self.current_link = ""
    self.link_img = ""
    self.pattern_id = -9999
    self.found_image = False

After some checks, I hand off to another parse method for the 10 inner pages:

yield scrapy.Request(link_to_next_page, self.parse_inner_page)

def parse_inner_page(self, response):
    ...
    yield LogoScrapeItem(site_id=self.site_id,
                         link_id=self.link_id,
                         status=self.status,
                         current_link=self.current_link,
                         pattern_id=self.pattern_id,
                         found_img=self.found_image,
                         link_img=self.link_img)

I thought the process was straightforward: we get site x, scrape its 10 inner pages, then proceed with the next x, and so on.
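
Condensed, the whole spider boils down to something like this (a sketch: get_inner_links stands in for my link-extraction checks, and most of the fields are omitted):

import scrapy
from logo_scrape.items import LogoScrapeItem

class TestSpider(scrapy.Spider):
    name = "test_spider"
    site_id = 0

    def parse(self, response):
        # all per-site state lives on the spider instance,
        # shared between every in-flight request
        self.site_id += 1
        self.link_id = response.url
        for link_to_next_page in self.get_inner_links(response):
            yield scrapy.Request(link_to_next_page, self.parse_inner_page)

    def parse_inner_page(self, response):
        # builds the item from the shared spider attributes
        yield LogoScrapeItem(site_id=self.site_id,
                             link_id=self.link_id)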

QUESTION: Why is my data mixed up? A data entry might have a link from one site and an image from another.

Possible solutions:

Maybe it's Scrapy's asynchronous scheduling:

This is what I expect:

parse_site -> parse_page_within -> parse_page_within -> parse_page_within -> parse_page_within -> parse_page_within

This is what may be the reality:

parse_site -> parse_site -> parse_site -> parse_page_within -> parse_page_within -> parse_page_within

Another possible solution:

Maybe if I could send some data to the other parse method through the callback, I would not have to rely on the current state of the class variables.

How do I scrape my pages one by one? If this is my problem, how do I disable the asynchronous behaviour; if not, what should I do?

How can I debug Scrapy to see, step by step, which parse function is executed after which?

Answer

For debugging purposes you can create a main.py with the following content in your scraper's folder:

from scrapy import cmdline
cmdline.execute("scrapy crawl nameofyourspider".split())

then execute it using the debug/breakpoint mechanism of your IDE.
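
If you just want to see the order in which the callbacks fire, you can also log at the top of each parse method; every spider has a built-in self.logger (a sketch, using the method names from your question):

import scrapy

class TestSpider(scrapy.Spider):
    name = "test_spider"

    def parse(self, response):
        self.logger.info("parse: %s", response.url)
        # ... schedule the inner page requests here ...

    def parse_inner_page(self, response):
        self.logger.info("parse_inner_page: %s", response.url)
        # ... yield the items here ...

With the default settings Scrapy keeps up to 16 requests in flight at once (CONCURRENT_REQUESTS), so the interleaving you suspect will be plainly visible in the log.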

For the main thing, putting the scraped data together in the right manner, I suggest you use the Request.meta attribute to pass values along with your requests.

From the official guide:

In some cases you may be interested in passing arguments to those callback functions so you can receive the arguments later, in the second callback. You can use the Request.meta attribute for that.

Here’s an example of how to pass an item using this mechanism, to populate different fields from different pages:

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    yield request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    yield item

With this, you can easily associate the outer pages with the inner pages even in asynchronous mode, making it easy to export the results the right way.
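
Applied to your spider, the same pattern could look roughly like this. This is only a sketch: get_inner_links is a hypothetical helper standing in for your link checks, and I am guessing the import path of LogoScrapeItem:

import itertools

import scrapy
from logo_scrape.items import LogoScrapeItem

class TestSpider(scrapy.Spider):
    name = "test_spider"
    site_counter = itertools.count(1)  # replaces the shared self.site_id

    def parse(self, response):
        # build the item here, so its state travels with the request
        # instead of living on the spider instance
        item = LogoScrapeItem(site_id=next(self.site_counter),
                              link_id=response.url)
        for link_to_next_page in self.get_inner_links(response):
            request = scrapy.Request(link_to_next_page,
                                     callback=self.parse_inner_page)
            request.meta['item'] = item.copy()  # one independent copy per inner page
            yield request

    def parse_inner_page(self, response):
        item = response.meta['item']
        item['current_link'] = response.url
        # fill in status, pattern_id, found_img, link_img from this page here
        yield item

On recent Scrapy versions you can pass the item through the cb_kwargs argument of Request instead of meta; the idea is the same. And if you really want the requests processed strictly one at a time, you can set CONCURRENT_REQUESTS = 1 in settings.py, but passing the state along with the request is usually the better fix.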

source: stackoverflow.com