Scrapy: crawled data gets mixed up across sites
I want to scrape data from sites x1,...,xn, and for every site x I scrape 10 inner pages.
I parse my sites one by one:
```python
for el in get_data():
    yield scrapy.Request(el, self.parse)
```
then for each site x I set some properties:
```python
self.site_id += 1
self.link_id = response.url
self.status = -9999
self.current_link = ""
self.link_img = ""
self.pattern_id = -9999
self.found_image = False
```
then, after some checks, hand off to another parser for the 10 inner pages:

```python
yield scrapy.Request(link_to_next_page, self.parse_inner_page)

def parse_inner_page(self, response):
    ...
    yield LogoScrapeItem(site_id=self.site_id,
                         link_id=self.link_id,
                         status=self.status,
                         current_link=self.current_link,
                         pattern_id=self.pattern_id,
                         found_img=self.found_img,
                         link_img=self.link_img)
```
I thought the process was straightforward: we get site x, for this site x we scrape 10 inner pages, then we proceed with the next x, and so on.
QUESTION: Why is my data mixed up? A data entry might have a link from one site and an image from another.
Maybe it's the asynchronous nature of Scrapy:
This is what I expect:

```
parse_site -> parse_page_within -> parse_page_within -> parse_page_within -> parse_page_within -> parse_page_within
```

This is what may be the reality:

```
parse_site -> parse_site -> parse_site -> parse_page_within -> parse_page_within -> parse_page_within
```
Another possible solution:
Maybe if I were able to send some of these variables to the other parse method through the callback, I would not have to rely on the current state of the class variables.
How do I scrape my pages one by one? If the asynchronous scheduling is my problem, how do I turn it off; if not, what should I do?

How can I debug Scrapy to see, step by step, which parse function is executed after which?
For debugging purposes you can create a main.py with the following content in your scraper's folder:
```python
from scrapy import cmdline

cmdline.execute("scrapy crawl nameofyourspider".split())
```
then execute it using debug/breakpoint mechanism in your IDE.
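Besides IDE breakpoints, a lightweight way to see the callback order is to log an entry line at the top of each parse method. A generic sketch of that idea (plain Python decorator, hypothetical names; inside a real spider you could simply call `self.logger.info(...)` instead):

```python
import functools

def trace(func):
    """Print the callback name and its argument on every call."""
    @functools.wraps(func)
    def wrapper(self, response, *args, **kwargs):
        print(f"-> {func.__name__}({response})")
        return func(self, response, *args, **kwargs)
    return wrapper

class DemoSpider:
    @trace
    def parse(self, response):
        return "outer"

    @trace
    def parse_inner_page(self, response):
        return "inner"

s = DemoSpider()
s.parse("http://a.example")             # prints: -> parse(http://a.example)
s.parse_inner_page("http://a.example/1")  # prints: -> parse_inner_page(http://a.example/1)
```

Reading that trace from a real crawl would show directly whether the `parse_site -> parse_site -> parse_inner_page` interleaving from the question is happening.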
For the main issue, putting the scraped data together in the right manner, I suggest you use the Request.meta attribute to pass values along with the requests.
From the official guide:
In some cases you may be interested in passing arguments to those callback functions so you can receive the arguments later, in the second callback. You can use the Request.meta attribute for that.
Here’s an example of how to pass an item using this mechanism, to populate different fields from different pages:
```python
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    yield request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    yield item
```
With this, you can easily associate the outer pages with the inner pages even in asynchronous mode, making it easy to export the results the right way.
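Applied to the spider in the question, the per-site fields would be built as a local dict in `parse` and handed to `parse_inner_page` via `meta`, instead of being stored on `self`. The sketch below uses a minimal stand-in `Request` class so it runs without Scrapy installed; in the real spider the stand-in line is just `scrapy.Request(link, self.parse_inner_page, meta={"site": site})`, and everything else is an assumption about how the question's fields map over:

```python
class FakeRequest:
    """Minimal stand-in for scrapy.Request, just enough to carry meta."""
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}

class Spider:
    site_id = 0

    def parse(self, response_url, inner_links):
        # Build the per-site state locally instead of on self.
        self.site_id += 1
        site = {"site_id": self.site_id, "link_id": response_url,
                "status": -9999, "current_link": "", "link_img": "",
                "pattern_id": -9999, "found_image": False}
        for link in inner_links:
            # Real spider: yield scrapy.Request(link, self.parse_inner_page,
            #                                   meta={"site": site})
            yield FakeRequest(link, self.parse_inner_page, meta={"site": site})

    def parse_inner_page(self, request):
        site = request.meta["site"]  # the state travels with the request
        return {**site, "current_link": request.url}

spider = Spider()
requests = list(spider.parse("http://a.example", ["http://a.example/1"]))
item = spider.parse_inner_page(requests[0])
print(item["link_id"], item["current_link"])  # http://a.example http://a.example/1
```

Because each inner request carries its own copy of the site's fields, it no longer matters in what order the scheduler interleaves the callbacks.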