
Using Scrapy to Extract the Raw HTML Content of a Large(r) Number of Landing Pages


For a classification project I need the raw HTML content of roughly 1000 websites. I only need the landing page and nothing more, so the crawler does not have to follow any links! I want to use Scrapy for this, but I can't get the code together. Because I read in the documentation that JSON files are first stored in memory and then saved (which can cause problems when crawling a large number of pages), I want to save the output in the '.jl' (JSON Lines) format. I use the Anaconda prompt to execute my code.
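(Side note: with the -o command-line flag shown further down, no extra configuration is needed, but newer Scrapy versions (2.1+) also let you configure the JSON Lines export in settings.py via the FEEDS setting. The file name here is just an example.)

    # settings.py -- equivalent to passing "-o dragonball.jl" on the command line
    FEEDS = {
        "dragonball.jl": {"format": "jsonlines", "encoding": "utf8"},
    }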

I want the resulting file to have two columns: one with the domain name and the second with the raw HTML content of each site:

domain, html_raw
 ..., ...
 ..., ...

I found many spider examples but I can't figure out how to put everything together. This is how far I got :(

Start the project:

scrapy startproject dragonball
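For orientation, that command generates the standard Scrapy project skeleton; the spider code below belongs in a file under the spiders/ directory (e.g. dragonball/spiders/dragonball_spider.py, the file name is just an example):

    dragonball/
        scrapy.cfg
        dragonball/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py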

The actual spider (which might be completely wrong):

import scrapy

class DragonSpider(scrapy.Spider):
    name = "dragonball"

    def start_requests(self):
        urls = [
            'https://www.faz.de',
            'https://www.spiegel.de',
            'https://www.stern.de',
            'https://www.brandeins.de',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # ??? -- this is the part I can't figure out
        pass

I navigate to the dragonball folder and run the spider with:

scrapy crawl dragonball -o dragonball.jl
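For reference, a JSON Lines file produced this way holds one standalone JSON object per line, so with the two fields described above each line would look roughly like this (values shortened and purely illustrative):

    {"domain": "www.faz.de", "html_raw": "<!DOCTYPE html> ..."}
    {"domain": "www.spiegel.de", "html_raw": "<!DOCTYPE html> ..."}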

Any help would be appreciated :)


Answer

If you really want to store everything in a single file, then you can use the following (including part of vezunchik's answer):

    # at the top of the spider file:
    from urllib.parse import urlparse

    def parse(self, response):
        yield {
            # netloc is the host part of the URL, e.g. 'www.faz.de'
            'domain': urlparse(response.url).netloc,
            # raw HTML; response.text would decode with the page's declared encoding instead
            'html_raw': response.body.decode('utf-8'),
        }

As mentioned, this isn't a good idea in the long run as you'll end up with a huge file.
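For completeness, a minimal spider that puts the question's skeleton and this parse method together could look roughly like this (the file name is illustrative, and the real URL list would contain all ~1000 domains):

    # dragonball/spiders/dragonball_spider.py  (file name is illustrative)
    from urllib.parse import urlparse

    import scrapy


    class DragonSpider(scrapy.Spider):
        name = "dragonball"

        def start_requests(self):
            urls = [
                'https://www.faz.de',
                'https://www.spiegel.de',
                'https://www.stern.de',
                'https://www.brandeins.de',
            ]
            for url in urls:
                # request only the landing page; no links are followed
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            yield {
                'domain': urlparse(response.url).netloc,
                'html_raw': response.body.decode('utf-8'),
            }

Running scrapy crawl dragonball -o dragonball.jl from the project folder then writes one JSON object per crawled page to dragonball.jl as the items arrive, so the whole result never has to be held in memory at once.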

source: stackoverflow.com