How To Fix These Errors Arising Due To Server Blocking Web Scraping?

I am trying to get the text from a web page using the "get_text" function as described here.

import urllib.request
from inscriptis import get_text

url = "http://www.informationscience.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')

text = get_text(html)

print(text)

This works fine for this particular website, but when I try to scrape another website, I get a 403 error:

import urllib.request
from inscriptis import get_text

url = "https://economictimes.indiatimes.com/markets/stocks/news/birla-group-enters-the-fray-to-acquire-idbi-federal-life/articleshow/64251332.cms"
html = urllib.request.urlopen(url).read().decode('utf-8')

text = get_text(html)

print(text)

This gives the following error in the line html = urllib.request.urlopen(url).read().decode('utf-8'):

HTTPError: HTTP Error 403: Forbidden

I tried to fix it by specifying a user agent as follows:

import urllib.request
from inscriptis import get_text

url = "https://economictimes.indiatimes.com/markets/stocks/news/birla-group-enters-the-fray-to-acquire-idbi-federal-life/articleshow/64251332.cms"
html = urllib.request.urlopen(url, headers={'User-Agent': 'Mozilla/5.0'}).read().decode('utf-8')

text = get_text(html)

print(text)

but I get the following error:

TypeError: urlopen() got an unexpected keyword argument 'headers'

Since the error says that urlopen() does not accept a headers keyword argument, I tried to specify the user agent with the requests module instead:

from inscriptis import get_text
import requests
url = requests.get('https://economictimes.indiatimes.com/markets/stocks/news/birla-group-enters-the-fray-to-acquire-idbi-federal-life/articleshow/64251332.cms', "lxml", headers={'User-Agent': 'Mozilla/5.0'})

print(get_text(url))

But this gives the following error:

AttributeError: 'Response' object has no attribute 'strip'

How do I get this damn server to stop blocking my web crawls please?

Answer

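You need to pass the body of the response (response.text, a string) to get_text, not the Response object itself. The "lxml" argument should also go: the second positional parameter of requests.get is params, so that string ends up appended to the URL as a query string.

import requests
from inscriptis import get_text

response = requests.get('https://economictimes.indiatimes.com/markets/stocks/news/birla-group-enters-the-fray-to-acquire-idbi-federal-life/articleshow/64251332.cms', headers={'User-Agent': 'Mozilla/5.0'})

print(get_text(response.text))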
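
If you would rather stay with urllib, the TypeError from the question goes away when the headers are attached to a urllib.request.Request object, which is then passed to urlopen. Below is a minimal sketch of that approach. The helper names build_request and fetch_html, the 10-second timeout, and the fallback to UTF-8 are my own assumptions, not part of the question, and a browser-like User-Agent may not be enough for every server.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request carrying a User-Agent header (hypothetical helper)."""
    # urlopen() has no headers keyword; headers belong on the Request object.
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, user_agent="Mozilla/5.0", timeout=10):
    """Download a page and return its body as a string (hypothetical helper)."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset declared by the server, assuming UTF-8 if none is given.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Usage with inscriptis, as in the question (performs a network request):
#   from inscriptis import get_text
#   print(get_text(fetch_html(url)))
```

The result of fetch_html is a plain string, so it can be handed to get_text directly, just like response.text in the requests version.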
source: stackoverflow.com