
Grabbing The Content Inside The HTML Content Using Python

This Chinese website mainly presents information about a single company. Because many pages contain similar content, I decided to learn how to write a web crawler in Python.

Basic code

import requests
from bs4 import BeautifulSoup
page = requests.get('http://182.148.109.184/enterprise-info!getCompanyInfo.action?companyid=1000356')

soup = BeautifulSoup(page.text, 'html.parser')
source_content = soup.find(class_='rightSide').find(class_='content register').find(class_='formestyle')

The information I want to collect

The figure was captured from Chrome's Inspect Element panel.

[Figure 1: screenshot of the page in Chrome's Inspect Element panel]

Because the Chinese text may be hard to follow, here is a simplified example for illustration.

<th> the variable name </th> => For example, "company name", "company location"
<td> the target data I want to save </td>

My question

With my basic code, source_content does not contain this information. The output file looks like this:

[Figure 2: screenshot of the output file]

Comparing Figures 1 and 2, we can see that the longitude and latitude information is missing.

How can I get this data with Python? Any advice would be appreciated.


Answer

The information can be obtained if you provide a Referer header in your request as follows:

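import requests
from bs4 import BeautifulSoup

url = 'http://182.148.109.184/enterprise-info!getCompanyInfo.action?companyid=1000356'
# Send the page's own URL as the Referer header.
page = requests.get(url, headers={'Referer': url})
soup = BeautifulSoup(page.text, 'html.parser')

# The company details are in the element with class "formestyle".
table = soup.find(class_='formestyle')

# Print the text of every header and data cell, one table row at a time.
for tr in table.find_all('tr'):
    row = [v.text for v in tr.find_all(['th', 'td'])]
    print(row)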

This would display the following type of data:

['地理坐标:', '经度:104.2153 \xa0\xa0纬度:31.3631']

As you can see, the information is now present.
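If you want to save the values rather than just print them, the rows can be cleaned up further. The following is only a minimal sketch using the sample row shown above: the helper names rows_to_dict and parse_coordinates are hypothetical, and it assumes that the cells in each row alternate between a label and its value, and that the coordinate text always has the form '经度:<number> 纬度:<number>'. Other rows on the real page may need different handling.

```python
import re


def rows_to_dict(rows):
    """Pair label/value cells from rows shaped like [label, value, ...]."""
    record = {}
    for row in rows:
        # Replace non-breaking spaces (\xa0) and trim each cell.
        cells = [cell.replace('\xa0', ' ').strip() for cell in row]
        # Assumption: cells alternate label, value within a row.
        for label, value in zip(cells[0::2], cells[1::2]):
            # Drop a trailing ASCII or full-width colon from the label.
            record[label.rstrip(':：')] = value
    return record


def parse_coordinates(text):
    """Return (longitude, latitude) as floats, or None if not found."""
    number = r'(-?\d+(?:\.\d+)?)'
    pattern = r'经度[:：]\s*' + number + r'\s*纬度[:：]\s*' + number
    match = re.search(pattern, text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


# Sample row copied from the output above.
rows = [['地理坐标:', '经度:104.2153 \xa0\xa0纬度:31.3631']]
record = rows_to_dict(rows)
coords = parse_coordinates(record['地理坐标'])
print(coords)
```

In the scraping loop you would collect each row into a list instead of printing it, then pass that list to rows_to_dict.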

source: stackoverflow.com