
There Are Weird Characters Even Though It's Encoded as UTF-8

I spent the last 3 hours trying to solve this problem, even though there are plenty of solutions; none of them works for me. I suspected that the website I'm scraping was corrupted, but Firefox shows the content perfectly. As I said, this has been asked before, but I think my code differs in some way, and I want to learn what that difference is.

from bs4 import BeautifulSoup
import requests

html_text = requests.get('link_for_scrapping').text

soup = BeautifulSoup(html_text, 'lxml')
print(soup.encoding)
soup.encoding = 'utf-8'
print(soup.encoding)

Output:

None
utf-8

Why is the encoding "None" at first? The content I'm looking for is written with Turkish characters, but in other people's code the encoding wasn't "None"; it was something like "ISO-xxxx-x".

Also, when I change it to "utf-8", nothing changes; the same weird characters are still there.

If we add this code, we can see it better:

menu = soup.find(class_="panel-grid-cell col-md-6").text
print(menu)

Output:

None
utf-8
1) 31.01.2022 Pazartesi Yemekler : 
Mercimek Ãorba Fırın Patates Mor Dünya Salatası Sıhhiye Kırmızı Lahana Havuç Salata Elma *Etsiz PatatesKalori : 1099

Whether I change the encoding to "utf-8" or not, the problem persists.

Expected Output:

None
utf-8
1) 31.01.2022 Pazartesi Yemekler : 
Mercimek Çorba Fırın Patates Mor Dünya Salatası Sıhhiye Kırmızı Lahana Havuç Salata Elma *Etsiz PatatesKalori : 1099

Thanks in advance!


Answer

The Problem:

import requests
r = requests.get('link')
print(r.encoding)

Output: ISO-8859-1

The server is not sending a charset in its Content-Type header, and requests does not parse the HTML for <meta charset="utf-8" />, so it falls back to ISO-8859-1 for text responses.
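The mismatch is easy to reproduce without any network access. A minimal stdlib sketch, using a word from the question's output:

```python
# "Çorba" encoded as UTF-8: the "Ç" becomes two bytes
raw = "Çorba".encode("utf-8")        # b'\xc3\x87orba'

# Decoding those UTF-8 bytes as ISO-8859-1 (requests' fallback) garbles them:
mojibake = raw.decode("iso-8859-1")
print(mojibake)   # 'Ã\x87orba' — 0x87 is an invisible control character
                  # in ISO-8859-1, which is why the question shows "Ãorba"

# Decoding with the right codec recovers the original text:
print(raw.decode("utf-8"))           # Çorba
```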

Solution 1: Tell requests what encoding to use

r.encoding = 'utf-8'
html_text = r.text

Solution 2: Do the decoding yourself

html_text = r.content.decode('utf-8')

Solution 3: Have requests take a guess

r.encoding = r.apparent_encoding
html_text = r.text
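apparent_encoding runs a character-set detector (charset_normalizer or chardet, whichever requests has available) over the raw response bytes. A sketch of the idea, constructing a Response by hand; filling in _content directly is a test-only shortcut, not the normal API:

```python
import requests

r = requests.models.Response()
# Simulate a body of Turkish UTF-8 text, as r.content would hold it
r._content = ("Mercimek Çorba Fırın Patates Mor Dünya Salatası " * 5).encode("utf-8")

r.encoding = r.apparent_encoding   # let the detector guess from the bytes
print(r.encoding)                  # a UTF-8 codec name for this input
print("Çorba" in r.text)           # the text now decodes correctly
```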

In any case, html_text will now contain the (correctly decoded) html source and can be fed to BeautifulSoup.
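Putting one of the fixes together with the question's parsing code, as a sketch. The response body is simulated here, since the question's URL is a placeholder; with a live request you would use r.content in place of the literal bytes:

```python
from bs4 import BeautifulSoup

# Simulated raw response body (what r.content would return), UTF-8 encoded
content = '<div class="panel-grid-cell col-md-6">Mercimek Çorba</div>'.encode("utf-8")

# Solution 2 applied: decode the bytes yourself before parsing
html_text = content.decode("utf-8")

soup = BeautifulSoup(html_text, "lxml")
menu = soup.find(class_="panel-grid-cell col-md-6").text
print(menu)   # Mercimek Çorba — no mojibake
```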

The encoding setting of BeautifulSoup didn't help, because at that point you already had a wrongly decoded string! (Incidentally, soup.encoding is not a real BeautifulSoup attribute: reading an unknown attribute on a soup object searches for a child tag of that name, so it returns None when there is no <encoding> tag, and assigning to it just sets a plain Python attribute with no effect on decoding. That is where the "None" in your output comes from.)

source: stackoverflow.com