There Are Weird Characters Even Though It's Encoded as UTF-8
I have spent the last 3 hours trying to solve this problem, even though there are plenty of existing solutions. None of them work for me. I suspected that the website I'm scraping is corrupted, but Firefox shows the content perfectly. As I said, this has been asked before, but I think something is different about my code and I want to learn what it is.
from bs4 import BeautifulSoup
import requests

html_text = requests.get('link_for_scrapping').text
soup = BeautifulSoup(html_text, 'lxml')
print(soup.encoding)
soup.encoding = 'utf-8'
print(soup.encoding)
Why is the encoding "None" at first? The content I'm looking for is written with Turkish characters, but in other people's code the encoding wasn't "None". It was something like "ISO-xxxx-x" or something else.
Also, when I set it to "utf-8", nothing changes. The same weird characters are still there.
If we add this code, we can see it better:
menu = soup.find(class_="panel-grid-cell col-md-6").text
print(menu)
None utf-8 1) 31.01.2022 Pazartesi Yemekler : Mercimek Ãorba FÄ±rÄ±n Patates Mor DÃ¼nya SalatasÄ± SÄ±hhiye KÄ±rmÄ±zÄ± Lahana HavuÃ§ Salata Elma *Etsiz PatatesKalori : 1099
Whether or not I change the encoding to utf-8, the problem persists.
None utf-8 1) 31.01.2022 Pazartesi Yemekler : Mercimek Çorba Fırın Patates Mor Dünya Salatası Sıhhiye Kırmızı Lahana Havuç Salata Elma *Etsiz PatatesKalori : 1099
Thanks in advance!
import requests

r = requests.get('link')
print(r.encoding)
The server is not sending a charset in its Content-Type header, and requests doesn't parse <meta charset="utf-8" />, so it defaults to ISO-8859-1.
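To see why that default produces exactly the characters in the question, here is a minimal sketch that needs no network access. It assumes the page body really is UTF-8 (as the meta tag says); the variable names are only illustrative.

```python
# Simulate what the server sends: part of the menu text as UTF-8 bytes.
original = "Fırın Patates"
raw_bytes = original.encode("utf-8")

# Decoding with the fallback charset turns each two-byte UTF-8 character
# into two unrelated Latin-1 characters.
wrong = raw_bytes.decode("iso-8859-1")

# Decoding with the charset the page actually uses gives the text back.
right = raw_bytes.decode("utf-8")

print(wrong)  # FÄ±rÄ±n Patates
print(right)  # Fırın Patates
```

The first line printed matches the garbled menu in the question, which suggests the bytes themselves are fine and only the decoding step is wrong.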
Solution 1: Tell requests what encoding to use
r.encoding = 'utf-8'
html_text = r.text
Solution 2: Do the decoding yourself
html_text = r.content.decode('utf-8')
Solution 3: Have requests take a guess
r.encoding = r.apparent_encoding
html_text = r.text
In any case, html_text will now contain the (correctly decoded) html source and can be fed to BeautifulSoup.
The encoding setting of BeautifulSoup didn't help, because at that point you already had a wrongly decoded string!
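If all you have left is the wrongly decoded string, relabelling it will not change its characters, but the wrong decoding can usually be reversed. This is only a sketch: repair_mojibake is a hypothetical helper, and it assumes the text was UTF-8 that got decoded as ISO-8859-1 and nothing else happened to it.

```python
def repair_mojibake(text: str) -> str:
    """Reverse a UTF-8 page that was decoded as ISO-8859-1 (hypothetical helper)."""
    # ISO-8859-1 maps every byte to exactly one character, so encoding
    # with it recovers the original bytes, which are then decoded properly.
    return text.encode("iso-8859-1").decode("utf-8")


# Reproduce the problem locally instead of fetching the page.
garbled = "Havuç Salata".encode("utf-8").decode("iso-8859-1")
repaired = repair_mojibake(garbled)

print(garbled)   # HavuÃ§ Salata
print(repaired)  # Havuç Salata
```

If the input is not this particular kind of mojibake, the helper raises UnicodeEncodeError or UnicodeDecodeError, so fixing the decoding at the requests level as in the solutions above is the cleaner route.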