Python Unicode Accent A (à) Hex

I have a string from bs4 that is

s = "vinili-disponibili/311-canzoniere-del-lazio-lassa-st\u00c3\u00a0-la-me-creatura.html"

\u00c3\u00a0 should be an accented a (à). I have gotten it to show up in the console partly correct as

vinili-disponibili/311-canzoniere-del-lazio-lassa-stÃ -la-me-creatura.html

with

str2 = u'%s' % s
print(str2.encode('utf-8').decode('unicode-escape'))

but it's decoding c3 and a0 separately, so I get an A with a tilde (Ã) instead of an accented a (à). I know that c3 a0 is the UTF-8 hex encoding of à. I have no idea what's going on; I got this far by Googling and combining the answers I found. This entire character encoding thing seems like a big mess to me.

The way it is supposed to be is

311-canzoniere-del-lazio-lassa-stà-la-me-creatura.html

EDIT: Andrey's method worked when printing it out, but when I try to use urlopen with the string, I get UnicodeEncodeError: 'ascii' codec can't encode character '\xe0' in position 60: ordinal not in range(128)

After using unquote(str,":/") it gives UnicodeEncodeError: 'ascii' codec can't encode characters in position 56-57: ordinal not in range(128).

Answer

Transform the string back into bytes using .encode('latin-1'), then decode the unicode-escapes \u, transform everything into bytes again using the "wrong" 'latin-1' encoding, and finally, decode "properly" as 'utf-8':

s = "vinili-disponibili/311-canzoniere-del-lazio-lassa-st\u00c3\u00a0-la-me-creatura.html"
s.encode('latin-1').decode('raw_unicode_escape').encode('latin-1').decode('utf-8')

gives:

'vinili-disponibili/311-canzoniere-del-lazio-lassa-stà-la-me-creatura.html'
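
To see what each stage of that chain does, here is a minimal sketch that splits it into named steps. It assumes the string arrives with literal backslash-u text (hence the raw string); the names step1, step2, step3 and fixed are only illustrative.

```python
# Assumption: the scraped text contains the six characters \u00c3 literally,
# so a raw string is used here to keep the backslashes as-is.
s = r"vinili-disponibili/311-canzoniere-del-lazio-lassa-st\u00c3\u00a0-la-me-creatura.html"

# 1. str -> bytes. Every character is ASCII at this point, so latin-1 is lossless.
step1 = s.encode('latin-1')

# 2. Interpret the \uXXXX escapes. This yields U+00C3 (Ã) and U+00A0 (no-break space),
#    i.e. the two UTF-8 bytes of à, each mistaken for a separate character.
step2 = step1.decode('raw_unicode_escape')

# 3. latin-1 maps code points 0-255 to identical byte values: b'\xc3\xa0'.
step3 = step2.encode('latin-1')

# 4. Decode those bytes as what they really are: UTF-8.
fixed = step3.decode('utf-8')

print(fixed)
```

The same chain also works when the escapes have already been interpreted by Python (a normal, non-raw string literal), because step 2 then simply passes the characters through.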

It works for the same reason as explained in this answer.
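
As for the UnicodeEncodeError from urlopen mentioned in the question's EDIT: a URL passed to urlopen has to be ASCII, so one possible approach is to percent-encode the path with urllib.parse.quote (rather than unquote) once the string has been repaired. This is only a sketch: the host below is hypothetical, and it assumes the server expects UTF-8 percent-encoding.

```python
from urllib.parse import quote

base = "http://example.com/"  # hypothetical host, not taken from the question
path = "vinili-disponibili/311-canzoniere-del-lazio-lassa-st\u00e0-la-me-creatura.html"

# quote() encodes non-ASCII characters as UTF-8 and percent-escapes the bytes;
# '/' is left untouched by default, so the path structure is preserved.
url = base + quote(path)
print(url)

# The resulting ASCII-only URL could then be passed to urllib.request.urlopen(url).
```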

source: stackoverflow.com