UnicodeEncodeError: 'ascii' Codec Can't Encode Character '\xe9' When Printing In UTF-8 Locale

I am cleaning the monolingual Europarl corpus for French (http://data.statmt.org/wmt19/translation-task/fr-de/monolingual/europarl-v7.fr.gz). The original raw data is in a .gz file, which I downloaded using wget. I want to extract the text and see what it looks like in order to process the corpus further.

Using the following code to extract the text from the gzip file, I obtained data of type bytes.

import gzip

with gzip.open(file_path, 'rb') as f_in:
    print('type(f_in)=', type(f_in))
    text = f_in.read()
    print('type(text)=', type(text))

The printed results for the first several lines are as follows:

type(f_in)= <class 'gzip.GzipFile'>

type(text)= <class 'bytes'>

b'Reprise de la session\nJe d\xc3\xa9clare reprise la session du Parlement europ\xc3\xa9en qui avait \xc3\xa9t\xc3\xa9 interrompue le vendredi 17 d\xc3\xa9cembre dernier et je vous renouvelle tous mes vux en esp\xc3\xa9rant que vous avez pass\xc3\xa9 de bonnes vacances.\nComme vous avez pu le constater, le grand "bogue de l\'an 2000" ne s\'est pas produit.\n

I tried to decode the binary data with utf8 and ascii with the following code:

with gzip.open(file_path, 'rb') as f_in:
    print('type(f_in)=', type(f_in))
    text = f_in.read().decode('utf8')
    print('type(text)=', type(text))

And it returned errors like this:

UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 26: ordinal not in range(128)

I also tried using the codecs and unicodedata packages to open the file, but that returned an encoding error as well.

Could you please explain what I should do to get the French text in the correct format, like this for example?

Reprise de la session\nJe déclare reprise la session du Parlement européen qui avait été interrompue le vendredi 17 décembre dernier et je vous renouvelle tous mes vux en espérant que vous avez passé de bonnes vacances.\nComme vous avez pu le constater, le grand "bogue de l'an 2000" ne s'est pas produit.\n

Thank you a ton for your help!


Answer

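The UnicodeEncodeError occurs because, when printing, Python has to encode the string to bytes, and the encoding being used here, ASCII, only covers code points 0-127. The character 'é' (U+00E9, shown as '\xe9' in the message) falls outside that range, so the error is raised. The decode step itself is not the problem: a failed decode would raise UnicodeDecodeError instead.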
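To see the failure in isolation, here is a minimal sketch that encodes a short French string with both codecs. The helper name try_encode is just something I made up for illustration; it is not part of the asker's script.

```python
# 'é' is U+00E9 (decimal 233), which lies above ASCII's 0-127 range.
text = "Je déclare"


def try_encode(s, encoding):
    """Return the encoded bytes, or the error message if encoding fails."""
    try:
        return s.encode(encoding)
    except UnicodeEncodeError as exc:
        return str(exc)


ascii_result = try_encode(text, "ascii")   # an error message (str)
utf8_result = try_encode(text, "utf-8")    # the encoded bytes

print(ascii_result)
print(utf8_result)
```

The ASCII attempt produces the same kind of message as in the question, while the UTF-8 attempt yields the two-byte sequence \xc3\xa9 that also appears in the raw bytes shown there.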

Setting the PYTHONIOENCODING environment variable forces Python to use a different encoding for stdin, stdout and stderr - the value of the environment variable. The UTF-8 encoding can encode any character, so calling the program like this solves the issue:

PYTHONIOENCODING=UTF-8 python3 europarl_extractor.py

assuming the code is something like this:

import gzip

if __name__ == '__main__':
    with gzip.open('europarl-v7.fr.gz', 'rb') as f_in:
        bs = f_in.read()
        txt = bs.decode('utf-8')
        print(txt[:100])

The environment variable may be set in other ways - via an export statement, in .bashrc, .profile etc.
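If changing the environment is inconvenient, the output encoding can probably also be changed from inside the script. The sketch below is my own assumption about how one might do that, not something from the question: it first simulates an ASCII-only output stream in memory to reproduce the failure, then defines a hypothetical helper, force_utf8_stdout, that relies on sys.stdout.reconfigure (available from Python 3.7).

```python
import io
import sys

# Simulate an ASCII-only stdout with an in-memory text stream.
ascii_stream = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
try:
    ascii_stream.write("déclare")
    ascii_stream.flush()
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# The same write succeeds once the stream encodes as UTF-8.
utf8_stream = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")
utf8_stream.write("déclare")
utf8_stream.flush()
utf8_bytes = utf8_stream.buffer.getvalue()


def force_utf8_stdout():
    """Switch sys.stdout to UTF-8 if the stream supports reconfigure()."""
    # Hypothetical helper: replaced or captured streams may lack reconfigure().
    if hasattr(sys.stdout, "reconfigure"):
        sys.stdout.reconfigure(encoding="utf-8")


force_utf8_stdout()
print(ascii_failed, utf8_bytes)
```

Calling force_utf8_stdout() before the first print should have roughly the same effect as PYTHONIOENCODING for stdout, but it only helps if the terminal itself can display UTF-8.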

An interesting question is why Python is trying to encode output as ASCII. I had assumed that on *nix systems, Python essentially looked at the $LANG environment variable to determine the encoding to use. But in this case the value of $LANG is fr_FR.UTF-8, and yet Python is using ASCII as the output encoding.

From looking at the source for the locale module, and this FAQ, these environment variables are checked, in order:

'LC_ALL', 'LC_CTYPE', 'LANG', 'LANGUAGE'

So it may be that one of LC_ALL or LC_CTYPE has been set to a value that mandates ASCII encoding in your environment (you can check by running the locale command in your terminal; also running locale charmap will tell you the encoding itself).
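The same check can be done from within Python. This is a rough sketch, assuming a hypothetical helper called locale_report; it reads the four environment variables listed above and asks Python which encodings it has actually settled on.

```python
import locale
import os
import sys


def locale_report():
    """Collect the locale-related variables and Python's chosen encodings."""
    names = ("LC_ALL", "LC_CTYPE", "LANG", "LANGUAGE")
    # Unset variables are reported as None.
    report = {name: os.environ.get(name) for name in names}
    # Encoding Python derives from the locale settings.
    report["preferred_encoding"] = locale.getpreferredencoding(False)
    # Encoding actually attached to stdout (None if the stream has none).
    report["stdout_encoding"] = getattr(sys.stdout, "encoding", None)
    return report


for key, value in locale_report().items():
    print(f"{key} = {value!r}")
```

If stdout_encoding is an ASCII variant while LANG names a UTF-8 locale, an overriding LC_ALL or LC_CTYPE value is a likely suspect.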

source: stackoverflow.com