
UnicodeEncodeError: 'charmap' Codec Can't Encode Character '\ufeff' In Position 0: Character Maps To <undefined>


I'm working on an application that uses UTF-8 encoding. For debugging purposes I need to print the text. If I call print() directly on the variable containing my Unicode string, e.g. print(pred_str), I get this error:

UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>

So I tried print(pred_str.encode('utf-8')) and my output looks like this:

b'\xef\xbb\xbfpudgala-dharma-nair\xc4\x81tmyayo\xe1\xb8\xa5 apratipanna-vipratipann\xc4\x81n\xc4\x81m' b'avipar\xc4\xabta-pudgala-dharma-nair\xc4\x81tmya-pratip\xc4\x81dana-artham' b'tri\xe1\xb9\x83\xc5\x9bik\xc4\x81-vij\xc3\xb1apti-prakara\xe1\xb9\x87a-\xc4\x81rambha\xe1\xb8\xa5' b'pudgala-dharma-nair\xc4\x81tmya-pratip\xc4\x81danam punar kle\xc5\x9ba-j\xc3\xb1eya-\xc4\x81vara\xe1\xb9\x87a-prah\xc4\x81\xe1\xb9\x87a-artham'

But, I want my output to look like this:

pudgala-dharma-nairātmyayoḥ apratipanna-vipratipannānām aviparīta-pudgala-dharma-nairātmya-pratipādana-artham triṃśikā-vijñapti-prakaraṇa-ārambhaḥ pudgala-dharma-nairātmya-pratipādanam punar kleśa-jñeya-āvaraṇa-prahāṇa-artham

If I save my string to a file using:

import codecs

with codecs.open('out.txt', 'w', 'UTF-8') as f:
    f.write(pred_str)

it saves the string as expected.


Answer

Your data is encoded with the "UTF-8-SIG" codec, which is sometimes used in Microsoft environments.

This variant of UTF-8 prefixes encoded text with a byte order mark, '\xef\xbb\xbf', to make it easier for applications to distinguish UTF-8 encoded text from other encodings.

You can decode such bytestrings like this:

>>> bs = b'\xef\xbb\xbfpudgala-dharma-nair\xc4\x81tmyayo\xe1\xb8\xa5 apratipanna-vipratipann\xc4\x81n\xc4\x81m'
>>> text = bs.decode('utf-8-sig')
>>> print(text)
pudgala-dharma-nairātmyayoḥ apratipanna-vipratipannānām
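
To see why decoding with plain 'utf-8' can lead to the error in the question, here is a minimal sketch. The sample word and the choice of cp1252 as the console code page are assumptions for illustration; your console may use a different legacy code page.

```python
import codecs

# Bytes as a BOM-prefixed UTF-8 file would contain them (sample text is illustrative).
raw = codecs.BOM_UTF8 + "nairātmya".encode("utf-8")

# Plain UTF-8 keeps the byte order mark as a leading U+FEFF character.
with_bom = raw.decode("utf-8")

# UTF-8-SIG strips the byte order mark while decoding.
without_bom = raw.decode("utf-8-sig")

print(ascii(with_bom))     # the leading '\ufeff' is visible in the escaped form
print(ascii(without_bom))  # no leading '\ufeff'

# Assumption: the console uses a legacy code page such as cp1252,
# which has no mapping for U+FEFF, so encoding fails at position 0.
try:
    with_bom.encode("cp1252")
except UnicodeEncodeError as exc:
    print("encode failed:", exc)
```

If that is what happens in your case, the '\ufeff' at position 0 in the traceback is the byte order mark that survived decoding, not one of the accented characters.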

To read such data from a file:

with open('myfile.txt', 'r', encoding='utf-8-sig') as f:
    text = f.read()
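
As a rough check of the difference between the two codecs when reading files, the sketch below writes a small BOM-prefixed file to a temporary directory and reads it back both ways. The file name and sample text are made up for the example.

```python
import os
import tempfile

sample = "triṃśikā-vijñapti"  # illustrative text, not the asker's real data

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # hypothetical file name

    # Writing with 'utf-8-sig' puts a byte order mark at the start of the file.
    with open(path, "w", encoding="utf-8-sig") as f:
        f.write(sample)

    # Reading with plain 'utf-8' leaves the mark in the string as '\ufeff'.
    with open(path, "r", encoding="utf-8") as f:
        plain = f.read()

    # Reading with 'utf-8-sig' removes it.
    with open(path, "r", encoding="utf-8-sig") as f:
        clean = f.read()

print(plain.startswith("\ufeff"))
print(clean == sample)
```

The 'utf-8-sig' codec also reads files that have no byte order mark, so it is usually a safe choice when you are not sure how the file was produced.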

Note that even after decoding from UTF-8-SIG, you may still be unable to print your data, because your console's default code page may not be able to encode the other non-ASCII characters in the data. In that case you will need to adjust your console settings to support UTF-8.
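
If changing the console settings is not an option, one possible workaround for debugging output is to escape whatever the console encoding cannot represent instead of letting print() raise. This is only a sketch: the helper name printable is made up here, and cp1252 is used purely as an example of a limited code page.

```python
import sys


def printable(text, encoding=None):
    """Return text with characters the target encoding cannot represent
    replaced by backslash escapes (hypothetical helper for debugging)."""
    # Fall back to UTF-8 if stdout has no known encoding (e.g. when redirected).
    enc = encoding or sys.stdout.encoding or "utf-8"
    return text.encode(enc, errors="backslashreplace").decode(enc)


# cp1252 can represent 'ñ' but not 'ś', so only 'ś' is escaped.
print(printable("kleśa-jñeya", encoding="cp1252"))
```

On Python 3.7 or later, calling sys.stdout.reconfigure(encoding='utf-8') or setting the PYTHONIOENCODING environment variable may also help, but the terminal itself still has to be able to display the characters.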

source: stackoverflow.com