
How To Remove All Conflicting Characters Between Latin1 And Utf-8 Using Python?

- 1 answer

I call open(file, "r") and read some lines in Python. This gives me:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 4: ordinal not in range(128)

If I add 'utf-8', I get:

'utf8' codec can't decode bytes in position 28-29: invalid continuation byte

If I add 'ISO-8859-1', I get no errors but a line is read like this:

2890 ready to try Argh� Fantasy Surfer Carnage� Dane, Marlon & Nat C all out!  #fantasysurfer

As you can see, there are some extra characters, which probably come from emojis or something similar (these are tweets).

What is the best approach to clean these lines up?

I would like to remove all the extraneous elements, so that the strings contain only numbers, letters, and common symbols such as ?!>.;,

Note: I don't care about the html entities, since I replace those in another function. I am talking about the weird Argh� Carnage� elements.

In general, these are causing issues with the encoding.


Answer

First try decoding the byte string and then encoding it (Python 2):

"text".decode('latin-1').encode('utf-8')

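Or try opening the file with codecs:

import codecs
# Replace "your encoding" with the file's actual encoding, e.g. "utf-8"
with codecs.open('file', encoding="your encoding") as f:
    text = f.read()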

Either the file is being opened with the wrong encoding, or the encoding you identified for it is incorrect.
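
To illustrate why ISO-8859-1 "works" yet shows stray characters, here is a small Python 3 sketch. The sample string is hypothetical (it is not taken from the asker's file); it only shows that Latin-1 accepts any byte sequence, while UTF-8 recovers the intended text when the bytes really are UTF-8.

```python
# Hypothetical sample: "Argh" followed by a UTF-8 encoded emoji (U+1F604).
raw = "Argh\U0001F604 Carnage".encode("utf-8")

# Latin-1 maps every byte to one character, so decoding never fails,
# but multi-byte UTF-8 sequences turn into several junk characters.
as_latin1 = raw.decode("latin-1")

# Decoding with the encoding the bytes were written in restores the text.
as_utf8 = raw.decode("utf-8")

print(repr(as_latin1))
print(repr(as_utf8))
```

If decoding as UTF-8 raises an error, as it does in the question, the file is probably not pure UTF-8, and a lenient error handler or a different encoding may be needed.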

Also, if the text you receive is ASCII, decode it like this:

'abc'.decode('ascii')

or

unicode('abc', 'ascii')
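
If the goal is simply to drop everything except letters, numbers, and common symbols, one possible approach is sketched below in Python 3. It assumes "common symbols" means printable ASCII; the `clean_line` helper and the sample bytes are hypothetical and only illustrate the idea.

```python
import re

# Assumption: the characters worth keeping are printable ASCII (space to tilde).
_NOT_ALLOWED = re.compile(r"[^\x20-\x7E]+")

def clean_line(raw_bytes):
    """Decode bytes leniently and drop everything outside printable ASCII."""
    # errors="ignore" skips byte sequences that are not valid UTF-8
    # instead of raising UnicodeDecodeError.
    text = raw_bytes.decode("utf-8", errors="ignore")
    return _NOT_ALLOWED.sub("", text)

# Hypothetical line containing an encoded replacement character and a stray byte.
sample = b"ready to try Argh\xef\xbf\xbd Fantasy Surfer Carnage\xe9 all out!"
print(clean_line(sample))
```

Note that this also removes legitimate accented letters, so it only fits cases where plain ASCII output is acceptable.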
source: stackoverflow.com