Python Read String From File With Strange Encoding

I made a pig latin translator that takes input from the user, translates it, and returns it. I want to add the ability to read the text from a file, but the file's contents aren't coming out as I expect. Here is my code:

from sys import argv
script, filename = argv

file = open(filename, "r")

sentence = file.read()

print sentence

file.close()

The problem is that when I print out the information inside the file it looks like this:

■T h i s   i s   s o m e   t e x t   i n   a   f i l e

Instead of this:

This is some text in a file

I know I could work around the spaces and the odd square character with slicing, but I feel like that is treating a symptom. I want to understand why the text is formatted so strangely so that I can fix the cause.

Answer

I believe this is a UTF-16 encoded file, and the odd square character at the start is the Unicode byte order mark (BOM). It could also be another encoding that uses a byte order mark, but it definitely appears to be a multi-byte encoding.

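This is also why you're seeing the whitespace between characters. UTF-16 represents most characters, including all of ASCII, as two bytes. For standard ASCII characters like the ones you're using, one byte of each pair is 0 (the second byte, if the file is little-endian). When the file is read as single-byte text, those zero bytes show up as the gaps between letters.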
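To make the byte layout concrete, here is a small sketch that builds the UTF-16 bytes for a short string by hand and prints them. It assumes little-endian byte order, which is only a guess based on the output in the question; the variable names are illustrative.

```python
import codecs

text = u"This"

# Assumption: little-endian UTF-16, written with a leading BOM.
encoded = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# The BOM comes first, then each ASCII character as its own byte
# followed by a zero byte.
print(repr(encoded))

# Decoding with the generic "utf-16" codec reads the BOM to pick the
# byte order and strips it from the result.
decoded = encoded.decode("utf-16")
print(decoded)
```

Opening the file without an encoding hands back those raw bytes, so the BOM and the zero bytes end up in the printed output.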

Try this instead:

from sys import argv
import codecs

script, filename = argv

# Decode the file as UTF-16; the codec uses the BOM to pick the byte order
# and strips it from the text.
with codecs.open(filename, encoding='utf-16') as f:
    sentence = f.read()

print(sentence)

If 'utf-16' turns out to be the wrong guess, replace encoding='utf-16' with the file's actual encoding. You might need to try a few and experiment.
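Rather than guessing blindly, you can look at the first few bytes of the file and compare them with the BOM constants in the codecs module. The following is a rough sketch, not a full encoding detector: the helper names guess_encoding_from_bom and read_text are made up for this example, and falling back to UTF-8 when no BOM is present is an assumption that may not suit your files.

```python
import codecs

# Check the UTF-32 marks first: the UTF-32 LE BOM begins with the same
# two bytes as the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding_from_bom(raw, default="utf-8"):
    """Return a codec name based on the BOM at the start of the raw bytes.

    Hypothetical helper; `default` is only an assumed fallback.
    """
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    return default


def read_text(path):
    """Read a file as bytes, then decode it with the guessed codec."""
    with open(path, "rb") as f:
        raw = f.read()
    # The "utf-16", "utf-32" and "utf-8-sig" codecs all strip the BOM.
    return raw.decode(guess_encoding_from_bom(raw))
```

With this in place, sentence = read_text(filename) would take the place of the codecs.open call. A file with no BOM still needs a manual guess.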

source: stackoverflow.com