
Same Character, Different Length And Bytes


When I download files from Korean websites, the filenames are often wrongly encoded/decoded and end up jumbled. I found that encoding with 'iso-8859-1' and decoding with 'euc-kr' fixes this. However, I now have a new problem: characters that look the same are in fact different. Check out the Python shell below:

>>> first_string = 'â'
>>> second_string = 'â'
>>> len(first_string)
1
>>> len(second_string)
2
>>> list(first_string)
['â']
>>> list(second_string)
['a', '̂']
>>>

The first string can be encoded with 'iso-8859-1'; the second cannot. So the questions:

  1. What is the difference between these two strings?
  2. Why would downloads from the same website have the same character in varying format? (If that's what the difference is.)
  3. And how can I fix this? (e.g. convert second_string into the same form as first_string)

Thank you.


Answer

  1. An easy way to find out exactly what a character is: ask vim. Put the cursor on the character and type ga to get info on it.

    The first one is:

    <â> 226, Hex 00e2, Octal 342
    

    And the second:

    <a>  97,  Hex 61,  Octal 141 < ̂> 770, Hex 0302, Octal 1402
    

    In other words, the first is a single precomposed "a with circumflex" character (U+00E2), and the second is a regular a (U+0061) followed by a combining circumflex accent (U+0302).

  2. Ask the website operators. How would we know?!

  3. You need something that composes a base character and its combining characters into a single precomposed character. A Google search yielded this question, for example.

    As you pointed out in your comment, and as clemens pointed out in another answer, in Python you can use unicodedata.normalize with 'NFC' as the form.
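
As a rough sketch of the points above, the snippet below inspects both forms with unicodedata.name and then composes the second into the first with unicodedata.normalize('NFC', ...). The variable names and the describe helper are made up for illustration; the strings are written with escapes so the two forms cannot be confused when copied.

```python
import unicodedata

# Written with escapes so the two forms stay distinct when copy-pasted.
precomposed = "\u00e2"   # one code point: a with circumflex
decomposed = "a\u0302"   # two code points: 'a' + combining circumflex


def describe(text):
    """Hypothetical helper: list each code point with its Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


print(describe(precomposed))
print(describe(decomposed))

# NFC composes a base character and its combining marks where a
# precomposed character exists.
composed = unicodedata.normalize("NFC", decomposed)
print(composed == precomposed, len(composed))

# After normalization the string fits in iso-8859-1.
latin1_bytes = composed.encode("iso-8859-1")

# The decomposed form does not, because U+0302 has no iso-8859-1 byte.
try:
    decomposed.encode("iso-8859-1")
    decomposed_encodable = True
except UnicodeEncodeError:
    decomposed_encodable = False
print(latin1_bytes, decomposed_encodable)
```

Normalizing the filename to NFC before the 'iso-8859-1' / 'euc-kr' round trip should make both variants behave the same, assuming the only difference between them is composed versus decomposed form.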

source: stackoverflow.com