
UnicodeDecodeError When Concatenating Strings

I've got the following little Python 2.7 script:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import geoip2.database

def ret_country_iso(ip):
    reader = geoip2.database.Reader('/usr/local/geoip/GeoLite2-Country.mmdb')
    response = reader.country(ip)
    return response.country.iso_code.lower()

result = ret_country_iso("8.8.8.8")
print result
result += "Роман"
print result

where, as you can see, I first figure out the country where the "8.8.8.8" IP is located (this returns "us" - see below) and then I concatenate a short string to it which contains some Russian characters.

Result:

# ./script.py
us
Traceback (most recent call last):
   File "./script.py", line 12, in <module>
    result += "Роман"
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)

Now, if I try the following instead:

#!/usr/bin/python
# -*- coding: utf-8 -*-

result = "us"
print result
result += "Роман"
print result

then everything's OK:

./script.py 
us
usРоман

Evidently the 'ret_country_iso()' function returns something other than the literal "us" string, but my Python isn't good enough to tell what.

How can I correct this?

EDIT: following the advice of snakecharmerb, the following works:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import geoip2.database

def ret_country_iso(ip):
    reader = geoip2.database.Reader('/usr/local/geoip/GeoLite2-Country.mmdb')
    response = reader.country(ip)
    return response.country.iso_code.lower().encode('utf-8')

result = ret_country_iso("8.8.8.8")
print result
result += "Роман"
print result

Answer

Python 2 does not strictly separate unicode and bytes (str). When the two are combined, it implicitly decodes the str operand with the ASCII codec, so the result of concatenating them depends on the contents of the str:

u'abc' + 'def'

succeeds, but

u'US' + 'Роман'

raises a UnicodeDecodeError, because the UTF-8 bytes of 'Роман' are not valid ASCII. The usual approach, the "Unicode Sandwich" pattern, is to decode incoming bytes and encode outgoing text at the edges of an application, and to work only with unicode inside it (applications that deal primarily with bytes adopt the reverse pattern).
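To see why one concatenation succeeds and the other fails, it can help to write out the decode step that Python 2 performs behind the scenes. The following is only a sketch: `implicit_concat` is a hypothetical helper, not anything Python provides, and it assumes the default encoding is ASCII (the Python 2 default). Byte literals are used so that it behaves the same on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch: make explicit the decode that Python 2 performs implicitly
# when a unicode object is concatenated with a byte string (str).

def implicit_concat(text, raw):
    """Hypothetical helper mimicking u'...' + '...' in Python 2:
    the bytes are decoded as ASCII before being appended."""
    return text + raw.decode('ascii')

# ASCII-only bytes decode cleanly, so the concatenation succeeds.
ok = implicit_concat(u'abc', b'def')

# The UTF-8 encoding of u'Роман' starts with byte 0xd0, which is
# outside the ASCII range, so the decode step fails.
roman_bytes = u'Роман'.encode('utf-8')
try:
    implicit_concat(u'US', roman_bytes)
    error = None
except UnicodeDecodeError as exc:
    error = exc
```

Here `error` carries the same complaint as the traceback in the question: the 'ascii' codec cannot decode byte 0xd0 at position 0.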

So, when combining str and unicode instances you can take either of these options:

# unicode result
u'US ' + 'Роман'.decode('utf-8')

# str result
u'US '.encode('utf-8') + 'Роман'

but the key is to be consistent throughout your code; otherwise you will end up with a lot of errors.
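One way to stay consistent is to structure the code as a Unicode Sandwich. The sketch below is illustrative only: the three helper names are hypothetical, UTF-8 is assumed for both edges, and the country code is assumed to arrive as unicode already (which is what the error in the question suggests).

```python
# -*- coding: utf-8 -*-
# Sketch of the "Unicode Sandwich": bytes at the edges, unicode inside.
# All helper names here are hypothetical, chosen for illustration.

def from_outside(raw, encoding='utf-8'):
    """Input edge: turn incoming bytes into unicode text."""
    return raw.decode(encoding)

def build_label(country_code, name):
    """Inside: both arguments are unicode, so concatenation is safe."""
    return country_code + u' ' + name

def to_outside(text, encoding='utf-8'):
    """Output edge: turn unicode text back into bytes."""
    return text.encode(encoding)

# The byte string below is the UTF-8 encoding of u'Роман'.
name = from_outside(b'\xd0\xa0\xd0\xbe\xd0\xbc\xd0\xb0\xd0\xbd')
label = build_label(u'us', name)   # unicode throughout
payload = to_outside(label)        # bytes, ready to write or send
```

Because every value between the two edges is unicode, no implicit ASCII decode is ever triggered.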

Python 3 is stricter about separating the two types; if possible you should consider using it both for better unicode handling and because Python 2 is no longer supported.
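As a rough illustration of that strictness, the sketch below (Python 3 only) shows that mixing text and bytes fails immediately with a TypeError regardless of the contents, rather than succeeding for ASCII data and failing later for anything else. The variable names are arbitrary.

```python
# Python 3 only: str (text) and bytes are never mixed implicitly.
text = 'us'
raw = 'Роман'.encode('utf-8')   # bytes

try:
    text + raw                  # no implicit decode is attempted
    mixed_error = None
except TypeError as exc:
    mixed_error = exc

# Decoding explicitly is the only way to combine the two.
combined = text + raw.decode('utf-8')
```

The failure no longer depends on whether the bytes happen to be ASCII, so this class of bug surfaces the first time the code runs.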

source: stackoverflow.com