UnicodeDecodeError When Concatenating Strings
I've got the following little Python 2.7 script:
    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    import geoip2.database

    def ret_country_iso(ip):
        reader = geoip2.database.Reader('/usr/local/geoip/GeoLite2-Country.mmdb')
        response = reader.country(ip)
        return response.country.iso_code.lower()

    result = ret_country_iso("18.104.22.168")
    print result
    result += "Роман"
    print result
where, as you can see, I first look up the country in which the IP address is located (this returns "us", see below) and then append a short string containing some Russian characters to it.
    # ./script.py
    us
    Traceback (most recent call last):
      File "./script.py", line 12, in <module>
        result += "Роман"
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)
Now, if I try the following instead:
    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    result = "us"
    print result
    result += "Роман"
    print result
Then everything's ok:
    ./script.py
    us
    usРоман
Obviously, then, the 'ret_country_iso()' function returns something other than the literal "us" string, but my Python isn't good enough to say what.
How can I correct this?
EDIT: following the advice of snakecharmerb, the following works:
    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    import geoip2.database

    def ret_country_iso(ip):
        reader = geoip2.database.Reader('/usr/local/geoip/GeoLite2-Country.mmdb')
        response = reader.country(ip)
        return response.country.iso_code.lower().encode('utf-8')

    result = ret_country_iso("126.96.36.199")
    print result
    result += "Роман"
    print result
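The traceback shows that ret_country_iso() returns a unicode object, while the "Роман" literal is a byte string (str). Python 2 does not strictly distinguish between unicode and bytes: when the two are combined, it implicitly decodes the str operand with the ASCII codec, so the results of concatenating the two types are inconsistent:
    u'abc' + 'def'
works, because the bytes are pure ASCII, but
    u'US' + 'Роман'
results in an exception, because the UTF-8 bytes of 'Роман' are not valid ASCII. The usual approach - the "Unicode Sandwich" pattern - is to decode and encode string-type data at the edges of an application, and work only with unicode within the application (for applications which deal primarily with bytes the reverse pattern is adopted).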
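To see why only the second expression fails, it can help to spell out the conversion that Python 2 performs behind the scenes. The sketch below makes the ASCII decode explicit, so it should behave the same on Python 2 and Python 3; `concat_like_py2` is a hypothetical helper name used only for illustration, not something Python or geoip2 provides.

```python
# -*- coding: utf-8 -*-


def concat_like_py2(text, raw):
    """Mimic Python 2's `unicode + str` by decoding the bytes as ASCII first.

    Illustrative helper only; the name is an assumption, not a library API.
    """
    return text + raw.decode('ascii')


# ASCII-only bytes decode cleanly, so the concatenation succeeds.
print(concat_like_py2(u'abc', b'def'))

# The UTF-8 encoding of 'Роман' starts with byte 0xd0, which is not ASCII.
roman_bytes = u'Роман'.encode('utf-8')
try:
    concat_like_py2(u'US', roman_bytes)
except UnicodeDecodeError as exc:
    print('failed: %s' % exc)
```

The failing call reports the same codec, byte, and position as the traceback in the question, which is a hint that the left-hand operand there was a unicode object.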
So, when combining str and unicode instances you can take either of these options:
    # unicode result
    u'US ' + 'Роман'.decode('utf-8')
    # str result
    u'US '.encode('utf-8') + 'Роман'
but the key is to be consistent throughout your code, otherwise you will end up with a lot of errors.
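One way to stay consistent is to normalise every incoming value to text at the boundary and encode only once on the way out. This is a minimal sketch of that idea, assuming UTF-8 throughout; `to_text` and `build_label` are hypothetical names introduced for the example, and the code is written to run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-


def to_text(value, encoding='utf-8'):
    """Return `value` as text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):  # `bytes` is an alias of `str` on Python 2
        return value.decode(encoding)
    return value


def build_label(country_code, name):
    # Inside the application only text objects are ever combined.
    return to_text(country_code) + to_text(name)


# Mixed input types: text for the country code, UTF-8 bytes for the name.
label = build_label(u'us', u'Роман'.encode('utf-8'))

# Encode exactly once, at the output edge (file, socket, terminal, ...).
output_bytes = label.encode('utf-8')
```

With this arrangement it no longer matters whether a library hands back unicode or bytes, because everything is converted before it is used.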
Python 3 is stricter about separating the two types; if possible you should consider using it both for better unicode handling and because Python 2 is no longer supported.
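As a rough illustration of that stricter behaviour, the following Python 3 snippet mixes text and bytes on purpose. Instead of attempting an implicit ASCII decode, the interpreter refuses the operation regardless of the bytes involved; the variable names are made up for the example.

```python
# Python 3 only: str (text) and bytes are never mixed implicitly.
country = 'us'
name_bytes = 'Роман'.encode('utf-8')

try:
    country + name_bytes
except TypeError as exc:
    print('refused: %s' % exc)

# An explicit decode names the encoding and yields text.
result = country + name_bytes.decode('utf-8')
```

Because the error no longer depends on whether the data happens to be ASCII, the mismatch shows up the first time the code runs rather than only when non-ASCII input arrives.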