What Does The Position Number Mean In This UnicodeError: 'gbk' Codec Can't Decode Byte 0x91 In Position 2196

- 1 answer

I often run into this kind of error when I manipulate data, for example "UnicodeDecodeError: 'gbk' codec can't decode byte 0x91 in position 2196: illegal multibyte sequence".

I have found a way to bypass this error but my curiosity drives me to investigate what is in position 2196.

Here comes the question:

How should I interpret the number 2196? In what units is the position counted when I count 1, 2, ..., 2196: UTF-8 characters, GBK characters, raw bytes, hex digits, or something else? And how can I see what is at that position without raising the error?

Here is a code portion as an example:

with open(r"G:\ETCData\6aMTC\2019-06-01.txt", "r") as fp:
    for i, line in enumerate(fp):
        if i == 6:
            pass

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-2-6810d8c84b34> in <module>()
      1 with open(r"G:\ETCData\6aMTC\2019-06-01.txt", "r") as fp:
----> 2     for i, line in enumerate(fp):
      3         if i == 6:
      4             pass

UnicodeDecodeError: 'gbk' codec can't decode byte 0x91 in position 2196: illegal multibyte sequence

Answer

The problem here is that the decoding function is called from the internals of the io system, which reads the file in chunks. 2196 is the zero-based byte offset of the offending byte in the read buffer, not a character count in any encoding. If your file was short enough to be read in a single chunk, or if the error occurred in the first chunk, then it is also the position of the byte in the file. In any other case, it will be hard to guess the real position of the byte in the file from this number alone.
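
One way to get an offset that is relative to the whole file rather than to an internal buffer is to read the raw bytes yourself and decode them in one call. The sketch below is only an illustration: `find_undecodable_byte` is a hypothetical helper name, and `sample` is made-up data standing in for the file contents (with a real file, the bytes would come from opening it in `"rb"` mode and calling `read()`).

```python
def find_undecodable_byte(data: bytes, encoding: str = "gbk", context: int = 8):
    """Return (offset, bad_byte, nearby_bytes), or None if data decodes cleanly.

    offset is a zero-based byte index into data, the same kind of number
    that the error message reports as "position".
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        start = exc.start  # index of the first byte the codec rejected
        nearby = data[max(start - context, 0):start + context]
        return start, data[start:start + 1], nearby
    return None


# Made-up stand-in for the raw file contents: 0x91 followed by a space
# is not a valid GBK byte pair.
sample = b"abc\x91 def"

result = find_undecodable_byte(sample)
print(result)

# To look at the text without raising, the offending bytes can be shown
# as escapes instead of being decoded.
print(sample.decode("gbk", errors="backslashreplace"))
```

Because the whole byte string is decoded at once, the reported offset can be used directly to slice the data and inspect the bytes around the problem.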

BTW, b'\x91' is the LEFT SINGLE QUOTATION MARK (Unicode U+2018) in the cp1252 charset.
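
The cp1252 mapping can be checked with the standard library. This is just a quick verification of the byte's meaning in cp1252; it does not prove that the file in the question is actually cp1252-encoded.

```python
import unicodedata

# Decode the single byte 0x91 as cp1252 and look up the resulting character.
ch = b"\x91".decode("cp1252")
print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# prints: U+2018 LEFT SINGLE QUOTATION MARK
```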

source: stackoverflow.com