Unzipping Of Downloaded .gz File Fails Due To UnicodeDecodeError

I'm trying to download a .gz file, decompress it in memory, and then read the decompressed content line by line.

import gzip
import requests

r = requests.get(url)
print(r.headers)

with gzip.open(r.content, 'rb') as f:
    '''Reading line by line'''

Now the headers look like this:

{'Date': 'Fri, 23 Aug 2019 07:19:28 GMT', 'Server': 'Apache', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'sameorigin', 'Referrer-Policy': 'no-referrer', 'X-Xss-Protection': '1', 'Last-Modified': 'Sat, 23 Jun 2018 09:21:46 GMT', 'ETag': '"8be6ca-56f4bad760d07"', 'Accept-Ranges': 'bytes', 'Content-Length': '9168586', 'X-Clacks-Overhead': 'GNU Terry Pratchett', 'Cache-Control': 'public, max-age=120', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive', 'Content-Type': 'application/x-gzip'}

The error looks like an encoding problem, but I assume requests already provides r.content as UTF-8 and that gzip.open() expects UTF-8, so I don't understand this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

There are similar questions on SO, which didn't help me.

Answer

According to [Python-Requests.2]: Developer Interface - class requests.Response.content (emphasis is mine):

Content of the response, in bytes.

On the other hand, [Python 3.Docs]: gzip.open(filename, mode='rb', compresslevel=9, encoding=None, errors=None, newline=None):

The filename argument can be an actual filename (a str or bytes object), or an existing file object to read from or write to.

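Because r.content is a bytes object, gzip.open interprets it as a file name, not as compressed data. To get past this issue, don't pass r.content directly to gzip.open; instead, either:

  • Wrap it in an io.BytesIO object (a file-like object), and pass that to gzip.open: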

    with gzip.open(io.BytesIO(r.content)) as f:
        # Your original code (that reads line by line)
    
  • Pass it to gzip.decompress:

    extracted = gzip.decompress(r.content)
    for line in extracted.splitlines():
        # Process each line
        print(line.decode())
    

    or (combining with previous bullet)

    with io.BytesIO(gzip.decompress(r.content)) as f:
        # Your original code (that reads line by line)
    
  • Save it into a file, and pass its name to gzip.open (note that this is very slow, and can introduce other potential issues, as pointed out by @Aran-Fey):

    file_name = "content.gzip"
    with open(file_name, "wb") as f:
        f.write(r.content)
    with gzip.open(file_name, 'rb') as f:
        # Your original code (that reads line by line)
    os.unlink(file_name)
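
The two in-memory options can be tried without any network access. The following is only a minimal sketch: the helper name iter_lines_from_gzip_bytes and the sample text are made up for illustration, and gzip.compress stands in for the r.content returned by requests. It also shows where the 0x8b in the error message comes from: it is the second byte of the gzip magic number, so the compressed data itself was being decoded as if it were a UTF-8 file name.

```python
import gzip
import io


def iter_lines_from_gzip_bytes(payload, encoding="utf-8"):
    """Yield decoded text lines from gzip-compressed bytes held in memory.

    Hypothetical helper, not part of gzip or requests.
    """
    # io.BytesIO makes the bytes behave like a file object, which gzip.open
    # accepts. Mode "rt" asks gzip.open to decode the lines to str.
    with gzip.open(io.BytesIO(payload), mode="rt", encoding=encoding) as f:
        for line in f:
            yield line.rstrip("\n")


# Stand-in for r.content: compress some sample text in memory (assumed data).
sample_text = "first line\nsecond line\nthird line\n"
payload = gzip.compress(sample_text.encode("utf-8"))

# Every gzip stream starts with the bytes 0x1f 0x8b.
print(payload[:2])

lines = list(iter_lines_from_gzip_bytes(payload))
print(lines)

# gzip.decompress returns the same data in a single call.
decompressed_lines = gzip.decompress(payload).decode("utf-8").splitlines()
print(decompressed_lines == lines)
```

With a real response, payload would simply be r.content. The BytesIO route reads lines lazily from the decompressor, while gzip.decompress builds the whole decompressed result in memory first, which may matter for large files.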
    
source: stackoverflow.com