Unzipping Of Downloaded .gz File Fails Due To UnicodeDecodeError


I'm trying to download a .gz file, unzip it in memory, and then read the unzipped content line by line.

import gzip
import requests

r = requests.get(url)

with gzip.open(r.content, 'rb') as f:
    '''Reading line by line'''

Now the headers look like this:

{'Date': 'Fri, 23 Aug 2019 07:19:28 GMT', 'Server': 'Apache', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'sameorigin', 'Referrer-Policy': 'no-referrer', 'X-Xss-Protection': '1', 'Last-Modified': 'Sat, 23 Jun 2018 09:21:46 GMT', 'ETag': '"8be6ca-56f4bad760d07"', 'Accept-Ranges': 'bytes', 'Content-Length': '9168586', 'X-Clacks-Overhead': 'GNU Terry Pratchett', 'Cache-Control': 'public, max-age=120', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive', 'Content-Type': 'application/x-gzip'}

The error I get looks like an encoding error. I assumed that requests already provides r.content as UTF-8, and that UTF-8 is what is expected here, so I don't understand this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

There are similar questions on SO, but they didn't help me.



According to [Python-Requests.2]: Developer Interface - class requests.Response.content (emphasis is mine):

Content of the response, in bytes.

On the other hand, [Python 3.Docs]: gzip.open(filename, mode='rb', compresslevel=9, encoding=None, errors=None, newline=None):

The filename argument can be an actual filename (a str or bytes object), or an existing file object to read from or write to.
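
The 0x8b in the error message fits this picture: a gzip stream begins with the magic bytes 0x1f 0x8b, and 0x8b is not a valid UTF-8 start byte. The sketch below is only an illustration, not the code path gzip.open itself takes. It uses made-up data compressed locally as a stand-in for r.content and decodes it explicitly to reproduce the same message.

```python
import gzip

# Stand-in for r.content (made-up sample data, compressed locally).
payload = gzip.compress(b"hello\n")

# Every gzip stream starts with the two magic bytes 0x1f 0x8b.
print(payload[:2])

message = ""
try:
    # Decoding compressed bytes as text fails at the second byte.
    payload.decode("utf-8")
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)
```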

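Since r.content is a bytes object, gzip.open treats it as a filename, not as compressed data. To get past this issue, you shouldn't pass r.content to gzip.open directly, but do one of the following: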

  • Wrap it in an io.BytesIO object, and pass that to gzip.open:

    with gzip.open(io.BytesIO(r.content)) as f:
        # Your original code (that reads line by line)
  • Pass it to gzip.decompress:

    extracted = gzip.decompress(r.content)
    for line in extracted.split(b"\n"):
        # Process each line

    or (combining with previous bullet)

    with io.BytesIO(gzip.decompress(r.content)) as f:
        # Your original code (that reads line by line)
  • Save it into a file, and pass its name to gzip.open (note that this is very slow, and can introduce other potential issues, as pointed out by @Aran-Fey):

    file_name = "content.gzip"
    with open(file_name, "wb") as f:
        f.write(r.content)
    with gzip.open(file_name, 'rb') as f:
        # Your original code (that reads line by line)
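
For reference, here is a minimal self-contained sketch of the in-memory approach (the first bullet). It runs without a network request: the payload is made-up sample text compressed locally as a stand-in for r.content. The helper name iter_lines and the UTF-8 encoding are assumptions for illustration.

```python
import gzip
import io

# Stand-in for r.content: gzip-compressed bytes built locally
# (the sample text is made up for this example).
payload = gzip.compress(b"first line\nsecond line\nthird line\n")

def iter_lines(compressed: bytes, encoding: str = "utf-8"):
    """Yield decoded lines from gzip-compressed bytes held in memory."""
    # BytesIO gives gzip.open a file object, so the bytes are treated
    # as data rather than as a filename.
    with gzip.open(io.BytesIO(compressed), mode="rt", encoding=encoding) as f:
        for line in f:
            yield line.rstrip("\n")

lines = list(iter_lines(payload))
print(lines)
```

Opening in "rt" mode yields str lines; with "rb" (the default) the loop yields bytes lines instead, which is closer to the original code.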