Ad

Read Xls File In Pandas / Python: Unsupported Format, Or Corrupt File: Expected BOF Record; Found B'\xef\xbb\xbf

- 1 answer

I am trying to open an xls file (with only one tab) into a pandas dataframe.

It is a file that i can normally read in excel or excel for the web, in fact here is the raw file itself: https://www.dropbox.com/scl/fi/zbxg8ymjp8zxo6k4an4dj/product-screener.xls?dl=0&rlkey=3aw7whab78jeexbdkthkjzkmu .

I notice that the top two rows have merged cells and so do some of the columns.

I have tried several methods (from stack), which all fail.

# method 1 - read excel
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_excel(file)
print(df)

error: Excel file format cannot be determined, you must specify an engine manually.

# method 2 - pip install xlrd and use engine
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_excel(file, engine='xlrd')
print(df)

error: Unsupported format, or corrupt file: Expected BOF record; found b'\xef\xbb\xbf<?xml'

# method 3 - rename to xlsx and open with openpyxl
file = "C:\\Users\\admin\\Downloads\\product-screener.xlsx"
df = pd.read_excel(file, engine='openpyxl')
print(df)

error: File is not a zip file (possibly converting, as opposed to renaming, is an option).

# method 4 - use read_xml
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_xml(file)
print(df)

this method actually yields a result, but produces a DataFrame that has no meaning in relation to the sheet. presumably one needs to interpret the xml (seems complex) ?

   Style       Name  Table
0    NaN       None    NaN
1    NaN  All funds    NaN


# method 5 - use read_table
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_table(file)
print(df)

This method reads the file into a one column (series) DataFrame. So how could one use this info to create a standard 2d DataFrame in the same shape as the xls file ?

0       <Workbook xmlns="urn:schemas-microsoft-com:off...
1                                                <Styles>
2                                 <Style ss:ID="Default">
3                          <Alignment Horizontal="Left"/>
4                                                </Style>
...                                                   ...
226532                                            </Cell>
226533                                             </Row>
226534                                           </Table>
226535                                       </Worksheet>
226536                                        </Workbook>



# method 5 - use read_html
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_html(file)
print(df)

this returns a blank list [] whereas one might have expected at least a list of DataFrames.

So the question is what is the easiest method to read this file into a dataframe (or similar usable format) ?

Ad

Answer

Not a complete solution but it should get you started. The "xls" file is actually a plain xml file in the SpreadsheetML format. Change the file extension to .xml an view it in your internet browser, the structure (at least of the give file) is rather straightforward.

The following reads the data contents into a pandas DataFrame:

import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse('product-screener.xls')
root = tree.getroot()

data = [[c[0].text for c in r] for r in root[1][0][2:]]
types = [c[0].get('{urn:schemas-microsoft-com:office:spreadsheet}Type') for c in root[1][0][2]]

df = pd.DataFrame(data)
df = df.replace('-', None)
for c in df.columns:
    if types[c] == 'Number':
        df[c] = pd.to_numeric(df[c])
    elif types[c] == 'DateTime':
        df[c] = pd.to_datetime(df[c])

Getting the column names from rows 0 and 1 is a bit more involved due to the merged cells - I leave it as an exercise for the reader 😊.

Ad
source: stackoverflow.com
Ad