Read Xls File In Pandas / Python: Unsupported Format, Or Corrupt File: Expected BOF Record; Found B'\xef\xbb\xbf
I am trying to open an xls
file (with only one tab) into a pandas dataframe.
It is a file that i can normally read in excel or excel for the web, in fact here is the raw file itself: https://www.dropbox.com/scl/fi/zbxg8ymjp8zxo6k4an4dj/product-screener.xls?dl=0&rlkey=3aw7whab78jeexbdkthkjzkmu .
I notice that the top two rows have merged cells and so do some of the columns.
I have tried several methods (from stack), which all fail.
# method 1 - read excel
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_excel(file)
print(df)
error: Excel file format cannot be determined, you must specify an engine manually.
# method 2 - pip install xlrd and use engine
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_excel(file, engine='xlrd')
print(df)
error: Unsupported format, or corrupt file: Expected BOF record; found b'\xef\xbb\xbf<?xml'
# method 3 - rename to xlsx and open with openpyxl
file = "C:\\Users\\admin\\Downloads\\product-screener.xlsx"
df = pd.read_excel(file, engine='openpyxl')
print(df)
error: File is not a zip file
(possibly converting, as opposed to renaming, is an option).
# method 4 - use read_xml
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_xml(file)
print(df)
this method actually yields a result, but produces a DataFrame that has no meaning in relation to the sheet. presumably one needs to interpret the xml (seems complex) ?
Style Name Table
0 NaN None NaN
1 NaN All funds NaN
# method 5 - use read_table
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_table(file)
print(df)
This method reads the file into a one column (series) DataFrame. So how could one use this info to create a standard 2d DataFrame in the same shape as the xls file ?
0 <Workbook xmlns="urn:schemas-microsoft-com:off...
1 <Styles>
2 <Style ss:ID="Default">
3 <Alignment Horizontal="Left"/>
4 </Style>
... ...
226532 </Cell>
226533 </Row>
226534 </Table>
226535 </Worksheet>
226536 </Workbook>
# method 5 - use read_html
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_html(file)
print(df)
this returns a blank list []
whereas one might have expected at least a list of DataFrames.
So the question is what is the easiest method to read this file into a dataframe (or similar usable format) ?
Answer
Not a complete solution but it should get you started. The "xls"
file is actually a plain xml
file in the SpreadsheetML
format. Change the file extension to .xml
an view it in your internet browser, the structure (at least of the give file) is rather straightforward.
The following reads the data contents into a pandas DataFrame:
import pandas as pd
import xml.etree.ElementTree as ET
tree = ET.parse('product-screener.xls')
root = tree.getroot()
data = [[c[0].text for c in r] for r in root[1][0][2:]]
types = [c[0].get('{urn:schemas-microsoft-com:office:spreadsheet}Type') for c in root[1][0][2]]
df = pd.DataFrame(data)
df = df.replace('-', None)
for c in df.columns:
if types[c] == 'Number':
df[c] = pd.to_numeric(df[c])
elif types[c] == 'DateTime':
df[c] = pd.to_datetime(df[c])
Getting the column names from rows 0 and 1 is a bit more involved due to the merged cells - I leave it as an exercise for the reader 😊.
Related Questions
- → What are the pluses/minuses of different ways to configure GPIOs on the Beaglebone Black?
- → Django, code inside <script> tag doesn't work in a template
- → React - Django webpack config with dynamic 'output'
- → GAE Python app - Does URL matter for SEO?
- → Put a Rendered Django Template in Json along with some other items
- → session disappears when request is sent from fetch
- → Python Shopify API output formatted datetime string in django template
- → Can't turn off Javascript using Selenium
- → WebDriver click() vs JavaScript click()
- → Shopify app: adding a new shipping address via webhook
- → Shopify + Python library: how to create new shipping address
- → shopify python api: how do add new assets to published theme?
- → Access 'HTTP_X_SHOPIFY_SHOP_API_CALL_LIMIT' with Python Shopify Module