
Multiple Errors Importing Huge CSV. How To Diagnose?


I have a CSV with over 1 million rows that I'm trying to import. Unfortunately, I can't share a sample of the data, but this is the code I'm using to import it:

transactions = pd.read_csv('bank_raw_data.csv',
                           sep=',',
                           error_bad_lines=False,
                           warn_bad_lines=True,
                           engine='python',
                           encoding='ISO-8859-1',
                           escapechar='\\',
                           skiprows=[i for i in range(1,263)])
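As an aside, `error_bad_lines` and `warn_bad_lines` were deprecated in pandas 1.3 and removed in 2.0; the replacement is the single `on_bad_lines` parameter. A minimal sketch of the newer spelling, assuming pandas ≥ 1.3 (the inline CSV stands in for the real file):

```python
import io

import pandas as pd

# on_bad_lines replaces error_bad_lines/warn_bad_lines (pandas >= 1.3).
# 'skip' drops malformed rows silently; 'warn' skips them and emits a warning.
csv_text = 'a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n'  # second data row has 4 fields
df = pd.read_csv(io.StringIO(csv_text), on_bad_lines='skip')
```

With `'skip'`, the row with the extra field is dropped and the remaining two rows parse normally.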

I skip rows that have errors; below is a sample of the errors I'm getting:

Skipping line 1294103: ',' expected after '"'
Skipping line 1300423: field larger than field limit (131072)
Skipping line 1300695: NULL byte detected. This byte cannot be processed in Python's native csv library at the moment, so please pass in engine='c' instead
Skipping line 1294273: Expected 21 fields in line 1294273, saw 31

Unfortunately, I can't check the CSV in Excel due to its size, so I don't know what's going on in line 12455 etc.

Any advice on how to diagnose these errors?

I have also tried encoding='cp1252', but I get the error: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 4082: character maps to <undefined>

The reason I tried cp1252 as the encoding is this:

with open('bank_raw_data.csv') as f:
    print(f)

<_io.TextIOWrapper name='bank_raw_data.csv' mode='r' encoding='cp1252'>

But it fails.
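(That encoding='cp1252' shown by print(f) is just the platform's default locale encoding, not a detection of the file's actual encoding.) One way to diagnose the UnicodeDecodeError is to open the file in binary mode and look at the raw bytes around the offset the traceback reports (4082 here). A minimal sketch; the helper name is mine, and the demo writes a throwaway file containing the problem byte instead of the real CSV:

```python
import os
import tempfile


def bytes_around(path, offset, context=20):
    """Return the raw bytes surrounding a decode failure for inspection."""
    with open(path, 'rb') as f:
        f.seek(max(0, offset - context))
        return f.read(2 * context)


# Demo: byte 0x9d is undefined in cp1252, which is exactly what the
# UnicodeDecodeError complains about.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'hello,\x9dworld')
    name = tmp.name
window = bytes_around(name, 6, context=5)
os.unlink(name)
```

Seeing the surrounding bytes usually tells you whether you have a stray Windows-1252 byte, a NULL byte, or genuinely binary junk embedded in the file.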


Answer

You can check the specific line through:

Powershell

Get-Content filename.csv | Select -Index x-1

Note that Select-Object's -Index is zero-based, so to read line 10 you'd write -Index 9.

Bash

awk 'NR==x{print; exit}' filename.csv

(the exit stops awk from scanning the rest of a million-row file once it has printed line x)
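If you'd rather stay in Python, the same lookup can be done lazily with itertools.islice, without loading the whole file. A sketch (the helper name is mine); it reads in binary mode so NULL bytes and encoding problems can't break the lookup itself, and the demo uses a throwaway file standing in for the real CSV:

```python
import os
import tempfile
from itertools import islice


def get_line(path, lineno):
    """Return line `lineno` (1-based) as raw bytes, streaming the file."""
    with open(path, 'rb') as f:
        return next(islice(f, lineno - 1, lineno), None)


# Demo on a small stand-in file
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'id,amount\n1,10\n2,20\n')
    name = tmp.name
line = get_line(name, 3)  # third line of the file
os.unlink(name)
```

Unlike the Select -Index example, this helper is 1-based, so get_line(path, 10) returns line 10 directly.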

source: stackoverflow.com