Problems Reading a German CSV File in Python

- 1 answer

I have a German CSV file that I want to read with pd.read_csv.

Data:

The original file looks like this:

[image: screenshot of the original file]

So it has two columns (A, B), and the separator should be ';'.

Problem: When I run the command:

dataset = pd.read_csv('C:/Users/.../GermanNews/articles.csv',
                      encoding='utf-8', header=None, sep=';')

I get the error: ParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 3

Half-solution: I understand this could have several causes, but when I run the command:

dataset = pd.read_csv('C:/Users/.../GermanNews/articles.csv',
                      encoding='utf-8', header=None, sep='delimiter')

I get the following dataset back:

    0
0   Etat;Die ARD-Tochter Degeto hat sich verpflich...
1   Etat;App sei nicht so angenommen worden wie ge...
2   Etat;'Zum Welttag der Suizidprävention ist es ...
3   Etat;Mitarbeiter überreichten Eigentümervertre...
4   Etat;Service: Jobwechsel in der Kommunikations...

So I only get one column instead of the two desired columns.

Target: Any idea how to load the dataset correctly so that I get:

    0       1
0   Etat    Die ARD-Tochter Degeto hat sich verpflich...
1   Etat    App sei nicht so angenommen worden wie ge...

Hints/Tries:

When I search my data in Excel, I also do not find any ';' in it.

It seems that some lines have more than two columns (as you can see, for example, in lines 3 and 13 of my example).
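
One way to check this hint is to count the separators on each line before calling pd.read_csv. The following is only a minimal sketch: it uses a small in-memory sample in place of articles.csv, and both the sample text and the helper name lines_with_extra_fields are hypothetical.

```python
import io

# Hypothetical sample standing in for articles.csv; line 3 has an extra ';'.
sample = io.StringIO(
    "Etat;Die ARD-Tochter Degeto hat sich verpflichtet\n"
    "Etat;App sei nicht so angenommen worden\n"
    "Etat;Service: Jobwechsel; neue Leitung\n"
)


def lines_with_extra_fields(handle, sep=";", expected=2):
    """Return (line_number, field_count) for lines with more than `expected` fields."""
    offenders = []
    for number, line in enumerate(handle, start=1):
        # N separators on a line mean N + 1 fields.
        count = line.rstrip("\n").count(sep) + 1
        if count > expected:
            offenders.append((number, count))
    return offenders


offenders = lines_with_extra_fields(sample)
print(offenders)
```

For the real file, the same helper could be called on a handle from open(path, encoding='utf-8'). Any line it reports contains a ';' inside the text column, which is what makes the parser see 3 fields where it expects 2.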

Answer

One possible solution is to read the file into a one-column DataFrame by using a separator that does not occur in the data, such as delimiter, and then use Series.str.split with the n parameter and expand=True to build a new DataFrame:

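import pandas as pd

# 'delimiter' does not occur in the data, so every line lands in a single column.
# A multi-character separator is treated as a regex and needs the Python engine.
dataset = pd.read_csv('C:/Users/.../GermanNews/articles.csv',
                      encoding='utf-8', header=None, sep='delimiter',
                      engine='python')

# More general: use a character that does NOT exist in the data, such as the yen sign ¥
#dataset = pd.read_csv('C:/Users/.../GermanNews/articles.csv',
#                      encoding='utf-8', header=None, sep='¥')

# Split on the first ';' only, so later semicolons stay in the text column
df = dataset[0].str.split(';', n=1, expand=True)
df.columns = ['A','B']
print(df)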
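
To try the approach without the original file, here is a self-contained sketch under stated assumptions: the in-memory sample below is hypothetical and only mimics the shape of articles.csv, with one line that contains a second ';'.

```python
import io

import pandas as pd

# Hypothetical two-line sample; the second line has a ';' inside the text.
raw = io.StringIO(
    "Etat;Die ARD-Tochter Degeto hat sich verpflichtet\n"
    "Etat;Service: Jobwechsel; neue Leitung\n"
)

# 'delimiter' never occurs in the sample, so each line becomes one cell.
# engine='python' is set explicitly because multi-character separators
# are interpreted as regular expressions by the Python engine.
dataset = pd.read_csv(raw, header=None, sep="delimiter", engine="python")

# n=1 splits at the first ';' only; expand=True returns a DataFrame.
df = dataset[0].str.split(";", n=1, expand=True)
df.columns = ["A", "B"]
print(df)
```

Because of n=1, the second semicolon stays inside column B instead of creating a third column.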
source: stackoverflow.com