Problem With Comparing 2 DataFrames In Python, Should Exclude All Duplicates, But Works Unproperly
I'm developing a connector Google Analytics to SQL Server database and have a problem with duplicate values.
First, the script parses a nested dict with GA Accounts config, then converts it to pandas df and stores all responses in a list, after that a current SQL table with all GA Data is fetched and a loop comparing new values (from GA API) and current values (in sql table) is made.
But for some reason when comparing these 2 dfs all duplicates are preserved.
I would be really happy if someone could help.
A nested dict with config for making a GA API request
data_test = {
'view_id_111' : {'view_id': '111',
'start_date': '2019-08-01',
'end_date': '2019-09-01',
'metrics': [{'expression': 'ga:sessions'}, {'expression':'ga:users'}],
'dimensions': [{'name': 'ga:country'}, {'name': 'ga:userType'}, {'name': 'ga:date'}]},
'view_id_222' : {'view_id': '222',
'start_date': '2019-08-01',
'end_date': '2019-09-01',
'metrics': [{'expression': 'ga:sessions'}, {'expression':'ga:users'}],
'dimensions': [{'name': 'ga:country'}, {'name': 'ga:date'}]},
'view_id_333' : {'view_id': '333',
'start_date': '2019-01-01',
'end_date': '2019-05-01',
'metrics': [{'expression': 'ga:sessions'}, {'expression':'ga:users'}],
'dimensions': [{'name': 'ga:country'}, {'name': 'ga:date'}]}
}
- Send request to Google API, convert it to df and store values in a list
responses = []
for k, v in data_test.items():
sample_request = {
'viewId': v['view_id'],
'dateRanges': {
'startDate': v['start_date'],
'endDate': v['end_date']
},
'metrics': v['metrics'],
'dimensions': v['dimensions']
}
response = analytics.reports().batchGet(
body={
'reportRequests': sample_request
}).execute()
n_response=print_response_new_test(response)
responses.append(n_response)
- Fetch current SQL table with GA data
def get_current_sql_gadata_table():
global sql_table_current_gadata
sql_table_current_gadata = pd.read_sql('SELECT * FROM Table', con=conn)
sql_table_current_gadata['date'] = pd.to_datetime(sql_table_current_gadata['date'])
return sql_table_current_gadata
- Finally compare 2 DFs and if there are any differences between, update the SQL table
def compare_df_gadata():
for report in responses:
response=pd.DataFrame.equals(sql_table_current_gadata, report)
if response==False:
compared_dfs = pd.concat([sql_table_current_gadata, report], sort=False)
compared_dfs.drop_duplicates(keep=False, inplace=True)
#sql params in sqlalchemy
params = urllib.parse.quote_plus(#params)
engine = create_engine('mssql+pyodbc:///?odbc_connect={}'.format(params))
#insert new values to the sql table
compared_dfs.to_sql('Table', con=engine, if_exists='append', index=False)
I've tried also merging 2 tables, but the result is the same. Maybe it's more reasonable to make a check-in MS Studio?
Doesn't work properly also
df_outer = pd.merge(sql_table_current_gadata, report, on=None, how='left', sort=True)
UPDATE
I checked another time with concat function and it looks like the problem is in 'index'.
The original 240 rows (960 were already with duplicates, so just cleaned SQL table and run the script again).
I have 3 GA accounts from which the current SQL table consists: 72 rows + 13 rows + 154 rows + header = 240 rows.
And when running the script again, comparing with pd.concat and storing the result in a dataframe (compared_dfs) (not sending it to the database), it contains the 154 rows from the last request to GA API.
I tried to reset here:
if response==False:
compared_dfs = pd.concat([sql_table_current_gadata, report], sort=False)
compared_dfs.drop_duplicates(keep=False, inplace=True)
compared_dfs.reset_index(inplace=True)
But as a result it was added as an additional column in a compared_dfs
It shows 2 index columns, one from the SQL table and another from pandas
Answer
your question is detailed but quite articulated to follow. I would first ask if you are sure about your indexes, could you try merge on specific columns to see if this solves the problem? I'm focusing on the pandas part first as it seems the focus of your question.
import pandas as pd
import numpy as np
merge = True
concat = False
anp = np.ones((2, 5))
anp[1, 1] = 3
anp[1, 4] = 3
bnp = np.ones((1, 5))
bnp[0, 1] = 4 # use 4 to make it different, also works with nan
bnp[0, 4] = 4 # use 4 to make it different, also works with nan
a = pd.DataFrame(anp)
b = pd.DataFrame(bnp)
if merge:
a.rename(columns=dict(zip(range(5), ['a', 'b', 'c', 'd', 'e'])), inplace=True)
b.rename(columns=dict(zip(range(5), ['a', 'b', 'c', 'd', 'e'])), inplace=True)
# choose suitable and meaningful column(s) for your merge (do you have any id column etc.?)
a = pd.merge(a, b, how='outer', copy=False, on=['a', 'c', 'd', 'e'])
# che
print(a)
if concat:
# can use ignore_index or pass keys to maintain distiction
c = pd.concat((a, b), axis=0, join='outer', keys=['a', 'b'])
print(c)
c.drop_duplicates(inplace=True)
print(c)
Related Questions
- → What are the pluses/minuses of different ways to configure GPIOs on the Beaglebone Black?
- → Django, code inside <script> tag doesn't work in a template
- → React - Django webpack config with dynamic 'output'
- → GAE Python app - Does URL matter for SEO?
- → Put a Rendered Django Template in Json along with some other items
- → session disappears when request is sent from fetch
- → Python Shopify API output formatted datetime string in django template
- → Can't turn off Javascript using Selenium
- → WebDriver click() vs JavaScript click()
- → Shopify app: adding a new shipping address via webhook
- → Shopify + Python library: how to create new shipping address
- → shopify python api: how do add new assets to published theme?
- → Access 'HTTP_X_SHOPIFY_SHOP_API_CALL_LIMIT' with Python Shopify Module