Problem With Comparing 2 DataFrames In Python, Should Exclude All Duplicates, But Works Improperly

I'm developing a connector from Google Analytics to a SQL Server database and have a problem with duplicate values.

First, the script parses a nested dict with the GA accounts config, converts each API response to a pandas DataFrame, and stores all responses in a list. After that, the current SQL table with all GA data is fetched, and a loop compares the new values (from the GA API) with the current values (in the SQL table).

But for some reason, when comparing these 2 DataFrames, all duplicates are preserved.

I would be really happy if someone could help.

A nested dict with the config for making a GA API request:


data_test = {
    'view_id_111': {'view_id': '111',
                    'start_date': '2019-08-01',
                    'end_date': '2019-09-01',
                    'metrics': [{'expression': 'ga:sessions'}, {'expression': 'ga:users'}],
                    'dimensions': [{'name': 'ga:country'}, {'name': 'ga:userType'}, {'name': 'ga:date'}]},

    'view_id_222': {'view_id': '222',
                    'start_date': '2019-08-01',
                    'end_date': '2019-09-01',
                    'metrics': [{'expression': 'ga:sessions'}, {'expression': 'ga:users'}],
                    'dimensions': [{'name': 'ga:country'}, {'name': 'ga:date'}]},

    'view_id_333': {'view_id': '333',
                    'start_date': '2019-01-01',
                    'end_date': '2019-05-01',
                    'metrics': [{'expression': 'ga:sessions'}, {'expression': 'ga:users'}],
                    'dimensions': [{'name': 'ga:country'}, {'name': 'ga:date'}]}
}

  1. Send a request to the Google API, convert it to a DataFrame, and store the values in a list:

responses = []

for k, v in data_test.items():
    
    sample_request = {
        'viewId': v['view_id'],
        'dateRanges': {
            'startDate': v['start_date'],
            'endDate': v['end_date']
        },
        'metrics': v['metrics'],
        'dimensions': v['dimensions']
    }
    
    response = analytics.reports().batchGet(
        body={
            'reportRequests': sample_request
        }).execute()
    
    n_response = print_response_new_test(response)
    responses.append(n_response)

  2. Fetch the current SQL table with GA data:
def get_current_sql_gadata_table():
    global sql_table_current_gadata
    sql_table_current_gadata = pd.read_sql('SELECT * FROM Table', con=conn)
    sql_table_current_gadata['date'] = pd.to_datetime(sql_table_current_gadata['date'])
    return sql_table_current_gadata
  3. Finally, compare the 2 DataFrames and, if there are any differences, update the SQL table:

def compare_df_gadata():

    for report in responses:
        # True only if the current SQL table and the new report are identical
        response = sql_table_current_gadata.equals(report)
        if not response:
            compared_dfs = pd.concat([sql_table_current_gadata, report], sort=False)
            compared_dfs.drop_duplicates(keep=False, inplace=True)

            # sql params in sqlalchemy
            params = urllib.parse.quote_plus(...)  # placeholder, real ODBC params omitted
            engine = create_engine('mssql+pyodbc:///?odbc_connect={}'.format(params))

            # insert new values to the sql table
            compared_dfs.to_sql('Table', con=engine, if_exists='append', index=False)
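
For context, my understanding is that drop_duplicates(keep=False) only removes rows that are identical in every column (dtypes included), so any mismatch keeps both copies. A minimal sketch of that behaviour on made-up data (these are not my real GA columns):

import pandas as pd

old = pd.DataFrame({'country': ['US', 'DE'], 'sessions': [10, 20]})
new = pd.DataFrame({'country': ['US', 'DE'], 'sessions': [10, 25]})  # the DE row changed

both = pd.concat([old, new], sort=False)
# keep=False drops every row that occurs more than once,
# so only the rows that differ between the two frames survive
diff = both.drop_duplicates(keep=False)
print(diff)  # the matching US rows are gone, both DE rows remain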
          

I've also tried merging the 2 tables, but the result is the same. Maybe it's more reasonable to do the check on the SQL Server side, in Management Studio?

This doesn't work properly either:

df_outer = pd.merge(sql_table_current_gadata, report, on=None, how='left', sort=True)
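
I also looked at an anti-join with merge's indicator flag, which (if I understand it right) keeps only the rows that exist in the new report but not in the SQL table. A rough sketch with placeholder columns, not my real schema:

import pandas as pd

sql_rows = pd.DataFrame({'country': ['US', 'DE'], 'sessions': [10, 20]})
api_rows = pd.DataFrame({'country': ['US', 'DE', 'FR'], 'sessions': [10, 20, 5]})

# indicator=True adds a _merge column saying where each row came from
merged = api_rows.merge(sql_rows, how='left', indicator=True)
new_only = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(new_only)  # only the FR row, which is not in the SQL table yet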

UPDATE

I checked again with the concat function, and it looks like the problem is in the index.

Originally there are 240 rows (the table had already grown to 960 with duplicates, so I just cleaned the SQL table and ran the script again).

I have 3 GA accounts, and the current SQL table is made up of their data: 72 rows + 13 rows + 154 rows + header = 240 rows.

When running the script again, comparing with pd.concat and storing the result in a DataFrame (compared_dfs) instead of sending it to the database, it contains the 154 rows from the last request to the GA API.

I tried to reset the index here:

if not response:
    compared_dfs = pd.concat([sql_table_current_gadata, report], sort=False)
    compared_dfs.drop_duplicates(keep=False, inplace=True)
    compared_dfs.reset_index(inplace=True)

But as a result, the old index was added as an additional column in compared_dfs.

Resulting DF:

It shows 2 index columns, one from the SQL table and another from pandas.
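
If I read the docs right, reset_index keeps the old index as a new column unless drop=True is passed, which would explain the extra column. A small sketch of what I mean:

import pandas as pd

df = pd.DataFrame({'sessions': [10, 20]}, index=[5, 7])
print(df.reset_index())           # the old index becomes an 'index' column
print(df.reset_index(drop=True))  # the old index is discarded instead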


Answer

Your question is detailed but quite hard to follow. I would first ask if you are sure about your indexes; could you try merging on specific columns to see if that solves the problem? I'm focusing on the pandas part first, as it seems to be the focus of your question.

import pandas as pd
import numpy as np

merge = True
concat = False

anp = np.ones((2, 5))
anp[1, 1] = 3
anp[1, 4] = 3
bnp = np.ones((1, 5))
bnp[0, 1] = 4  # use 4 to make it different, also works with nan
bnp[0, 4] = 4  # use 4 to make it different, also works with nan
a = pd.DataFrame(anp)
b = pd.DataFrame(bnp)
if merge:
    a.rename(columns=dict(zip(range(5), ['a', 'b', 'c', 'd', 'e'])), inplace=True)
    b.rename(columns=dict(zip(range(5), ['a', 'b', 'c', 'd', 'e'])), inplace=True)
    # choose suitable and meaningful column(s) for your merge (do you have any id column etc.?)
    a = pd.merge(a, b, how='outer', copy=False, on=['a', 'c', 'd', 'e'])
    # check the merged result
    print(a)

if concat:
    # can use ignore_index or pass keys to maintain distinction
    c = pd.concat((a, b), axis=0, join='outer', keys=['a', 'b'])
    print(c)
    c.drop_duplicates(inplace=True)
    print(c)
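
If you stay with concat, the keys also give you a way to keep only the rows that came from the new frame after the overlap has been dropped. Something along these lines (toy frames, adjust the columns to yours):

import pandas as pd

a = pd.DataFrame({'x': [1.0, 2.0]})          # existing rows
b = pd.DataFrame({'x': [1.0, 3.0]})          # new batch, one row overlaps

c = pd.concat((a, b), keys=['old', 'new'])   # keys mark where each row came from
c = c.drop_duplicates(keep=False)            # rows present in both frames disappear
print(c.loc['new'])                          # only the genuinely new row (x == 3.0)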
