How to remove only consecutive duplicate rows and replace values based on another column with pandas?

import pandas as pd

df = pd.DataFrame(
    [
        ['China', '08/06/2022 20:00', '08/10/2022 20:00'],
        ['China', '8/13/2022 00:54', '8/14/2022 00:54'],
        ['China', '8/14/2022 00:54', '8/14/2022 12:54'],
        ['United Kingdom', '8/27/2022 06:36', '8/31/2022 21:08'],
        ['United Kingdom', '9/01/2022 21:08', '09/02/2022 21:38'],
        ['China', '09/04/2022 21:38', '09/06/2022 21:38']
    ],
    columns=['Country', 'Arrival', 'Departure']
)

Suppose I have the dataframe above. I want to remove the consecutive duplicate rows (rows that repeat the same Country one after another) and replace the departure time with the last duplicate's value, but keep other duplicates that are not consecutive.

After removing them, it should look like this:

df = pd.DataFrame(
    [
        ['China', '08/06/2022 20:00', '8/14/2022 12:54'],
        ['United Kingdom', '8/27/2022 06:36', '09/02/2022 21:38'],
        ['China', '09/04/2022 21:38', '09/06/2022 21:38']
    ],
    columns=['Country', 'Arrival', 'Departure']
)

Answer

IIUC, use:

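# label each run of consecutive identical countries with its own group number
group = df['Country'].ne(df['Country'].shift()).cumsum()
# per run: keep the first country and arrival, and the last departure
out = (df.groupby(group, as_index=False)
         .agg({'Country': 'first', 'Arrival': 'first', 'Departure': 'last'})
       )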

output:

          Country           Arrival         Departure
0           China  08/06/2022 20:00   8/14/2022 12:54
1  United Kingdom   8/27/2022 06:36  09/02/2022 21:38
2           China  09/04/2022 21:38  09/06/2022 21:38
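
To see why this works, it may help to look at the grouping key on its own. Below is a minimal sketch using the question's data. The names `changed` and `run` are my own, not from the original answer, and I drop the helper index with `reset_index(drop=True)` instead of `as_index=False`. That is a variant I assume behaves the same here, not the answer's exact code.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ['China', '08/06/2022 20:00', '08/10/2022 20:00'],
        ['China', '8/13/2022 00:54', '8/14/2022 00:54'],
        ['China', '8/14/2022 00:54', '8/14/2022 12:54'],
        ['United Kingdom', '8/27/2022 06:36', '8/31/2022 21:08'],
        ['United Kingdom', '9/01/2022 21:08', '09/02/2022 21:38'],
        ['China', '09/04/2022 21:38', '09/06/2022 21:38'],
    ],
    columns=['Country', 'Arrival', 'Departure'],
)

# True where a row's country differs from the row above it.
# shift() leaves NaN in the first row, so the first row always counts as a change.
changed = df['Country'].ne(df['Country'].shift())

# Running count of changes: rows in the same consecutive run share one label.
# 'run' is a hypothetical name, chosen so it does not clash with the 'Country' column.
run = changed.cumsum().rename('run')
print(run.tolist())  # [1, 1, 1, 2, 2, 3]

# Collapse each run into one row: first country and arrival, last departure.
out = (
    df.groupby(run)
      .agg({'Country': 'first', 'Arrival': 'first', 'Departure': 'last'})
      .reset_index(drop=True)
)
print(out)
```

The last China row gets label 3 rather than 1, which is why it is not merged with the first three China rows. A plain `drop_duplicates` on `Country` would not make that distinction.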
source: stackoverflow.com