Efficient and elegant way to fill values in a pandas column within each group

I have a DataFrame like the one below:

import pandas as pd

df_new = pd.DataFrame(
{
    'person_id': [1, 1, 3, 3, 5, 5],
    'obs_date': ['12/31/2007', 'NA-NA-NA NA:NA:NA', 'NA-NA-NA NA:NA:NA', '11/25/2009', '10/15/2019', 'NA-NA-NA NA:NA:NA']
})

What I would like to do is replace the NA-type rows with the actual date value from the same person_id group. I tried the following:

m1 = df_new['obs_date'].str.contains(r'^\d')  # raw string avoids an invalid-escape warning
df_new['obs_date'] = df_new.groupby((m1).cumsum())['obs_date'].transform('first')

But this gives an unexpected output: the row at index 2 (person_id = 3) is filled with 12/31/2007 from the first group (person_id = 1), when it should have been 11/25/2009 from its own group.
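
To see why the attempt crosses person_id boundaries, it helps to print the grouping key that the cumulative sum produces. The following is a minimal sketch; the names df_demo, is_date and block_id are illustrative and not part of the original code.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'person_id': [1, 1, 3, 3, 5, 5],
    'obs_date': ['12/31/2007', 'NA-NA-NA NA:NA:NA', 'NA-NA-NA NA:NA:NA',
                 '11/25/2009', '10/15/2019', 'NA-NA-NA NA:NA:NA'],
})

# True where the value starts with a digit, i.e. looks like a real date
is_date = df_demo['obs_date'].str.contains(r'^\d')

# Every real date starts a new block; placeholder rows inherit the block
# of the closest real date above them, regardless of person_id
block_id = is_date.cumsum()

print(pd.DataFrame({'person_id': df_demo['person_id'], 'block_id': block_id}))
```

The row at index 2 belongs to person_id 3 but lands in the same block as the rows of person_id 1, because the key is built from row order alone and never looks at person_id.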

How can I get the expected output, where every NA row takes the actual date from its own person_id group?

Any elegant and efficient solution would be helpful, as I am dealing with more than a million records.

Answer

First use to_datetime with errors='coerce' to convert the non-datetime values to missing values (NaT). Then use GroupBy.transform with 'first', which takes the first non-missing value in each person_id group and broadcasts it to every row of that group:

df_new['obs_date'] = pd.to_datetime(df_new['obs_date'], format='%m/%d/%Y', errors='coerce')
df_new['obs_date'] = df_new.groupby('person_id')['obs_date'].transform('first')
#alternative - minimal value per group
#df_new['obs_date'] = df_new.groupby('person_id')['obs_date'].transform('min')

print (df_new)
   person_id   obs_date
0          1 2007-12-31
1          1 2007-12-31
2          3 2009-11-25
3          3 2009-11-25
4          5 2019-10-15
5          5 2019-10-15
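
As a quick check of the first step, the sketch below parses a few of the sample strings on their own. It only assumes the same format string as above; raw and parsed are illustrative names.

```python
import pandas as pd

raw = pd.Series(['12/31/2007', 'NA-NA-NA NA:NA:NA', '11/25/2009'])

# Strings that do not match the format become NaT instead of raising an error
parsed = pd.to_datetime(raw, format='%m/%d/%Y', errors='coerce')

print(parsed)
print(parsed.isna().tolist())
```

Once the placeholders are real missing values, the group-wise 'first' and 'min' aggregations skip them, which is what makes the transform step work.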

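Another idea is to use DataFrame.sort_values with GroupBy.ffill. Sorting places the missing values last within each group, so a forward fill copies a real date into them: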

df_new['obs_date'] = pd.to_datetime(df_new['obs_date'], format='%m/%d/%Y', errors='coerce')
df_new['obs_date'] = (df_new.sort_values(['person_id','obs_date'])
                            .groupby('person_id')['obs_date']
                            .ffill())

print (df_new)
   person_id   obs_date
0          1 2007-12-31
1          1 2007-12-31
2          3 2009-11-25
3          3 2009-11-25
4          5 2019-10-15
5          5 2019-10-15
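
On the sample data all three variants give the same result, because each person has exactly one real date. They can differ when a group holds several dates. The sketch below uses made-up data (df_demo is hypothetical, not from the question) to show the difference.

```python
import pandas as pd

# Hypothetical data: person 1 has two different real dates
df_demo = pd.DataFrame({
    'person_id': [1, 1, 1, 3, 3],
    'obs_date': ['12/31/2007', 'NA-NA-NA NA:NA:NA', '01/15/2005',
                 'NA-NA-NA NA:NA:NA', '11/25/2009'],
})
df_demo['obs_date'] = pd.to_datetime(df_demo['obs_date'],
                                     format='%m/%d/%Y', errors='coerce')

by_person = df_demo.groupby('person_id')['obs_date']

# Every row in the group gets the first non-missing date in row order
first_fill = by_person.transform('first')

# Every row in the group gets the earliest date
min_fill = by_person.transform('min')

# Only the missing rows change; they receive the latest date of the group
sorted_fill = (df_demo.sort_values(['person_id', 'obs_date'])
                      .groupby('person_id')['obs_date']
                      .ffill()
                      .sort_index())

print(pd.DataFrame({'first': first_fill, 'min': min_fill, 'sorted_ffill': sorted_fill}))
```

Which variant fits depends on whether existing dates should be overwritten: transform replaces every value in the group, while the sorted forward fill leaves existing dates untouched.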
source: stackoverflow.com