Groupby And Combine And Aggregate Multiple Groups Into One Single Group Based On Condition
I have a hierarchical time series pandas DataFrame that involves multiple top-level entities. I want to combine and aggregate series that have fewer than 12 data points (i.e., the count of date is less than 12): combine them so that the result has more than 12 data points, and aggregate the values occurring on the same date. Note that I don't want to combine series belonging to different level0 groups, as they are unrelated (e.g., A and Z are two distinct entities). In other words, I would like to perform the combination and aggregation on level2 within each level0 group. The original dataset after groupby is like this:
    grp = df.groupby(['level0', 'level1', 'level2', 'date']).sum()
    # sum the values within the same date,
    # because each level2 entity can have multiple records occurring on the same date

                                     values
    level0 level1 level2 date
    A      AA     AA_1   2006-10-31     300   # assume AA_1 has more than 12 data points, so I don't want to modify it
                         2006-11-30     220
                         2006-12-31     415
                         ...            ...
                         2007-04-30      19
                         2007-05-31      77
                         2007-08-31     463
                  AA_2   2006-04-30     600   # assume AA_2 has fewer than 12 data points
                         2006-05-31    2600
                         2007-09-30    6600
           AB     AB_1   2006-04-30     100   # assume AB_1 has fewer than 12 data points
                         2006-08-31     200
                         2007-06-30     300
                         2007-09-30     400
    ...    ...    ...    ...            ...
    Z      ZZ     ZZ_9   2006-04-30    3680   # assume ZZ_9 has fewer than 12 data points
                         2006-09-30     277
                         2007-03-31    1490
                         2007-09-30     289
                         2007-10-31     387
I assume that both AA_2 and AB_1, which belong to group A, have fewer than 12 data points, so I want to combine them. They have two duplicated dates, so for those two dates I want to sum the values. After getting the new hierarchical group, I also want to drop the original ones.
ZZ_9 also has fewer than 12 data points, but I won't combine it with the other two because ZZ_9 belongs to group Z.
The desired output is like this:
                                      values
    level0 level1  level2  date
    A      AA      AA_1    2006-10-31     300
                           2006-11-30     220
                           2006-12-31     415
                           ...            ...
                           2007-04-30      19
                           2007-05-31      77
                           2007-08-31     463
           agg_lv1 agg_lv2 2006-04-30     700   (=600+100)
                           # assume we have more than 12 data points now,
                           # as I don't want the code to be lengthy
                           2006-05-31    2600
                           2006-08-31     200
                           ...            ...
                           2007-06-30     300
                           2007-09-30    7000   (=6600+400)
    ...    ...     ...     ...            ...
    Z      ZZ      ZZ_9    2006-04-30    3680
                           2006-09-30     277
                           2007-03-31    1490
                           2007-09-30     289
                           2007-10-31     387
It's alright that each level0 entity has the same names for the new aggregated levels (i.e., agg_lv1 and agg_lv2) because, as mentioned, level0 entities are unrelated and I want to keep the naming simple.
How can this be done?
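You can do this in multiple steps. First, partition the dataframe into two parts, where the first one contains all rows that need to be aggregated (series with fewer than 12 time points, in a level0 group that has more than one level1 value) and the second one contains everything else: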
    grp = grp.reset_index()
    # number of distinct level1 values inside each level0 group
    grp['nunique'] = grp.groupby(['level0'])['level1'].transform('nunique')

    # partition
    grp_small = grp.loc[grp['nunique'] > 1].groupby(['level0', 'level1', 'level2']).filter(lambda x: len(x) < 12)
    idx_small = grp_small.index
    # use Index.difference: recent pandas versions reject a set as a .loc indexer
    grp_large = grp.loc[grp.index.difference(idx_small)]
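To see why the partition works, here is a minimal sketch on a made-up toy frame (the data and the threshold of 3 instead of 12 are assumptions chosen only to keep the example short). It shows that `filter` keeps the original row labels of the groups that pass the condition, which is what makes it possible to select the complement by index afterwards.

```python
import pandas as pd

# Hypothetical toy data: AA_1 has 3 rows, AB_1 has 2 rows.
toy = pd.DataFrame({
    'level0': ['A'] * 5,
    'level1': ['AA', 'AA', 'AA', 'AB', 'AB'],
    'level2': ['AA_1', 'AA_1', 'AA_1', 'AB_1', 'AB_1'],
    'values': [1, 2, 3, 4, 5],
})
keys = ['level0', 'level1', 'level2']

# Number of rows per series.
sizes = toy.groupby(keys).size()

# Keep only the rows of series with fewer than 3 rows;
# the surviving rows keep their original index labels.
small = toy.groupby(keys).filter(lambda g: len(g) < 3)

# Everything that was not selected above.
large = toy.loc[toy.index.difference(small.index)]

print(sizes)
print(small.index.tolist(), large.index.tolist())
```

Calling `groupby(...).size()` first is also a quick way to check which series fall under the threshold before changing anything.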
Now we can apply the sum aggregation on the grp_small dataframe while leaving grp_large as it is.
    # sum only the numeric column, so the string columns are not concatenated
    grp_small = grp_small.groupby(['level0', 'date'], as_index=False)['values'].sum()
    grp_small['level1'] = 'agg_lv1'
    grp_small['level2'] = 'agg_lv2'
And finally, we concat the two dataframes together and apply some final postprocessing:
    df = pd.concat([grp_large, grp_small], ignore_index=True)
    df = df.drop(columns='nunique').set_index(['level0', 'level1', 'level2', 'date']).sort_index()
Result with the given data (with added rows to the first group during computation):
                                      values
    level0 level1  level2  date
    A      AA      AA_1    2006-10-31     300
                           2006-11-30     220
                           2006-12-31     415
                           ...            ...
                           2007-04-30      19
                           2007-05-31      77
                           2007-08-31     463
           agg_lv1 agg_lv2 2006-04-30     700
                           2006-05-31    2600
                           2006-08-31     200
                           2007-06-30     300
                           2007-09-30    7000
    ...    ...     ...     ...            ...
    Z      ZZ      ZZ_9    2006-04-30    3680
                           2006-09-30     277
                           2007-03-31    1490
                           2007-09-30     289
                           2007-10-31     387
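For reference, the steps can be put together in one self-contained sketch. The function name `combine_small_series`, the `min_points` parameter, and the sample records are assumptions made for illustration; the threshold is set to 3 rather than 12 so that a handful of rows is enough to exercise every branch. It uses boolean masks instead of index arithmetic, which is a variation on the approach above rather than the same code.

```python
import pandas as pd

def combine_small_series(df, min_points=12):
    """Merge series with fewer than `min_points` dates within each level0."""
    keys = ['level0', 'level1', 'level2']
    # One row per (series, date); duplicate records on a date are summed.
    grp = df.groupby(keys + ['date'], as_index=False)['values'].sum()
    # Dates per series, and distinct level1 values per level0 group.
    n_dates = grp.groupby(keys)['date'].transform('size')
    n_level1 = grp.groupby('level0')['level1'].transform('nunique')
    small = (n_dates < min_points) & (n_level1 > 1)

    # Collapse the small series of each level0 group into one series.
    merged = grp[small].groupby(['level0', 'date'], as_index=False)['values'].sum()
    merged['level1'] = 'agg_lv1'
    merged['level2'] = 'agg_lv2'

    out = pd.concat([grp[~small], merged], ignore_index=True)
    return out.set_index(keys + ['date']).sort_index()

# Hypothetical sample records, loosely modelled on the question's data.
records = [
    ('A', 'AA', 'AA_1', '2006-10-31', 300),
    ('A', 'AA', 'AA_1', '2006-11-30', 220),
    ('A', 'AA', 'AA_1', '2006-12-31', 415),
    ('A', 'AA', 'AA_2', '2006-04-30', 600),
    ('A', 'AA', 'AA_2', '2007-09-30', 6600),
    ('A', 'AB', 'AB_1', '2006-04-30', 100),
    ('A', 'AB', 'AB_1', '2007-09-30', 400),
    ('Z', 'ZZ', 'ZZ_9', '2006-04-30', 3680),
    ('Z', 'ZZ', 'ZZ_9', '2006-09-30', 277),
]
df = pd.DataFrame(records, columns=['level0', 'level1', 'level2', 'date', 'values'])
df['date'] = pd.to_datetime(df['date'])

result = combine_small_series(df, min_points=3)
print(result)
```

As in the answer, a level0 group with only one level1 value (Z here) is left untouched even when its series are short.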