
How To Transform A DataFrame To Get The Times Of Various Occurrences Of Events?


Given the following DataFrame:

+-------+-----+-------+-----+--------+---------------------------+
|  DID  | CID | Event | OID | Source |         TimeStamp         |
+-------+-----+-------+-----+--------+---------------------------+
| 25078 |  14 | QBT   |   0 | EMS    | 2019-10-15 10:54:35 +0000 |
| 25078 |  14 | NDOBT |   0 | EMS    | 2019-10-15 10:54:48 +0000 |
| 25078 |  14 | SBT   |   0 | EMS    | 2019-10-15 10:54:52 +0000 |
| 25078 |  14 | SBT-1 |   0 | ECS    | 2019-10-15 11:00:01 +0000 |
| 25078 |  14 | SBT-1 |   0 | ECS    | 2019-10-15 11:00:26 +0000 |
| 25078 |  14 | SBT-1 |   0 | ECS    | 2019-10-15 11:00:50 +0000 |
| 25078 |  14 | SBT   |   0 | EMS    | 2019-10-15T14:27:45       |
| 25078 |  14 | SBT   |   0 | EMS    | 2019-10-15T14:27:45       |
| 25078 |  14 | LSFA  |   0 | SPDLS  | 2019-10-15T14:28:16       |
| 25078 |  14 | LSFA  |   0 | SPDLS  | 2019-10-15T14:28:16       |
| 25078 |  14 | FEAR  |   0 | CBS    | 2019-10-15T14:28:18       |
| 25078 |  14 | FEAR  |   0 | CBS    | 2019-10-15T14:28:18       |
| 25078 |  14 | SBT   |   0 | EMS    | 2019-10-15T14:28:44       |
| 25078 |  14 | SBT   |   0 | EMS    | 2019-10-15T14:28:44       |
| 25078 |  14 | LSFA  |   0 | SPDLS  | 2019-10-15T14:30:55       |
| 25078 |  14 | LSFA  |   0 | SPDLS  | 2019-10-15T14:30:55       |
| 25078 |  14 | SBT   |   0 | EMS-1  | 2019-10-15T15:28:43       |
| 25078 |  14 | SBT   |   0 | EMS-1  | 2019-10-15T15:29:02       |
| 25078 |  14 | FEAR  |   0 | CBS    | 2019-10-15T15:30:51       |
| 25078 |  14 | FEAR  |   0 | CBS    | 2019-10-15T15:30:51       |
| 25078 |  14 | DBT   |   0 | RS     | 2019-10-15T15:44:23       |
| 25078 |  14 | QBT   |   0 | EMS-1  | 2019-10-15T16:02:16       |
+-------+-----+-------+-----+--------+---------------------------+

I want to get the first and last occurrences of certain events and sources, so that the final output looks something like this:

+-------+-----+---------------------+--------------------+---------------------+--------------------+---------------------------+---------------------------+---------------------------+---------------------+
|  DID  | CID |  Event-QBT-Last-DT  | Event-QBT-First-DT |  Event-SBT-Last-DT  | Event-SBT-First-DT |    Screen-ECS-First-DT    |    Screen-ECS-Last-DT     |      FirstTimeUsage       |   LastTime Usage    |
+-------+-----+---------------------+--------------------+---------------------+--------------------+---------------------------+---------------------------+---------------------------+---------------------+
| 25078 |  14 | 2019-10-15T16:02:16 | 10/15/19 10:54 AM  | 2019-10-15T15:29:02 | 10/15/19 10:54 AM  | 2019-10-15 11:00:01 +0000 | 2019-10-15 11:00:50 +0000 | 2019-10-15 10:54:35 +0000 | 2019-10-15T16:02:16 |
+-------+-----+---------------------+--------------------+---------------------+--------------------+---------------------------+---------------------------+---------------------------+---------------------+

How can I achieve this using pandas?
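
For reference, a minimal sketch that rebuilds a few representative rows of this frame (assuming TimeStamp is stored as plain strings, exactly as displayed):

import pandas as pd

# A handful of representative rows; the full frame follows the same pattern
df = pd.DataFrame({
    'DID': [25078] * 6,
    'CID': [14] * 6,
    'Event': ['QBT', 'NDOBT', 'SBT', 'SBT-1', 'SBT', 'QBT'],
    'OID': [0] * 6,
    'Source': ['EMS', 'EMS', 'EMS', 'ECS', 'EMS-1', 'EMS-1'],
    'TimeStamp': ['2019-10-15 10:54:35 +0000', '2019-10-15 10:54:48 +0000',
                  '2019-10-15 10:54:52 +0000', '2019-10-15 11:00:01 +0000',
                  '2019-10-15T15:29:02', '2019-10-15T16:02:16'],
})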


Answer

The idea is to filter rows with boolean indexing and DataFrame.isin to keep only the events of interest, get their first and last occurrences with GroupBy.agg using first and last, reshape with DataFrame.unstack, and finally flatten the resulting MultiIndex columns:

L = ['QBT','SBT']

# Keep only the selected events, take the last/first TimeStamp per group,
# then move Event into the column index
df1 = (df[df['Event'].isin(L)]
         .groupby(['OID','DID','CID','Event'])['TimeStamp']
         .agg([('Last-DT','last'), ('First-DT','first')])
         .unstack()
         .sort_index(axis=1, level=1))
# Flatten the (aggregation, Event) MultiIndex to names like 'Event-QBT-Last-DT'
df1.columns = [f'Event-{b}-{a}' for a, b in df1.columns]
#print (df1)

For the overall first and last occurrence, use the first solution with no filter, no Event in the groupby, and no unstack:

# Overall first/last TimeStamp per (OID, DID, CID), across all events
df2 = (df.groupby(['OID','DID','CID'])['TimeStamp']
         .agg([('FirstTimeUsage','first'), ('LastTime Usage','last')]))
#print (df2)
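
Note that first and last are positional, not chronological: they return whichever rows happen to come first and last within each group. If the frame is not already ordered by time, a sketch of parsing the mixed-format strings and sorting beforehand (format='mixed' assumes pandas >= 2.0; older versions infer mixed formats without it):

# Parse mixed formats such as '2019-10-15 10:54:35 +0000' and
# '2019-10-15T14:27:45' into UTC datetimes, then sort chronologically;
# do this once, before building df1 and df2
df['TimeStamp'] = pd.to_datetime(df['TimeStamp'], format='mixed', utc=True)
df = df.sort_values('TimeStamp')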

Last, join both frames together with DataFrame.join:

df = df1.join(df2).reset_index()
print (df)
   OID    DID  CID         Event-QBT-First-DT    Event-QBT-Last-DT  \
0    0  25078   14  2019-10-15 10:54:35 +0000  2019-10-15T16:02:16   

          Event-SBT-First-DT    Event-SBT-Last-DT             FirstTimeUsage  \
0  2019-10-15 10:54:52 +0000  2019-10-15T15:29:02  2019-10-15 10:54:35 +0000   

        LastTime Usage  
0  2019-10-15T16:02:16  

EDIT: To also process the Source column, generate df1 as before and repeat the same pattern for Source:

L = ['QBT','SBT']

df1 = (df[df['Event'].isin(L)]
         .groupby(['OID','DID','CID','Event'])['TimeStamp']
         .agg([('Last-DT','last'), ('First-DT','first')])
         .unstack()
         .sort_index(axis=1, level=1))
df1.columns = [f'Event-{b}-{a}' for a, b in df1.columns]
#print (df1)

L2 = ['ECS']
# Same pattern as df1, but keyed on Source instead of Event
df11 = (df[df['Source'].isin(L2)]
         .groupby(['OID','DID','CID','Source'])['TimeStamp']
         .agg([('Last-DT','last'), ('First-DT','first')])
         .unstack()
         .sort_index(axis=1, level=1))
df11.columns = [f'Screen-{b}-{a}' for a, b in df11.columns]

df2 = (df.groupby(['OID','DID','CID'])['TimeStamp']
         .agg([('FirstTimeUsage','first'), ('LastTime Usage','last')]))

And last, put everything together with concat:

df = pd.concat([df1, df11, df2], axis=1).reset_index()
print (df)
   OID    DID  CID         Event-QBT-First-DT    Event-QBT-Last-DT  \
0    0  25078   14  2019-10-15 10:54:35 +0000  2019-10-15T16:02:16   

          Event-SBT-First-DT    Event-SBT-Last-DT        Screen-ECS-First-DT  \
0  2019-10-15 10:54:52 +0000  2019-10-15T15:29:02  2019-10-15 11:00:01 +0000   

          Screen-ECS-Last-DT             FirstTimeUsage       LastTime Usage  
0  2019-10-15 11:00:50 +0000  2019-10-15 10:54:35 +0000  2019-10-15T16:02:16  
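
The repeated filter/group/unstack step can also be factored into a small helper when more columns need the same treatment. A sketch under that assumption (the helper name first_last is hypothetical, not part of the original answer):

def first_last(df, col, values, prefix):
    # Filter to the requested values, take the first/last TimeStamp per group,
    # then unstack `col` into the column index and flatten the labels
    out = (df[df[col].isin(values)]
             .groupby(['OID','DID','CID', col])['TimeStamp']
             .agg([('Last-DT','last'), ('First-DT','first')])
             .unstack()
             .sort_index(axis=1, level=1))
    out.columns = [f'{prefix}-{b}-{a}' for a, b in out.columns]
    return out

df1 = first_last(df, 'Event', ['QBT','SBT'], 'Event')
df11 = first_last(df, 'Source', ['ECS'], 'Screen')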
source: stackoverflow.com