
Change Value Of A Slice In Pandas Depending On The Number Of Rows In The Slice

I have a pandas dataframe that looks like this:

import pandas as pd

df = pd.DataFrame({'Timestamp': ['1642847484', '1642847484', '1642847484', '1642847484', '1642847487', '1642847487','1642847487','1642847487','1642847487','1642847487','1642847487','1642847487', '1642847489', '1642847489', '1642847489'],
                   'value': [11, 10, 14, 20, 3, 2, 9, 48, 5, 20, 12, 20, 56, 12, 8]})

The data is collected in batches, so multiple rows share the same timestamp. I need to index the dataframe by time, and for that the index values must be unique.

The problem, as you can see, is:

  • The step between timestamps varies
  • The number of rows per timestamp varies
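Both points can be checked directly on the sample data. This is a minimal sketch; `rows_per_batch` and `steps` are illustrative names, not part of the original code, and it assumes the timestamps are already sorted.

```python
import pandas as pd

df = pd.DataFrame({
    'Timestamp': ['1642847484'] * 4 + ['1642847487'] * 8 + ['1642847489'] * 3,
    'value': [11, 10, 14, 20, 3, 2, 9, 48, 5, 20, 12, 20, 56, 12, 8],
})

ts = df['Timestamp'].astype('int64')

# Number of rows sharing each timestamp: 4, 8 and 3 here
rows_per_batch = ts.groupby(ts).size()

# Seconds between consecutive distinct timestamps: 3 and 2 here
steps = ts.drop_duplicates().diff().dropna()

print(rows_per_batch)
print(steps)
```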

The approach I tried is:

  1. Multiply the timestamp by 1000 to get milliseconds
  2. Calculate the time between timestep i and the next timestep j: delta = j - i
  3. Count the number of rows n between i and j
  4. For each row between i and j, add rank * delta / n milliseconds, where rank is the row's 0-based position in the batch
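The steps above can be sketched without a Python loop by using groupby operations. This is only one possible reading: step 4 is taken as rank * delta / n with a 0-based rank, which is what the expected output implies. It also assumes the timestamps are sorted, and that the final batch, which has no successor, is spread over 1000 ms. That last figure and the names `LAST_DELTA_MS` and `spread` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'Timestamp': ['1642847484'] * 4 + ['1642847487'] * 8 + ['1642847489'] * 3,
    'value': [11, 10, 14, 20, 3, 2, 9, 48, 5, 20, 12, 20, 56, 12, 8],
})

# Step 1: seconds -> milliseconds
ms = df['Timestamp'].astype('int64') * 1000

# Step 2: delta between each distinct timestamp and the next one
starts = ms.drop_duplicates()
delta = starts.shift(-1) - starts          # NaN for the final batch
delta.index = starts.to_numpy()            # look up delta by timestamp value

# ASSUMPTION: the final batch has no following timestamp, so use 1000 ms
LAST_DELTA_MS = 1000
delta = delta.fillna(LAST_DELTA_MS)

# Step 3: rows per batch, and each row's 0-based position within its batch
groups = ms.groupby(ms)
n = groups.transform('size')
rank = groups.cumcount()

# Step 4: offset each row by rank * delta / n, truncated to whole milliseconds
spread = ms + (ms.map(delta) * rank // n).astype('int64')

print(spread.tolist())
```

Every operation here is vectorized, so the same code should scale to millions of rows far better than a row-by-row loop.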

expected output:

        Timestamp  value
0   1642847484000     11
1   1642847484750     10
2   1642847485500     14
3   1642847486250     20
4   1642847487000      3
5   1642847487250      2
6   1642847487500      9
7   1642847487750     48
8   1642847488000      5
9   1642847488250     20
10  1642847488500     12
11  1642847488750     20
12  1642847489000     56
13  1642847489333     12
14  1642847489666      8
15  1642847490000      4

But I couldn't find a way to do that efficiently. I used loops, but I have 15M+ rows.

Is there a simpler way to do it? Thank you


Answer

IIUC, you want to de-duplicate using interpolated values.

A simple way would be to mask the duplicates and to interpolate:

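s = df['Timestamp'].astype('int64')
df['Timestamp2'] = (s.mul(1000)                # seconds to milliseconds
                     .mask(s.duplicated())     # keep only the first row of each batch
                     .interpolate()            # linear interpolation between batch starts
                     .round().astype('int64')  # back to whole milliseconds
                     .astype(str)              # back to string
                   )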

output:

     Timestamp  value     Timestamp2
0   1642847484     11  1642847484000
1   1642847484     10  1642847484750
2   1642847484     14  1642847485500
3   1642847484     20  1642847486250
4   1642847487      3  1642847487000
5   1642847487      2  1642847487250
6   1642847487      9  1642847487500
7   1642847487     48  1642847487750
8   1642847487      5  1642847488000
9   1642847487     20  1642847488250
10  1642847487     12  1642847488500
11  1642847487     20  1642847488750
12  1642847489     56  1642847489000
13  1642847489     12  1642847489000
14  1642847489      8  1642847489000
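Note that the last batch has no following timestamp to interpolate towards, so rows 12 to 14 still share the same value in the output above and the column is not yet unique. One possible workaround, sketched here under the assumption that the next batch would arrive 1 second later, is to append a sentinel anchor before interpolating and drop it afterwards. `TAIL_GAP_MS`, `sentinel` and `indexed` are hypothetical names introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Timestamp': ['1642847484'] * 4 + ['1642847487'] * 8 + ['1642847489'] * 3,
    'value': [11, 10, 14, 20, 3, 2, 9, 48, 5, 20, 12, 20, 56, 12, 8],
})

s = df['Timestamp'].astype('int64')
ms = s.mul(1000).mask(s.duplicated())   # NaN on every repeated row

# ASSUMPTION: pretend the next batch starts 1000 ms after the last one
TAIL_GAP_MS = 1000
sentinel = pd.Series([s.iloc[-1] * 1000 + TAIL_GAP_MS], index=[len(df)])

# Interpolate with the sentinel in place, then drop it again
filled = pd.concat([ms, sentinel]).interpolate().iloc[:-1]
df['Timestamp2'] = filled.round().astype('int64')

# The milliseconds can now serve as a unique time index
indexed = df.set_index(pd.to_datetime(df['Timestamp2'], unit='ms'))
print(indexed.index.is_unique)
```

Because the values are rounded rather than truncated, the last two rows come out as ...333 and ...667 milliseconds.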
source: stackoverflow.com