Ad

Finding The Distance To The Next Higher Value In Pandas Dataframe

- 1 answer

I have a data frame containing floating point values

my_df = pd.DataFrame([1,2,1,4,3,2,5,4,7])

I'm trying to find for each number, when (how many indices need to move forward) till I find the next number larger than the current number, if there is no larger number, I mark it with some value (like 999999).

So for the example above, the correct answer should be

result = [1,2,1,3,2,1,2,1,999999]

Currently I've solved it by very slow double loop with itertuples (meaning O(n^2))

Is there a smarter way to do it ?

Ad

Answer

Here's a numpy based one leveraging broadcasting:

a = my_df.squeeze().to_numpy() # my_df.squeeze().values for versions 0.24.0.<

diff_mat = a - a[:,None]
result = (np.triu(diff_mat)>0).argmax(1) - np.arange(diff_mat.shape[1])
result[result <= 0] = 99999

print(result)

array([    1,     2,     1,     3,     2,     1,     2,     1, 99999],
      dtype=int64)

Where diff_mat is the distance matrix, and we're looking for the values from the main diagonal onwards, which are greater than 0:

array([[ 0,  1,  0,  3,  2,  1,  4,  3,  6],
       [-1,  0, -1,  2,  1,  0,  3,  2,  5],
       [ 0,  1,  0,  3,  2,  1,  4,  3,  6],
       [-3, -2, -3,  0, -1, -2,  1,  0,  3],
       [-2, -1, -2,  1,  0, -1,  2,  1,  4],
       [-1,  0, -1,  2,  1,  0,  3,  2,  5],
       [-4, -3, -4, -1, -2, -3,  0, -1,  2],
       [-3, -2, -3,  0, -1, -2,  1,  0,  3],
       [-6, -5, -6, -3, -4, -5, -2, -3,  0]], dtype=int64)

We have np.triu for that:

np.triu(diff_mat)

array([[ 0,  1,  0,  3,  2,  1,  4,  3,  6],
       [ 0,  0, -1,  2,  1,  0,  3,  2,  5],
       [ 0,  0,  0,  3,  2,  1,  4,  3,  6],
       [ 0,  0,  0,  0, -1, -2,  1,  0,  3],
       [ 0,  0,  0,  0,  0, -1,  2,  1,  4],
       [ 0,  0,  0,  0,  0,  0,  3,  2,  5],
       [ 0,  0,  0,  0,  0,  0,  0, -1,  2],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  3],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0]], dtype=int64)

And by checking which are greater than 0, and taking the argmax of the boolean ndarray we'll find the first value greater than 0 in each row:

(np.triu(diff_mat)>0).argmax(1)
array([1, 3, 3, 6, 6, 6, 8, 8, 0], dtype=int64)

We only need to subtract the corresponding offset from the main diagonal to the beginning

Ad
source: stackoverflow.com
Ad