# Function Applied To The Whole Dataset

## 29 August 2019 - 1 answer

Manually defined p and q:

``````
p = [[45.1024,7.7498],[45.1027,7.7513],[45.1072,7.7568],[45.1076,7.7563]]
q = [[45.0595,7.6829],[45.0595,7.6829],[45.0564,7.6820],[45.0533,7.6796],[45.0501,7.6775]]
``````
• Step 1 (works as expected)

This part of the code works:

``````
import numpy as np

def _c(ca, i, j, p, q):
    # ca caches results; -1 marks a cell that has not been computed yet
    if ca[i, j] > -1:
        return ca[i, j]
    elif i == 0 and j == 0:
        ca[i, j] = np.linalg.norm(p[i]-q[j])
    elif i > 0 and j == 0:
        ca[i, j] = max(_c(ca, i-1, 0, p, q), np.linalg.norm(p[i]-q[j]))
    elif i == 0 and j > 0:
        ca[i, j] = max(_c(ca, 0, j-1, p, q), np.linalg.norm(p[i]-q[j]))
    elif i > 0 and j > 0:
        ca[i, j] = max(
            min(
                _c(ca, i-1, j, p, q),
                _c(ca, i-1, j-1, p, q),
                _c(ca, i, j-1, p, q)
            ),
            np.linalg.norm(p[i]-q[j])
        )
    else:
        ca[i, j] = float('inf')
    return ca[i, j]
``````
• Step 2 (the problem is here)
``````
def frdist(p, q):

    # Remove nan values from p and q
    p = np.array([i for i in p if np.any(np.isfinite(i))], np.float64) # ESSENTIAL PART TO REMOVE NaN
    q = np.array([i for i in q if np.any(np.isfinite(i))], np.float64) # ESSENTIAL PART TO REMOVE NaN

    len_p = len(p)
    len_q = len(q)

    if len_p == 0 or len_q == 0:
        raise ValueError('Input curves are empty.')

    # p and q no longer have to be the same length
    if len(p[0]) != len(q[0]):
        raise ValueError('Input curves do not have the same dimensions.')

    # -1 marks cells that _c has not filled in yet
    ca = (np.ones((len_p, len_q), dtype=np.float64) * -1)

    dist = _c(ca, len_p-1, len_q-1, p, q)
    return dist
``````
``````
frdist(p, q)
0.09754839824415232
``````

Question: What should I change in Step 2 to apply the code to the dataset `df` below? (Again, this is only a sample. The real dataset is very big.)

``````
    1           1.1     2           2.1     3           3.1     4           4.1     5           5.1
0   43.1024     6.7498  NaN         NaN     NaN         NaN     NaN         NaN     NaN         NaN
1   46.0595     1.6829  25.0695     3.7463  NaN         NaN     NaN         NaN     NaN         NaN
2   25.0695     5.5454  44.9727     8.6660  41.9726     2.6666  84.9566     3.8484  44.9566     1.8484
3   35.0281     7.7525  45.0322     3.7465  14.0369     3.7463  NaN         NaN     NaN         NaN
4   35.0292     7.5616  45.0292     4.5616  23.0292     3.5616  45.0292     6.7463  NaN
``````

Take the first row as `p` and the second row as `q`, and compute the distance `frdist(p, q)`. Then keep `p` as the first row and take the third row as `q`, and so on for every pair of rows (for example, rows 1 and 3).

In the end I should get a square matrix of shape (rows, rows) with zeros on the diagonal, because the distance between a row and itself is 0:

``````
       0    1    2    3    4    5   ...  105
0      0
1           0
2                0
3                     0
4                          0
5                               0
...                                  0
105                                      0
``````

Since your working code expects lists of lists as arguments, you need to convert each row of your dataframe to a list of lists like the `p` and `q` in your example. Assuming `df` is your dataframe, you can do this as follows:

``````
def pairwise(it):
    # Yield consecutive, non-overlapping pairs: (x0, x1), (x2, x3), ...
    a = iter(it)
    return zip(a, a)

ddf = df.apply(lambda x : [pair for pair in pairwise(x)], axis=1)
``````

I took the `pairwise` function from this answer.

`ddf` is a pandas Series with one entry per row of `df`; each element is a list of coordinate pairs like `p` or `q`.
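To make the conversion concrete, here is a minimal sketch on a small frame. The frame below is an assumption for illustration: it reuses only the first two coordinate pairs of the first three sample rows, and the column labels are copied from the sample.

```python
import numpy as np
import pandas as pd

def pairwise(it):
    # Yield consecutive, non-overlapping pairs: (x0, x1), (x2, x3), ...
    a = iter(it)
    return zip(a, a)

# Illustrative subset of the sample data (first two pairs of rows 0-2)
df = pd.DataFrame(
    [[43.1024, 6.7498, np.nan, np.nan],
     [46.0595, 1.6829, 25.0695, 3.7463],
     [25.0695, 5.5454, 44.9727, 8.6660]],
    columns=['1', '1.1', '2', '2.1'],
)

# Each row becomes a list of (first, second) coordinate tuples
ddf = df.apply(lambda x: [pair for pair in pairwise(x)], axis=1)

print(ddf.iloc[1])
```

The all-NaN pairs are still present in `ddf` at this point; they are dropped later, inside `frdist`.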

Then you need to work with combinations of the row indexes. Have a look at the `itertools` module. Depending on your needs, you can use one of `product`, `permutations` or `combinations`.

If you want every ordered pair, including each row paired with itself, you can use:

``````
from itertools import product
idxpairs = product(ddf.index, repeat=2)
``````

`idxpairs` holds all possible pairs of the indexes in your dataframe. You can loop over them.
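As a quick illustration of how the three helpers differ, the sketch below runs them on a three-element list. The name `idx` is a stand-in for `ddf.index` and is not part of the original code.

```python
from itertools import combinations, permutations, product

idx = [0, 1, 2]  # stand-in for ddf.index

# product: every ordered pair, including (i, i)
all_pairs = list(product(idx, repeat=2))
# permutations: every ordered pair with i != j
ordered_pairs = list(permutations(idx, 2))
# combinations: every unordered pair, each one only once
unordered_pairs = list(combinations(idx, 2))

print(len(all_pairs), len(ordered_pairs), len(unordered_pairs))  # 9 6 3
print(unordered_pairs)  # [(0, 1), (0, 2), (1, 2)]
```

All three return iterators that can be consumed only once, so wrap the result in `list(...)` if you need to loop over the pairs more than one time.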

You can build your final matrix like this:

``````
import pandas as pd

fmatrix = pd.DataFrame(index=ddf.index, columns=ddf.index)

for pp in idxpairs:
    # idxpairs holds index labels, so look the rows up with .loc
    fmatrix.loc[pp[0], pp[1]] = frdist(ddf.loc[pp[0]], ddf.loc[pp[1]])
``````

This computes every element by brute force. If you have a big dataframe and you know in advance that your final matrix has certain properties, for example that the diagonal is 0 and that it is symmetric (I guess `frdist(p, q) == frdist(q, p)`), you can save time by using `combinations` instead of `product` so that each distance is computed only once:

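``````
from itertools import combinations
idxpairs = combinations(ddf.index, 2)

# Start from zeros so the diagonal (distance of a row to itself) is already 0
fmatrix = pd.DataFrame(0.0, index=ddf.index, columns=ddf.index)

for pp in idxpairs:
    # idxpairs holds index labels, so look the rows up with .loc
    res = frdist(ddf.loc[pp[0]], ddf.loc[pp[1]])
    # Fill both halves of the symmetric matrix from one computation
    fmatrix.loc[pp[0], pp[1]] = res
    fmatrix.loc[pp[1], pp[0]] = res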
``````
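Putting the pieces together, here is a minimal end-to-end sketch. To stay self-contained it makes two assumptions that are not in the original code: `frdist_iter` is a hypothetical stand-in for the question's recursive `frdist` (same recurrence, filled with loops, which may also help with Python's recursion limit on long curves), and the three-row frame reuses only the first two coordinate pairs of the sample rows.

```python
from itertools import combinations

import numpy as np
import pandas as pd

def pairwise(it):
    a = iter(it)
    return zip(a, a)

def frdist_iter(p, q):
    """Hypothetical loop-based stand-in for the question's frdist."""
    # Drop points that contain no finite coordinate (same filter as frdist)
    p = np.array([pt for pt in p if np.any(np.isfinite(pt))], np.float64)
    q = np.array([pt for pt in q if np.any(np.isfinite(pt))], np.float64)
    if len(p) == 0 or len(q) == 0:
        raise ValueError('Input curves are empty.')
    # d[i, j] is the Euclidean distance between p[i] and q[j]
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=2)
    ca = np.empty_like(d)
    for i in range(len(p)):
        for j in range(len(q)):
            if i == 0 and j == 0:
                ca[i, j] = d[0, 0]
            elif j == 0:
                ca[i, j] = max(ca[i - 1, 0], d[i, 0])
            elif i == 0:
                ca[i, j] = max(ca[0, j - 1], d[0, j])
            else:
                ca[i, j] = max(
                    min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]),
                    d[i, j],
                )
    return ca[-1, -1]

# Illustrative subset of the sample data (first two pairs of rows 0-2)
df = pd.DataFrame(
    [[43.1024, 6.7498, np.nan, np.nan],
     [46.0595, 1.6829, 25.0695, 3.7463],
     [25.0695, 5.5454, 44.9727, 8.6660]],
    columns=['1', '1.1', '2', '2.1'],
)
ddf = df.apply(lambda x: [pair for pair in pairwise(x)], axis=1)

# Zero-initialised, so the diagonal needs no extra work
fmatrix = pd.DataFrame(0.0, index=ddf.index, columns=ddf.index)
for a, b in combinations(ddf.index, 2):
    res = frdist_iter(ddf.loc[a], ddf.loc[b])
    fmatrix.loc[a, b] = res
    fmatrix.loc[b, a] = res

print(fmatrix)
```

For n rows this evaluates n * (n - 1) / 2 distances instead of the n * n needed with `product`.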