Maintaining Proportionality Of Sum Constrained Row When Changing A Single Row Value


I have a dataset that consists of compositional data. Each column represents the percentage (decimal value) of a component in the mixture whole. Each row sums to 1.

If one component in the mixture changes, the rest must change accordingly so as to satisfy the sum constraint.

I am performing a multiple linear regression with this data and it requires some transformations such that the regression coefficients are meaningful and interpretable. The dataset contains zero values and this is an issue for the particular type of transformation I am trying to implement.

Before I can perform this transformation, the recommended action is to replace all zero values with a small number and to adjust the remaining components such that the sum constraint is still satisfied.

You can see in the dummy df below that some rows contain more than one zero value.

import pandas as pd

data = {'X1': [0.21, 0.08, 0.57, 0.03],
        'X2': [0.27, 0.56, 0.0, 0.02],
        'X3': [0.0, 0.14, 0.0, 0.45],
        'X4': [0.13, 0.02, 0.26, 0.37],
        'X5': [0.39, 0.2, 0.17, 0.13]}

df = pd.DataFrame(data)


Let's just consider one row. The formula is as follows:

Let the original value of the changed component be r_i. For a change of delta_i in component i, the new value is x_i:

x_i = r_i + delta_i

To maintain the relative proportions between the remaining components, let r_j be the original value of a remaining component j. The new value x_j is then:

x_j = r_j - (r_j / (1 - r_i)) * delta_i,  for j != i
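As a quick sanity check on the formula, here is a short derivation showing that the updated row still sums to 1. It assumes the original row sums to 1 (so the remaining components sum to 1 - r_i) and that r_i is not equal to 1:

```latex
\begin{aligned}
x_i + \sum_{j \neq i} x_j
  &= (r_i + \delta_i) + \sum_{j \neq i} r_j \left(1 - \frac{\delta_i}{1 - r_i}\right) \\
  &= r_i + \delta_i + (1 - r_i)\left(1 - \frac{\delta_i}{1 - r_i}\right) \\
  &= r_i + \delta_i + (1 - r_i) - \delta_i \\
  &= 1
\end{aligned}
```

Every remaining component is multiplied by the same factor, 1 - delta_i / (1 - r_i), so any ratio x_j / x_k between two remaining components equals r_j / r_k.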

I am struggling to write a loop that searches the dataset for zero values, adds a small number to each cell containing a zero, and then adjusts the rest of that row using the formula I described above.


Sorry for the bad representation of the maths formula.

For the first row in the dummy df the application of the formula is straightforward, as there is only one zero in the row:

Table showing updated values of X1, X2, X4 and X5 when a small value is added to X3

It's important that the relative proportions between the remaining components remain the same. You can see that here as I update the zero value to a small number.

Relative proportions from the first row of the dummy df
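The first-row case can be sketched in plain Python as below. This is only an illustration: the value of `delta` (1e-6) is an assumption, and the dictionary simply restates the first row of the dummy df.

```python
# First row of the dummy df; delta is an assumed small replacement value.
row = {"X1": 0.21, "X2": 0.27, "X3": 0.0, "X4": 0.13, "X5": 0.39}
delta = 1e-6
i = "X3"          # the component being changed
r_i = row[i]      # its original value (0.0 here)

# x_i = r_i + delta;  x_j = r_j - (r_j / (1 - r_i)) * delta  for j != i
updated = {
    name: (r + delta if name == i else r - (r / (1 - r_i)) * delta)
    for name, r in row.items()
}

print(updated)
print("row sum:", sum(updated.values()))
# The ratio between two untouched components is unchanged (up to rounding).
print("X1/X2 before:", row["X1"] / row["X2"])
print("X1/X2 after: ", updated["X1"] / updated["X2"])
```

The row sum and the X1/X2 ratio should agree with the originals up to floating-point rounding.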

For the third row in the dummy df things get a bit more complicated. I update the first zero value (X2) by adding a small number. The second zero value (X3) remains zero because the formula only rescales r_j, and r_j is zero. So I make a second update such that X2 and X3 are both small non-zero values; this is shown in the third row of the table below.

Table showing updated values of X1, X4 and X5 when a small value is added to X2 and X3

The relative proportions between the remaining components are likewise maintained when a row contains more than one zero.

Relative proportions from the third row of the dummy df
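The two-step update for the third row can be sketched the same way. The helper `change_component` is a hypothetical name introduced only for this illustration, and `delta` is again an assumed value.

```python
def change_component(row, i, delta):
    """Return a new row in which component i changes by delta
    and every other component is rescaled with the formula above."""
    r_i = row[i]
    return {
        name: (r + delta if name == i else r - (r / (1 - r_i)) * delta)
        for name, r in row.items()
    }

# Third row of the dummy df; delta is an assumed small replacement value.
row = {"X1": 0.57, "X2": 0.0, "X3": 0.0, "X4": 0.26, "X5": 0.17}
delta = 1e-6

step1 = change_component(row, "X2", delta)    # X3 is still 0 after this step
step2 = change_component(step1, "X3", delta)  # X2 is rescaled slightly here too

print(step1)
print(step2)
print("row sum:", sum(step2.values()))
```

Note that in the second step X2 is treated as a remaining component, so it ends up marginally below `delta`, while the row still sums to 1.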

I can't think of a loop for the first problem, never mind the second! Also, don't worry about the large numbers created by dividing by a small number in the relative proportions tables, I'll deal with that later.



Here is the answer, following the edit:

import pandas as pd
# To show 10 decimal places.
pd.options.display.float_format = '{:.10f}'.format

data = {'X1': [0.21, 0.08, 0.57, 0.03],
        'X2': [0.27, 0.56, 0.0, 0.02],
        'X3': [0.0, 0.14, 0.0, 0.45],
        'X4': [0.13, 0.02, 0.26, 0.37],
        'X5': [0.39, 0.2, 0.17, 0.13]}

df = pd.DataFrame(data)

delta_i = 0.000001
r_i = 0.0

# Provided formula: rescale a remaining component r_j when component i changes by delta_i.
def adjust_proportion(r_j, r_i, delta_i):
    return r_j - ((r_j / (1 - r_i)) * delta_i)

# For row-wise application.
def adjust_row(row, r_i, delta_i):
    # Work on a copy so the object passed in by apply() is not mutated.
    row = row.copy()

    # Get all zeros and their count in the row.
    zero_mask = (row == 0)
    zero_count = zero_mask.sum()

    # Adjust the "cells" not in the mask once, by the total amount that
    # will be added to the zeros, so the row still sums to exactly 1.
    row[~zero_mask] = adjust_proportion(row[~zero_mask], r_i, zero_count * delta_i)

    # Increase the mask by delta_i across the board.
    row[zero_mask] += delta_i

    return row

# Apply ROW-WISE using axis=1 and keep the result.
df = df.apply(lambda x: adjust_row(x, r_i, delta_i), axis=1)

# Check sums.
print(df.sum(axis=1))

This gives the following result:

[Image: output of the code above]

There are more efficient ways, but this should take care of the general logic.
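One possible vectorized variant is sketched below. It is not the approach from the code above but a swapped-in alternative under stated assumptions: every zero in a row is replaced by the same `delta` (an assumed value), and the zeros in a row are treated as one combined change of k * delta, where k is the number of zeros in that row.

```python
import pandas as pd

data = {'X1': [0.21, 0.08, 0.57, 0.03],
        'X2': [0.27, 0.56, 0.0, 0.02],
        'X3': [0.0, 0.14, 0.0, 0.45],
        'X4': [0.13, 0.02, 0.26, 0.37],
        'X5': [0.39, 0.2, 0.17, 0.13]}
df = pd.DataFrame(data)

delta = 1e-6  # assumed small replacement value

# Boolean mask of zero cells and the number of zeros in each row.
zero_mask = df.eq(0)
zeros_per_row = zero_mask.sum(axis=1)

# Scale each row by (1 - k * delta); zero cells stay zero at this point.
scaled = df.mul(1 - zeros_per_row * delta, axis=0)

# Replace the zero cells with delta.
result = scaled.mask(zero_mask, delta)

print(result)
print(result.sum(axis=1))
```

Because all non-zero cells in a row share one scaling factor, their relative proportions are unchanged, and rows without zeros are scaled by 1 and so left as they were.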