Ad

Join Columns And Reshape In Row

- 1 answer

i have data like this:

data = {'Host' : ['A','A','A','A','A','A','B','B','B'], 'Duration' : ['1','2',None,'4','5',None,'7','8',None], 'Predict' : [None,None,'3',None,None,'6',None,None,'9']}
df = pd.DataFrame(data)

It looks like:


Host    Duration    Predict
0   A        1       None
1   A        2       None
2   A       None       3
3   A        4       None
4   A        5       None
5   A       None       6
6   B        7       None
7   B        8       None
8   B       None       9

What i expected to get:

A   1, 2, 3
A   4, 5, 6
B   7, 8, 9

I got what I wanted, but the way that I decided I do not like:

def create_vector(group):
    result = []
    df_array = []
    for index, item  in enumerate(group.Duration.ravel()):
        if (item != None):
            result.append(item)        
        else:            
            result.append(group.Predict.ravel()[index])
            result.append(-1)
    result = np.array(list(map(int, result)))
    splitted = np.split(result, np.where(result == -1)[0] + 1)
    for arr in splitted:
        if (len(arr) > 3):
            seq = ', '.join(str(e) for e in arr[:-1])
            df_array.append(seq)
    return pd.DataFrame(df_array,columns=['seq'])

Minimal length of arr must be one 'Duration' plus one 'Predict'

df= df.groupby(['host']).apply(create_vector)
df= df.reset_index().rename(columns={'level_1':'Index'})
df= df.drop(columns = {'Index'})

Would like to solve this problem using pandas. Waiting for comments and advice

Ad

Answer

I believe you can replace missing values from Duration by Predict column, so solution is simplify:

df['new'] = df['Duration'].fillna(df['Predict']).astype(str)

If need groups each 3 values by Host groups:

g = df.groupby('Host').cumcount() // 3

Or if need groups by Predict column with None separator - only necessary default index:

g = df.index.where(df['Predict'].notna()).to_series().bfill()

#if always unique values in Predic column
#g = df['Predict'].bfill()

df = (df.groupby(['Host', g])['new']
        .apply(', '.join)
        .reset_index(level=1, drop=True)
        .reset_index(name='Seq'))
print (df)
  Host      Seq
0    A  1, 2, 3
1    A  4, 5, 6
2    B  7, 8, 9

Another solution with reshape by DataFrame.stack - Nones or missing values are removed by default with again aggregate join:

g = df.groupby('Host').cumcount() // 3

df = (df.set_index(['Host', g])
        .stack()
        .astype(str)
        .groupby(level=[0,1])
        .apply(', '.join)
        .reset_index(level=1, drop=True)
        .reset_index(name='Seq')
        )
print (df)
  Host      Seq
0    A  1, 2, 3
1    A  4, 5, 6
2    B  7, 8, 9
Ad
source: stackoverflow.com
Ad