# Is There A Way To Vectorize Counting Items' Co-occurrences In Pandas/numpy?

## 23 October 2019 - 1 answer

I frequently need to generate network graphs based on the co-occurrences of items in a column. I start off with something like this:

``````
           letters
0  [b, a, e, f, c]
1        [a, c, d]
2        [c, b, j]
``````

In the following example, I want to make a table of all pairs of letters, together with a "weight" column that counts how many times each two-letter pair appeared in the same row (see the bottom for an example).

I am currently doing large parts of it with a for loop, and I was wondering if there is a way to vectorize it, as I often deal with enormous datasets that take an extremely long time to process this way. I am also concerned about keeping things within memory limits. This is my code right now:

``````
import pandas as pd

# Make some data
df = pd.DataFrame({'letters': [['b','a','e','f','c'],['a','c','d'],['c','b','j']]})

# I make a list of frozensets (hashable, so they can be counted below),
# one for each pair of elements that co-occur in the same list
sets = []
for lst in df['letters']:
    for i, a in enumerate(lst):
        for b in lst[i + 1:]:
            sets.append(frozenset({a, b}))

# sets now looks like:
# [frozenset({'a', 'b'}),
#  frozenset({'b', 'e'}),
#  frozenset({'b', 'f'}), ...

# Dataframe with one column containing the sets
df = pd.DataFrame({'weight': sets})

# We count how many times each pair occurs together; naming the columns
# explicitly keeps this working across pandas versions
df = df['weight'].value_counts().rename_axis('index').reset_index(name='weight')

# Split the sets into two separate columns
split = pd.DataFrame([list(s) for s in df['index']]) \
    .rename(columns=lambda x: f'Node{x+1}') \
    .fillna('-')

# Merge the 'weight' column back onto the dataframe
df = pd.concat([df['weight'], split], axis=1)

# Output:
#    weight Node1 Node2
# 0       2     c     b
# 1       2     a     c
# 2       1     f     e
# 3       1     d     c
# 4       1     j     b
``````

A numpy/scipy solution using sparse incidence matrices:

``````
from itertools import chain
import numpy as np
from scipy import sparse
from simple_benchmark import BenchmarkBuilder, MultiArgument

B = BenchmarkBuilder()

@B.add_function()
def pp(L):
    # offsets of the sublists within the concatenated data
    SZS = np.fromiter(chain((0,), map(len, L)), int, len(L) + 1).cumsum()
    unq, idx = np.unique(np.concatenate(L), return_inverse=True)
    # sparse incidence matrix: rows = sublists, columns = unique letters
    S = sparse.csr_matrix((np.ones(idx.size, int), idx, SZS), (len(L), len(unq)))
    SS = (S.T@S).tocoo()
    # keep only the strictly upper triangle: each unordered pair once
    idx = (SS.col > SS.row).nonzero()
    return unq[SS.row[idx]], unq[SS.col[idx]], SS.data[idx]  # left, right, count

from collections import Counter
from itertools import combinations

@B.add_function()
def yatu(L):
    return Counter(chain.from_iterable(combinations(sorted(i), r=2) for i in L))

@B.add_function()
def feature_engineer(L):
    return Counter((min(nodes), max(nodes))
                   for row in L for nodes in combinations(row, 2))

from string import ascii_lowercase as ltrs

ltrs = np.array([*ltrs])

@B.add_arguments('number of sublists')
def argument_provider():
    for exp in range(4, 30):
        n = int(1.4**exp)
        L = [ltrs[np.maximum(0, np.random.randint(-2, 2, 26)).astype(bool).tolist()] for _ in range(n)]
        yield n, L

r = B.run()
r.plot()
``````

We see that the method presented here (`pp`) comes with the typical constant numpy overhead, but from roughly 100 sublists onward it starts winning.
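
To make the `S.T@S` step concrete, here is a small dense sketch of the incidence-matrix idea (the toy data and variable names below are mine; the answer's `pp` builds the same matrix with `scipy.sparse` and `np.unique`'s inverse indices):

```python
import numpy as np

# Toy data: three sublists over the alphabet {a, b, c}
L = [['b', 'a', 'c'], ['a', 'c'], ['c', 'b']]
unq = np.unique(np.concatenate(L))  # array(['a', 'b', 'c'])

# Incidence matrix: one row per sublist, one column per unique letter,
# with a 1 wherever that letter occurs in that sublist
S = np.array([[int(u in row) for u in unq] for row in L])

# (S.T @ S)[i, j] counts the sublists containing both letter i and letter j;
# the diagonal holds plain occurrence counts, which is why pp keeps only
# the strictly upper-triangular entries (SS.col > SS.row)
C = S.T @ S
print(C)
```

On this toy input the off-diagonal entry for (`'a'`, `'c'`) is 2, matching the two sublists that contain both letters.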

The OP's example:

``````
import pandas as pd

df = pd.DataFrame({'letters': [['b','a','e','f','c'],['a','c','d'],['c','b','j']]})
pd.DataFrame(dict(zip(["left", "right", "count"], pp(df['letters']))))
``````

Prints:

``````
   left right  count
0     a     b      1
1     a     c      2
2     b     c      2
3     c     d      1
4     a     d      1
5     c     e      1
6     a     e      1
7     b     e      1
8     c     f      1
9     e     f      1
10    a     f      1
11    b     f      1
12    b     j      1
13    c     j      1
``````
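
As a quick cross-check of the table above, the `Counter`-based `yatu` variant produces the same counts on the OP's data; the snippet below restates that function so it runs standalone:

```python
from collections import Counter
from itertools import chain, combinations

def yatu(L):
    # one sorted 2-combination per co-occurring pair, tallied over all rows
    return Counter(chain.from_iterable(combinations(sorted(i), r=2) for i in L))

L = [['b', 'a', 'e', 'f', 'c'], ['a', 'c', 'd'], ['c', 'b', 'j']]
counts = yatu(L)
print(counts[('a', 'c')], counts[('b', 'c')])  # both 2, as in the table
print(len(counts))                             # 14 distinct pairs
```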