Is There A Way To Vectorize Counting Items' Co-occurences In Pandas/numpy?
I frequently need to generate network graphs based on the co-occurences of items in a column. I start of with something like this:
letters
0 [b, a, e, f, c]
1 [a, c, d]
2 [c, b, j]
In the following example, I want a to make a table of all pairs of letters, and then have a "weight" column, which describes how many times each two letter pair appeared in the same row together (see bottom for example).
I am currently doing large parts of it using a for loop, and I was wondering if there is a way for me to vectorize it, as I am often dealing with enormous datasets that take an extremely long time to process in this way. I am also concerned about keeping things within memory limits. This is my code right now:
import pandas as pd
# Make some data
df = pd.DataFrame({'letters': [['b','a','e','f','c'],['a','c','d'],['c','b','j']]})
# I make a list of sets, which contain pairs of all the elements
# that co-occur in the data in the same list
sets = []
for lst in df['letters']:
for i, a in enumerate(lst):
for b in lst[i:]:
if not a == b:
sets.append({a, b})
# Sets now looks like:
# [{'a', 'b'},
# {'b', 'e'},
# {'b', 'f'},...
# Dataframe with one column containing the sets
df = pd.DataFrame({'weight': sets})
# We count how many times each pair occurs together
df = df['weight'].value_counts().reset_index()
# Split the sets into two seperate columns
split = pd.DataFrame(df['index'].values.tolist()) \
.rename(columns = lambda x: f'Node{x+1}') \
.fillna('-')
# Merge the 'weight' column back onto the dataframe
df = pd.concat([df['weight'], split], axis = 1)
print(df.head)
# Output:
weight Node1 Node2
0 2 c b
1 2 a c
2 1 f e
3 1 d c
4 1 j b
Answer
A numpy/scipy solution using sparse incidence matrices:
from itertools import chain
import numpy as np
from scipy import sparse
from simple_benchmark import BenchmarkBuilder, MultiArgument
B = BenchmarkBuilder()
@B.add_function()
def pp(L):
SZS = np.fromiter(chain((0,),map(len,L)),int,len(L)+1).cumsum()
unq,idx = np.unique(np.concatenate(L),return_inverse=True)
S = sparse.csr_matrix((np.ones(idx.size,int),idx,SZS),(len(L),len(unq)))
SS = ([email protected]).tocoo()
idx = (SS.col>SS.row).nonzero()
return unq[SS.row[idx]],unq[SS.col[idx]],SS.data[idx] # left, right, count
from collections import Counter
from itertools import combinations
@B.add_function()
def yatu(L):
return Counter(chain.from_iterable(combinations(sorted(i),r=2) for i in L))
@B.add_function()
def feature_engineer(L):
Counter((min(nodes), max(nodes))
for row in L for nodes in combinations(row, 2))
from string import ascii_lowercase as ltrs
ltrs = np.array([*ltrs])
@B.add_arguments('array size')
def argument_provider():
for exp in range(4, 30):
n = int(1.4**exp)
L = [ltrs[np.maximum(0,np.random.randint(-2,2,26)).astype(bool).tolist()] for _ in range(n)]
yield n,L
r = B.run()
r.plot()
We see that the method presented here (pp
) comes with the typical numpy constant overhead, but from ~100 sublists it starts winning.
OPs example:
import pandas as pd
df = pd.DataFrame({'letters': [['b','a','e','f','c'],['a','c','d'],['c','b','j']]})
pd.DataFrame(dict(zip(["left", "right", "count"],pp(df['letters']))))
Prints:
left right count
0 a b 1
1 a c 2
2 b c 2
3 c d 1
4 a d 1
5 c e 1
6 a e 1
7 b e 1
8 c f 1
9 e f 1
10 a f 1
11 b f 1
12 b j 1
13 c j 1
Related Questions
- → What are the pluses/minuses of different ways to configure GPIOs on the Beaglebone Black?
- → Django, code inside <script> tag doesn't work in a template
- → React - Django webpack config with dynamic 'output'
- → GAE Python app - Does URL matter for SEO?
- → Put a Rendered Django Template in Json along with some other items
- → session disappears when request is sent from fetch
- → Python Shopify API output formatted datetime string in django template
- → Can't turn off Javascript using Selenium
- → WebDriver click() vs JavaScript click()
- → Shopify app: adding a new shipping address via webhook
- → Shopify + Python library: how to create new shipping address
- → shopify python api: how do add new assets to published theme?
- → Access 'HTTP_X_SHOPIFY_SHOP_API_CALL_LIMIT' with Python Shopify Module