
Fastest Way To Convert Huge Dictionary To A DataFrame


I have a dictionary that looks like this:

d = {'A': {'A': 0, 'B': 1, 'C': 1, 'D': 2, 'E': 2, 'F': 2, 'G': 2, 'H': 2, 'I': 3},
     'B': {'B': 0, 'A': 1, 'K': 1, 'O': 1, 'M': 1, 'Q': 1, 'L': 1, 'Z': 2, 'T': 2},
     'C': {'C': 0, 'R': 1, 'A': 1, 'D': 2, 'F': 2, 'J': 2, 'E': 2, 'Y': 2, 'B': 2},
     'D': {'D': 0, 'F': 1, 'H': 1, 'I': 1, 'E': 1, 'A': 2, 'C': 2, 'S': 2, 'U': 3}}

But in fact it is way bigger (up to 60K keys) and I need a very fast and efficient way to turn this dictionary into a dataframe that looks like this:

person_1     person_2    degree
   A            A          0
   A            B          1 
   A            C          1
   A            D          2
   A            E          2 
   A            F          2
   A            G          2
   A            H          2 
   A            I          3
   B            B          0
   B            A          1 
   B            K          1 
   B            O          1
   B            M          1 
   B            Q          1 
   B            L          1
   B            Z          2 
   B            T          2 
   C            C          0
   C            R          1 
   C            A          1 
   C            D          2
   C            F          2 
   C            J          2 
   C            E          2
   C            Y          2 
   C            B          2
   D            D          0
   D            F          1 
   D            H          1 
   D            I          1
   D            E          1 
   D            A          2 
   D            C          2
   D            S          2 
   D            U          3

So basically I need a dataframe where the first two columns come from the outer and inner dictionary keys, and the third column is the value stored under that inner key. What I'm doing right now is converting the dictionary to a df using df = pd.DataFrame(d) and then

df = pd.melt(df.reset_index(), 'index').rename(columns={'index': 'hcp_npi',
                                                        'variable': 'connected_hcp_npi',
                                                        'value': 'degree_of_separation'}).dropna()

And I get the result I need. But the problem with this approach is that when the dictionary exceeds 20K keys, the melt function just takes forever to run. So I'm looking for a faster or more efficient way to go from the initial dictionary to the final dataframe.

Thanks!


Answer

It looks like it's faster to pre-process the dictionary into the column values:

from collections import defaultdict

import pandas as pd

d2 = defaultdict(list)
for k, v in d.items():
    # repeat the outer key once per inner entry
    d2['person_1'] += [k] * len(v)
    # inner keys and values become the other two columns
    d2['person_2'] += list(v.keys())
    d2['degree'] += list(v.values())

df = pd.DataFrame(d2)
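For example, running the loop on a small slice of the dictionary from the question (just the first two outer keys) yields the three columns directly:

```python
from collections import defaultdict

import pandas as pd

d = {'A': {'A': 0, 'B': 1, 'C': 1, 'D': 2},
     'B': {'B': 0, 'A': 1, 'K': 1}}

d2 = defaultdict(list)
for k, v in d.items():
    d2['person_1'] += [k] * len(v)   # outer key, repeated
    d2['person_2'] += list(v.keys())  # inner keys
    d2['degree'] += list(v.values())  # inner values

df = pd.DataFrame(d2)
print(df)
```

Because the lists are built in plain Python first, pandas only has to do one cheap column-wise construction at the end, instead of pivoting a wide frame with melt.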

I tested your method, @jezrael's, @BENY's (now deleted) and mine using timeit, with code like this (replacing the stmt as appropriate):

timeit.timeit(setup='''
import pandas as pd
d = {'A': {'A': 0, 'B': 1, 'C': 1, 'D': 2, 'E': 2, 'F': 2, 'G': 2, 'H': 2, 'I': 3},
 'B': {'B': 0, 'A': 1, 'K': 1, 'O': 1, 'M': 1, 'Q': 1, 'L': 1, 'Z': 2, 'T': 2},
 'C': {'C': 0, 'R': 1, 'A': 1, 'D': 2, 'F': 2, 'J': 2, 'E': 2, 'Y': 2, 'B': 2},
 'D': {'D': 0, 'F': 1, 'H': 1, 'I': 1, 'E': 1, 'A': 2, 'C': 2, 'S': 2, 'U': 3}
 }
''',
stmt='''
df = pd.DataFrame(d)
df = pd.melt(df).rename(columns={'index': 'hcp_npi',
                                 'variable': 'connected_hcp_npi',
                                 'value': 'degree_of_separation'}).dropna()
''',
number=1000)

For 1000 iterations, the results were:

Nick      0.2878
jezrael   0.3178
BENY      2.2822
TomasCB   2.2774

For reference, I include @BENY's answer here:

pd.concat({x : pd.Series(y) for x , y in d.items()}).reset_index()
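Note that this one-liner comes back with pandas' default column names (level_0, level_1 and 0), so to match the target frame you would still rename the columns; a small sketch of one way to do that (the set_axis call and the sample data are my own, matching the question's desired output):

```python
import pandas as pd

d = {'A': {'A': 0, 'B': 1, 'C': 1},
     'B': {'B': 0, 'A': 1, 'K': 1}}

# concat on a dict of Series builds a MultiIndexed Series;
# reset_index flattens it into three columns, which set_axis renames
df = (pd.concat({x: pd.Series(y) for x, y in d.items()})
        .reset_index()
        .set_axis(['person_1', 'person_2', 'degree'], axis=1))
print(df)
```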
source: stackoverflow.com