What Is the Fastest Way of Querying Strings in Python?

- 1 answer

Assume you have to query the prices of around 100 items (in the example I query just three) that are unique. What would be the fastest way (measured in seconds per query) of looking up these items by name using a Python data structure?

Here is an example with a pandas DataFrame (querying just 3 items; in practice I would query around 50 or 100).

import pandas as pd
import time

# 500,000 rows: five distinct labels repeated 100,000 times
df = pd.DataFrame({"price": [11, 33, 5, 29, 999] * 100000},
                  index=["car", "boat", "axe", "fork", "plane"] * 100000)
# now query the price of these items:
time_start = time.perf_counter()
query = df.loc[["car", "boat", "plane"]]
time_elapsed = round(time.perf_counter() - time_start, 4)
print(f"[INFO] Time elapsed: {time_elapsed} seconds")

Can you think of faster ways than the one I showed? Ideally I would like a plain data structure rather than a database, but I'm open to suggestions about databases like MongoDB (as comments rather than answers).




I doubt you'll get much better than using a mask:

import pandas as pd

df = pd.DataFrame({"price": [11, 33, 5, 29, 999] * 100000},
                  index=["car", "boat", "axe", "fork", "plane"] * 100000)
items = ["car", "boat", "plane"]

def v1(df, items):
    # Transpose so index labels become columns, select, transpose back
    df = df.copy()
    return df.T[items].T

def v2(df, items):
    # Label-based selection with .loc
    df = df.copy()
    return df.loc[items]

def v3(df, items):
    # Boolean mask built from index membership
    df = df.copy()
    return df[df.index.isin(items)]

print('Treating index as columns:')
%timeit v1(df, items)
print('\nUsing loc:')
%timeit v2(df, items)
print('\nUsing a mask:')
%timeit v3(df, items)


Treating index as columns:
108 ms ± 8.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Using loc:
57 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Using a mask:
12.8 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
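
If you want to reproduce these numbers outside IPython (where the `%timeit` magic is unavailable), here is a minimal sketch using the stdlib `timeit` module; the variable names are my own:

```python
import timeit
import pandas as pd

# Rebuild the example frame: 500,000 rows, five distinct labels repeated.
df = pd.DataFrame(
    {"price": [11, 33, 5, 29, 999] * 100000},
    index=["car", "boat", "axe", "fork", "plane"] * 100000,
)
items = ["car", "boat", "plane"]

# Time both lookups with the stdlib timeit module (average over 10 calls).
loc_s = timeit.timeit(lambda: df.loc[items], number=10) / 10
mask_s = timeit.timeit(lambda: df[df.index.isin(items)], number=10) / 10
print(f"loc : {loc_s:.4f} s per call")
print(f"mask: {mask_s:.4f} s per call")
```

One caveat worth knowing: `.loc` returns the rows grouped by the requested labels in the order you asked for them, while the `isin` mask preserves the frame's original row order, so the two results hold the same rows but are equal only up to ordering.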