
Efficiently Classifying A DB Of Event Logs By A Column


Situation

I am using Python 3.7.2 with its built-in sqlite3 module. (sqlite3.version == 2.6.0)

I have a sqlite database that looks like:

| user_id | action | timestamp  |
| ------- | ------ | ---------- |
| Alice   |      0 | 1551683796 |
| Alice   |     23 | 1551683797 |
| James   |      1 | 1551683798 |
| ....... | ...... | .......... |

where user_id is TEXT, action is an arbitrary INTEGER, and timestamp is an INTEGER representing UNIX time.

The database has 200M rows, and there are 70K distinct user_ids.

Goal

I need to make a Python dictionary that looks like:

{
    "Alice":[(0, 1551683796), (23, 1551683797)],
    "James":[(1, 1551683798)],
    ...
}

that maps user_ids to their respective event logs, where each log is a list of (action, timestamp) tuples. Ideally each list would be sorted by timestamp in increasing order, but even if it isn't, that is easy to achieve by sorting each list after the dictionary is built, as sketched below.
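For reference, sorting after the fact is cheap relative to the query itself; a minimal sketch, assuming the dictionary is named classified_log as in the code further down:

for log in classified_log.values():
    log.sort(key=lambda event: event[1])  # sort each user's log in place by timestamp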

Effort

I have the following code to query the database. It first queries for the list of users (with user_list_cursor), and then queries for all rows belonging to each user.

import sqlite3
connection = sqlite3.connect("database.db")
user_list_cursor = connection.cursor()
user_list_cursor.execute("SELECT DISTINCT user_id FROM EVENT_LOG")
user_id = user_list_cursor.fetchone()

classified_log = {}
log_cursor = connection.cursor()
while user_id:
    user_id = user_id[0]  # cursor.fetchone() returns a one-element tuple
    query = (
        "SELECT action, timestamp"
        " FROM EVENT_LOG"
        " WHERE user_id = ?"
        " ORDER BY timestamp ASC"
    )
    parameters = (user_id,)
    log_cursor.execute(query, parameters)  # Here is the bottleneck
    classified_log[user_id] = log_cursor.fetchall()  # list of (action, timestamp) tuples
    user_id = user_list_cursor.fetchone()  # advance to the next user

Problem

The query execution for each user is too slow. The single line of code commented as the bottleneck takes around 10 seconds for each user_id. I think I am taking the wrong approach with the queries. What is the right way to achieve the goal?
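A quick way to see why each lookup is slow is SQLite's EXPLAIN QUERY PLAN, which reports whether a query scans the whole table or uses an index. A minimal sketch; the "Alice" parameter is just a placeholder user_id:

import sqlite3
connection = sqlite3.connect("database.db")
cursor = connection.cursor()
cursor.execute(
    "EXPLAIN QUERY PLAN"
    " SELECT action, timestamp FROM EVENT_LOG"
    " WHERE user_id = ? ORDER BY timestamp ASC",
    ("Alice",),
)
# Without an index on user_id, the reported plan is a full-table SCAN.
print(cursor.fetchall())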

I tried searching with keywords "classify db by a column", "classify sql by a column", "sql log to dictionary python", but nothing seems to match my situation. I think this wouldn't be a rare need, so maybe I'm missing the right keyword to search with.

Reproducibility

If anyone is willing to reproduce the situation with a 200M row sqlite database, the following code will create a 5GB database file.

But I hope there is somebody who is familiar with such a situation and knows how to write the right query.

import sqlite3
import random

connection = sqlite3.connect("tmp.db")
cursor = connection.cursor()
cursor.execute(
    "CREATE TABLE IF NOT EXISTS EVENT_LOG (user_id TEXT, action INTEGER, timestamp INTEGER)"
)
query = "INSERT INTO EVENT_LOG VALUES (?, ?, ?)"

def generate_parameters():
    # Yield rows lazily so 200M tuples never have to fit in memory at once.
    for timestamp in range(200_000_000):
        user_id = f"user{random.randint(0, 70000)}"
        action = random.randint(0, 1_000_000)
        yield (user_id, action, timestamp)

cursor.executemany(query, generate_parameters())  # executemany accepts any iterable
connection.commit()
cursor.close()
connection.close()
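One optional tweak if the bulk load takes too long: relaxing SQLite's durability settings speeds up inserts considerably, at the cost of crash safety, which hardly matters for throwaway test data. These would go right after the connect call:

# Optional: faster bulk load for test data only (no crash safety).
connection.execute("PRAGMA journal_mode = OFF")  # skip the rollback journal
connection.execute("PRAGMA synchronous = OFF")   # don't fsync after each write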

Answer

Big thanks to @Strawberry and @Solarflare for their help given in comments.

The following solution achieved a more than 70x performance increase, so I'm leaving what I did as an answer for completeness' sake.

I created an index and queried for the whole table at once, as they suggested.

import sqlite3
from operator import itemgetter

connection = sqlite3.connect("database.db")

# Creating index, thanks to @Solarflare
cursor = connection.cursor()
cursor.execute("CREATE INDEX IF NOT EXISTS idx_user_id ON EVENT_LOG (user_id)")
connection.commit()

# Reading the whole table, then making lists by user_id. Thanks to @Strawberry
cursor.execute("SELECT user_id, action, timestamp FROM EVENT_LOG ORDER BY user_id ASC")
previous_user_id = None
log_per_user = list()
classified_log = dict()
for row in cursor:
    user_id, action, timestamp = row
    if user_id != previous_user_id:
        if previous_user_id is not None:
            log_per_user.sort(key=itemgetter(1))  # sort by timestamp
            classified_log[previous_user_id] = log_per_user
        log_per_user = list()
    log_per_user.append((action, timestamp))
    previous_user_id = user_id
# Flush the last user's log, which the loop above never reaches.
if previous_user_id is not None:
    log_per_user.sort(key=itemgetter(1))
    classified_log[previous_user_id] = log_per_user
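An untested variation on the above: a composite index on (user_id, timestamp) should let SQLite return each user's rows already time-sorted, making the per-user sort in Python unnecessary:

# Untested: a composite index can satisfy ORDER BY user_id, timestamp directly,
# so rows arrive grouped by user_id and pre-sorted by timestamp within each user.
cursor.execute(
    "CREATE INDEX IF NOT EXISTS idx_user_ts ON EVENT_LOG (user_id, timestamp)"
)
cursor.execute(
    "SELECT user_id, action, timestamp FROM EVENT_LOG"
    " ORDER BY user_id ASC, timestamp ASC"
)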

So the points are:

  • Indexing by user_id to make ORDER BY user_id ASC execute in acceptable time.
  • Reading the whole table once and classifying by user_id in Python, instead of issuing an individual query for each user_id (an equivalent itertools.groupby formulation is sketched below).
  • Iterating over the cursor to read rows one by one, instead of loading all 200M rows at once with cursor.fetchall(), to keep memory use bounded.
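For completeness, the manual previous_user_id bookkeeping can also be written with itertools.groupby, which groups consecutive rows sharing a key, exactly the shape ORDER BY user_id produces. A minimal equivalent sketch:

import sqlite3
from itertools import groupby
from operator import itemgetter

connection = sqlite3.connect("database.db")
cursor = connection.cursor()
cursor.execute("SELECT user_id, action, timestamp FROM EVENT_LOG ORDER BY user_id ASC")

# One group per run of consecutive rows with the same user_id.
classified_log = {
    user_id: sorted(((action, ts) for _, action, ts in rows), key=itemgetter(1))
    for user_id, rows in groupby(cursor, key=itemgetter(0))
}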