Efficiently Classifying a DB of Event Logs by a Column
Situation
I am using Python 3.7.2 with its built-in sqlite3 module. (sqlite3.version == 2.6.0)
I have a sqlite database that looks like:
| user_id | action | timestamp |
| ------- | ------ | ---------- |
| Alice | 0 | 1551683796 |
| Alice | 23 | 1551683797 |
| James | 1 | 1551683798 |
| ....... | ...... | .......... |
where `user_id` is TEXT, `action` is an arbitrary INTEGER, and `timestamp` is an INTEGER representing UNIX time.
The database has 200M rows, and there are 70K distinct `user_id`s.
Goal
I need to make a Python dictionary that looks like:
{
"Alice":[(0, 1551683796), (23, 1551683797)],
"James":[(1, 1551683798)],
...
}
that has `user_id`s as keys and the respective event logs as values, where each value is a list of (action, timestamp) tuples. Hopefully each list will be sorted by `timestamp` in increasing order, but even if it isn't, that can easily be achieved by sorting each list after the dictionary is made.
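(For reference, a minimal post-sort over the finished dictionary could look like the sketch below, assuming the dictionary is named classified_log as in the code further down:)

```python
# Sort each user's event log in place by timestamp
# (index 1 of each (action, timestamp) tuple).
for log in classified_log.values():
    log.sort(key=lambda event: event[1])
```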
Effort
I have the following code to query the database. It first queries for the list of users (with `user_list_cursor`), and then queries for all rows belonging to each user.
import sqlite3

connection = sqlite3.connect("database.db")

user_list_cursor = connection.cursor()
user_list_cursor.execute("SELECT DISTINCT user_id FROM EVENT_LOG")
user_id = user_list_cursor.fetchone()

classified_log = {}
log_cursor = connection.cursor()
while user_id:
    user_id = user_id[0]  # cursor.fetchone() returns a tuple
    query = (
        "SELECT action, timestamp"
        " FROM EVENT_LOG"
        " WHERE user_id = ?"
        " ORDER BY timestamp ASC"
    )
    parameters = (user_id,)
    log_cursor.execute(query, parameters)  # Here is the bottleneck
    classified_log[user_id] = list()
    for row in log_cursor.fetchall():
        classified_log[user_id].append(row)
    user_id = user_list_cursor.fetchone()
Problem
The query execution for each user is too slow. The single line commented as the bottleneck takes around 10 seconds for each `user_id`. I think I am taking the wrong approach with the queries. What is the right way to achieve the goal?
I tried searching with keywords like "classify db by a column", "classify sql by a column", and "sql log to dictionary python", but nothing seems to match my situation. This shouldn't be a rare need, so maybe I'm missing the right keyword to search with.
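One way to see what the bottleneck query is doing is to ask SQLite for its query plan. A minimal sketch: on a table with no index on user_id, the detail column typically reports a full table scan (e.g. a string like "SCAN TABLE EVENT_LOG"; the exact wording and column layout vary by SQLite version):

```python
import sqlite3

connection = sqlite3.connect("database.db")
cursor = connection.cursor()
# Each result row ends with a human-readable "detail" string; without an
# index on user_id it describes a full scan of EVENT_LOG.
for row in cursor.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT action, timestamp FROM EVENT_LOG WHERE user_id = ? ORDER BY timestamp ASC",
    ("Alice",),
):
    print(row)
```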
Reproducibility
If anyone is willing to reproduce the situation with a 200M row sqlite database, the following code will create a 5GB database file.
But I hope there is somebody who is familiar with such a situation and knows how to write the right query.
import sqlite3
import random

connection = sqlite3.connect("tmp.db")
cursor = connection.cursor()
cursor.execute(
    "CREATE TABLE IF NOT EXISTS EVENT_LOG (user_id TEXT, action INTEGER, timestamp INTEGER)"
)

query = "INSERT INTO EVENT_LOG VALUES (?, ?, ?)"
BATCH_SIZE = 1_000_000  # flush in batches so the parameter list stays bounded in memory
parameters = []
for timestamp in range(200_000_000):
    user_id = f"user{random.randint(0, 69999)}"  # 70K distinct user_ids
    action = random.randint(0, 1_000_000)
    parameters.append((user_id, action, timestamp))
    if len(parameters) == BATCH_SIZE:
        cursor.executemany(query, parameters)
        parameters = []
if parameters:
    cursor.executemany(query, parameters)

connection.commit()
cursor.close()
connection.close()
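If generating the file takes too long, SQLite's journaling and sync overhead can be relaxed. A sketch, assuming tmp.db is disposable test data (these PRAGMAs trade durability for speed, so don't use them on data you care about):

```python
import sqlite3

connection = sqlite3.connect("tmp.db")
cursor = connection.cursor()
cursor.execute("PRAGMA journal_mode = OFF")  # skip the rollback journal entirely
cursor.execute("PRAGMA synchronous = OFF")   # don't fsync after every write
# ... then run the insert loop above on this connection ...
```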
Answer
Big thanks to @Strawberry and @Solarflare for their help in the comments.
The following solution achieved a more than 70x performance increase, so I'm leaving what I did as an answer for completeness' sake.
I used an index and queried the whole table, as they suggested.
import sqlite3
from operator import itemgetter

connection = sqlite3.connect("database.db")
cursor = connection.cursor()

# Creating index, thanks to @Solarflare
cursor.execute("CREATE INDEX IF NOT EXISTS idx_user_id ON EVENT_LOG (user_id)")
connection.commit()

# Reading the whole table, then making lists by user_id. Thanks to @Strawberry
cursor.execute("SELECT user_id, action, timestamp FROM EVENT_LOG ORDER BY user_id ASC")
previous_user_id = None
log_per_user = list()
classified_log = dict()
for row in cursor:  # iterate over the cursor row by row instead of fetchall()
    user_id, action, timestamp = row
    if user_id != previous_user_id:
        if previous_user_id is not None:
            log_per_user.sort(key=itemgetter(1))  # sort by timestamp
            classified_log[previous_user_id] = log_per_user
        log_per_user = list()
    log_per_user.append((action, timestamp))
    previous_user_id = user_id
# Flush the final user's log once the cursor is exhausted
if previous_user_id is not None:
    log_per_user.sort(key=itemgetter(1))
    classified_log[previous_user_id] = log_per_user
So the points are:

- Indexing by `user_id` to make `ORDER BY user_id ASC` execute in acceptable time.
- Reading the whole table once, then classifying rows by `user_id`, instead of making an individual query for each `user_id` (an equivalent groupby formulation is sketched after this list).
- Iterating over `cursor` to read rows one by one, instead of `cursor.fetchall()`.
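Since the rows arrive already ordered by `user_id`, the same grouping can also be expressed with itertools.groupby. This is a sketch of an equivalent formulation, not the code I benchmarked:

```python
import sqlite3
from itertools import groupby
from operator import itemgetter

connection = sqlite3.connect("database.db")
cursor = connection.cursor()
cursor.execute("SELECT user_id, action, timestamp FROM EVENT_LOG ORDER BY user_id ASC")

# groupby emits one group per run of equal user_ids, so the cursor is still
# consumed row by row rather than loaded at once with fetchall().
classified_log = {
    user_id: sorted(
        ((action, timestamp) for _, action, timestamp in rows),
        key=itemgetter(1),  # sort each user's events by timestamp
    )
    for user_id, rows in groupby(cursor, key=itemgetter(0))
}
```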