How to use time-series with Sqlite, with fast time-range queries?

First solution

The method (2) detailed in the question seems to work well. In a benchmark, I obtained:

  • naive method, without index: 18 MB database, 86 ms query time
  • naive method, with index: 32 MB database, 12 ms query time
  • method (2): 18 MB database, 12 ms query time

The key point is here to use dt as an INTEGER PRIMARY KEY, so it will be the row id itself (see also Is an index needed for a primary key in SQLite?), using a B-tree, and there will not be another hidden rowid column. Thus we avoid an extra index which would make a correspondance dt => rowid: here dt is the row id.

We also use AUTOINCREMENT which internally creates a sqlite_sequence table, which keeps track of the last added ID. This is useful when inserting: since it is possible that two events have the same timestamp in seconds (it would be possible even with milliseconds or microseconds timestamps, the OS could truncate the precision), we use the maximum between timestamp*10000 and last_added_ID + 1 to make sure it's unique:

 MAX(?, (SELECT seq FROM sqlite_sequence) + 1)

Code:

import sqlite3, random, time
db = sqlite3.connect('test.db')
db.execute("CREATE TABLE data(dt INTEGER PRIMARY KEY AUTOINCREMENT, label TEXT);")

t = 1600000000
for i in range(1000*1000):
    if random.randint(0, 100) == 0:  # timestamp increases of 1 second with probability 1%
        t += 1
    db.execute("INSERT INTO data(dt, label) VALUES (MAX(?, (SELECT seq FROM sqlite_sequence) + 1), 'hello');", (t*10000, ))
db.commit()

# t will range in a ~ 10 000 seconds window
t1, t2 = 1600005000*10000, 1600005100*10000  # time range of width 100 seconds (i.e. 1%)
start = time.time()
for _ in db.execute("SELECT 1 FROM data WHERE dt BETWEEN ? AND ?", (t1, t2)): 
    pass
print(time.time()-start)

Using a WITHOUT ROWID table

Here is another method with WITHOUT ROWID which gives a 8 ms query time. We have to implement an auto-incrementing id ourself, since AUTOINCREMENT is not available when using WITHOUT ROWID.
WITHOUT ROWID is useful when we want to use a PRIMARY KEY(dt, another_column1, another_column2, id) and avoid to have an extra rowid column. Instead of having one B-tree for rowid and one B-tree for (dt, another_column1, ...), we'll have just one.

db.executescript("""
    CREATE TABLE autoinc(num INTEGER); INSERT INTO autoinc(num) VALUES(0);

    CREATE TABLE data(dt INTEGER, id INTEGER, label TEXT, PRIMARY KEY(dt, id)) WITHOUT ROWID;
    
    CREATE TRIGGER insert_trigger BEFORE INSERT ON data BEGIN UPDATE autoinc SET num=num+1; END;
    """)

t = 1600000000
for i in range(1000*1000):
    if random.randint(0, 100) == 0: # timestamp increases of 1 second with probabibly 1%
        t += 1
    db.execute("INSERT INTO data(dt, id, label) VALUES (?, (SELECT num FROM autoinc), ?);", (t, 'hello'))
db.commit()

# t will range in a ~ 10 000 seconds window
t1, t2 = 1600005000, 1600005100  # time range of width 100 seconds (i.e. 1%)
start = time.time()
for _ in db.execute("SELECT 1 FROM data WHERE dt BETWEEN ? AND ?", (t1, t2)): 
    pass
print(time.time()-start)

Roughly-sorted UUID

More generally, the problem is linked to having IDs that are "roughly-sorted" by datetime. More about this:

  • ULID (Universally Unique Lexicographically Sortable Identifier)
  • Snowflake
  • MongoDB ObjectId

All these methods use an ID which is:

[---- timestamp ----][---- random and/or incremental ----]

I am not expert in SqlLite, but have worked with databases and time series. I have hade similar situation previously, and I would share my conceptual solution.

You have some how part of the answer in your question, but not the way of doing it.

The way I did it, creating 2 tables, one table (main_logs) will log time in seconds incrementation as date as integer as primary key and the other table logs contain all logs (main_sub_logs) that made in that particular time that in your case can be up to 10000 logs per second in it. The main_sub_logs has reference to main_logs and it contain for each log second and X number of logs belong to that second with own counter id, that starts over again.

This way you limit your time series look up to seconds of event windows instead of all logs in one place.

This way you can join those two tables and when you look up from in first table between 2 specific time you get all logs in between.

So what here is how I created my 2 tables:

CREATE TABLE IF NOT EXISTS main_logs (
  id INTEGER PRIMARY KEY
);

CREATE TABLE IF NOT EXISTS main_sub_logs (
   id INTEGER,
   ref INTEGER,
   log_counter INTEGER,
   log_text text,
   PRIMARY KEY (id), 
   FOREIGN KEY (ref) REFERENCES main_logs(id)
)

I have inserted some dummy data:

enter image description here

Now lets query all logs between 1608718655 and 1608718656

SELECT * FROM main_logs AS A
JOIN main_sub_logs AS B ON A.id == B.Ref
WHERE A.id >= 1608718655 AND A.id <= 1608718656

Will get this result:

enter image description here