Get distinct values through direct query of an index
I've never used SQLite so bear with me here. But it seems as if this problem is common among many RDBMS platforms.
When you select distinct values from a column, the engine ends up scanning every row in the index.
This can be a great strategy if there aren't many rows in the table or if the column doesn't have very many duplicate values. But if you have millions of rows for each distinct value then you'll scan millions of rows just to return a single unique value. For data sets like that, it can sometimes be better to get the first distinct value, then skip to the next value, and so on. This can be accomplished via recursion in some platforms. You can also run one query at a time with each getting the next distinct value. For example, you could get the first value with this query:
SELECT MIN(chromosome) FROM dbsnp;
Then get the next value with this query (substituting the result of the previous query into the filter):
SELECT chromosome FROM dbsnp WHERE chromosome > 'TEST_1' ORDER BY chromosome LIMIT 1;
And the next:
SELECT chromosome FROM dbsnp WHERE chromosome > 'TEST_2' ORDER BY chromosome LIMIT 1;
And so on. For these queries I'm getting index seeks instead of a full scan.
For a relatively small data set, the single distinct query took about 320 ms and the series of LIMIT 1 queries took only 4 ms. You'll of course need to write more code to use this solution, but it might be worth a shot.
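If you want to script this, a rough Python sketch using the standard sqlite3 module might look like the following. It assumes the dbsnp table with an index on chromosome, as in the queries above; the function names are made up for illustration. The second function is one way to express the recursive variant mentioned earlier, as a recursive CTE (SQLite supports these from version 3.8.3). Neither is benchmarked here, so treat it as a starting point rather than a finished solution.

```python
import sqlite3


def distinct_by_seeking(conn):
    """Collect distinct chromosome values with one small lookup per value.

    NULLs are skipped, because MIN() ignores them and '>' never matches them.
    """
    values = []
    # First value: the smallest entry in the column.
    current = conn.execute("SELECT MIN(chromosome) FROM dbsnp").fetchone()[0]
    while current is not None:
        values.append(current)
        # Jump past every duplicate of the current value to the next one.
        row = conn.execute(
            "SELECT chromosome FROM dbsnp WHERE chromosome > ? "
            "ORDER BY chromosome LIMIT 1",
            (current,),
        ).fetchone()
        current = row[0] if row else None
    return values


def distinct_by_recursive_cte(conn):
    """Same idea in a single statement, using a recursive CTE."""
    sql = """
        WITH RECURSIVE seen(value) AS (
            SELECT MIN(chromosome) FROM dbsnp
            UNION ALL
            SELECT (SELECT MIN(chromosome) FROM dbsnp
                    WHERE chromosome > seen.value)
            FROM seen
            WHERE seen.value IS NOT NULL
        )
        SELECT value FROM seen WHERE value IS NOT NULL
    """
    return [row[0] for row in conn.execute(sql)]
```

The loop version makes one round trip per distinct value, which is fine when there are only a handful of values. The CTE version keeps everything inside one statement.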
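In SQLite you can use INDEXED BY my_index to force the query planner to use a particular named index (https://www.tutorialspoint.com/sqlite/sqlite_indexed_by.htm). If that index can't be used for the query, the statement fails instead of silently falling back to another plan.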
Try giving this a shot, replacing the name after INDEXED BY with the name of your index on chromosome:
SELECT chromosome FROM dbsnp INDEXED BY chromosome GROUP BY chromosome
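To check which index a query actually uses, you can ask SQLite for its plan. Here is a small sketch in Python with the standard sqlite3 module. The helper name query_plan and the index name idx_chromosome are placeholders I'm assuming for illustration, and the exact wording of the plan text differs between SQLite versions.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the detail strings SQLite reports for EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail text is the last column of each row.
    return [row[-1] for row in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dbsnp (chromosome TEXT, pos INTEGER)")
    # Hypothetical index name; substitute the name of your own index.
    conn.execute("CREATE INDEX idx_chromosome ON dbsnp (chromosome)")
    for line in query_plan(
        conn,
        "SELECT chromosome FROM dbsnp INDEXED BY idx_chromosome "
        "GROUP BY chromosome",
    ):
        print(line)
```

If the named index shows up in the printed plan, the planner is reading from it. Note that this only tells you which index is used, not how long the query takes on a large table.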
Edit: This does slow down a lot after 100 million records. After playing around some, you could be better off changing your program to do a quick check on the DB beforehand:
db.execute("SELECT COUNT(*) FROM (SELECT 1 FROM dbsnp WHERE chromosome = ? LIMIT 1) sub", input).fetchall()
The query returns a single row containing 1 if the value exists or 0 if it doesn't, and it runs fast.
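Wrapped up as a helper, that check might look something like this. It's a sketch with Python's standard sqlite3 module; the function name chromosome_exists is just an illustration, and the value is passed as a one-element tuple because sqlite3 expects a sequence of parameters.

```python
import sqlite3


def chromosome_exists(conn, chromosome):
    """Return True if at least one row has the given chromosome value."""
    row = conn.execute(
        "SELECT COUNT(*) FROM "
        "(SELECT 1 FROM dbsnp WHERE chromosome = ? LIMIT 1) sub",
        (chromosome,),  # one-element tuple of bound parameters
    ).fetchone()
    # The inner LIMIT 1 stops at the first match, so the count is 0 or 1.
    return row[0] == 1
```

With an index on chromosome, the inner query should only need to find the first matching entry, so the cost shouldn't depend much on how many duplicates there are.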