Manipulation performance of SQLite vs. a CSV file

One clear advantage of SQLite is indexing: you cannot index a CSV file. If you have to work with subsets of your large data set, creating an index on the relevant column in the SQLite table lets those lookups avoid scanning every row.
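As a rough illustration of what an index buys you, the sketch below asks SQLite for its query plan before and after creating an index. The `readings` table, its columns, and the index name are hypothetical, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table standing in for an imported CSV.
conn.execute("CREATE TABLE readings (sensor TEXT, reading INTEGER)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 5), ("b", 12), ("a", 20)],
)

query = "SELECT sensor, reading FROM readings WHERE sensor = ?"


def plan(conn, sql, params):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The detail text is the last column of each plan row.
    return " ".join(row[-1] for row in rows)


before = plan(conn, query, ("a",))  # no index yet: a full table scan
conn.execute("CREATE INDEX idx_readings_sensor ON readings (sensor)")
after = plan(conn, query, ("a",))   # the plan should now mention the index

print(before)
print(after)
```

Without the index the plan reports a scan of the whole table; with it, the plan names `idx_readings_sensor`, meaning SQLite looks up only the matching rows.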


Unless you're doing something very trivial to the CSV, and only doing it once, SQLite will be faster for runtime, coding time, and maintenance time, and it will be more flexible.

The major advantages of putting the CSV into SQLite are...

  • Query with a known query language.
  • Query with a flexible query language.
  • Take advantage of high performance indexing.
  • Avoid writing, maintaining, documenting, and testing a bunch of custom query code.

You can look at the costs like this:

SQLite

  • Once...
    • Create the schema.
    • Import the CSV into SQLite (built in: the sqlite3 shell has an .import command).
      • This may require you to write some code to translate the values.
    • [Optional, but recommended] Set up the indexes.
  • For each different query...
    • Do your query in SQL.
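The one-time SQLite steps above can be sketched in a few lines using Python's standard `csv` and `sqlite3` modules. This is a minimal sketch, not a definitive implementation: the sample data, the `readings` table, and the `load_csv` helper are all made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a real CSV file.
SAMPLE_CSV = """sensor,reading
a,5
b,12
a,20
c,7
"""


def load_csv(conn, csv_file):
    """Create the schema, import the rows, and index the lookup column."""
    conn.execute("CREATE TABLE readings (sensor TEXT, reading INTEGER)")
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row
    # Translate values on the way in: every CSV field arrives as a string.
    rows = ((sensor, int(reading)) for sensor, reading in reader)
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.execute("CREATE INDEX idx_readings_sensor ON readings (sensor)")
    conn.commit()


conn = sqlite3.connect(":memory:")
load_csv(conn, io.StringIO(SAMPLE_CSV))

# Each new question is just another SQL statement, with no new parsing code.
big = conn.execute(
    "SELECT sensor, reading FROM readings WHERE reading > 10 ORDER BY reading"
).fetchall()
per_sensor = conn.execute(
    "SELECT sensor, COUNT(*) FROM readings GROUP BY sensor ORDER BY sensor"
).fetchall()

print(big)
print(per_sensor)
```

The setup cost is paid once in `load_csv`; after that, the filter and the per-sensor count are each a single query.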

CSV

  • For each different query...
    • Write special code for your query.
    • Document how to use this special code.
    • Test your special query code.
    • Debug your special query code.
    • Run your special query code which has to...
      • Read the CSV file.
      • Parse the CSV file.
      • (Optional) Index the CSV file.
        • Come up with an indexing scheme.
      • Run your query.

Note that if your query is simple, parsing and running can happen together in a single pass over the file. Something like "find all rows where field 5 is greater than 10".
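For comparison, here is roughly what that simple case looks like as custom code, assuming the fifth field is numeric. The function name and sample data are made up for illustration; note that the handling of headers and malformed rows is exactly the sort of special case you end up owning.

```python
import csv
import io


def rows_where_field_gt(csv_file, field_index, threshold):
    """Yield rows whose field at field_index (0-based) exceeds threshold."""
    for row in csv.reader(csv_file):
        try:
            value = float(row[field_index])
        except (IndexError, ValueError):
            continue  # skip short rows, headers, and non-numeric values
        if value > threshold:
            yield row


# Hypothetical sample data; a real caller would pass an open file instead.
sample = io.StringIO("a,b,c,d,e\n1,2,3,4,5\n1,2,3,4,11\n1,2,3,4,42\n")
matches = list(rows_where_field_gt(sample, 4, 10))  # field 5 is index 4
print(matches)
```

This reads and filters in one pass, but it answers only this one question. A different comparison, a second condition, or a sort order each means more code to write and test.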


It's easy to forget that even if you use a library to do the CSV parsing, there are coding and maintenance costs to writing special code to query a CSV file. Every query has to be coded, tested, and debugged. Every special case or option has to be coded, tested, and debugged.

Since it's all special stuff you made up, there's no convention to follow. People coming to use your query program have to understand what it does and how it works. If they want to do anything even slightly different, they (or you) have to get into the code, understand it, modify it, test it, debug it, and document it. This will generate a lot of support requests.

In contrast, SQLite requires you to write little or no special code beyond the SQL queries. SQL is a commonly known query language. You can say "this is a SQLite database" and it's very likely people will know what to do. If they don't, they can go learn SQL, which is generally applicable knowledge, whereas learning your special CSV query program is one-off knowledge.

If people want to run a query you didn't anticipate they can just write the SQL themselves. You don't need to be bothered, and they don't need to puzzle out a bunch of code.

Finally, SQLite's query time with a well-indexed table will be far better than anything you or I are likely to write. SQLite is developed and maintained by many, many database experts. You're probably not going to outperform the carefully optimized code they've written in C. Even if you can edge out a bit of performance, don't you have better things to do?
