Alternative to BigQuery for medium-sized data

I know SQL Server, so my answer is biased.

  1. 10M rows should easily fit in memory, so any kind of aggregation should be fast, especially if you have a covering index. If it isn't fast, the server configuration may need adjustment. Also, SQL Server has so-called in-memory (memory-optimized) tables, which may be a good fit here.

  2. SQL Server has a feature called indexed views, and your aggregating query is a classic use case for one. An indexed view is essentially a copy of the data stored on disk and maintained automatically by the server as the underlying table changes. It slows down INSERTs, DELETEs and UPDATEs, but makes SELECTs fast, because the summary is always pre-calculated. See: What You Can (and Can’t) Do With Indexed Views. Other DBMSes should have similar features.
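
To make the trade-off concrete, here is a minimal sketch of the same idea in Python with SQLite. This is not SQL Server's indexed view feature: SQLite has no indexed views, so the sketch imitates one with a summary table kept up to date by triggers. The table, column and trigger names (`sales`, `sales_summary`, `sales_ai` and so on) are made up for illustration.

```python
import sqlite3


def build_db():
    """Create a base table plus a trigger-maintained summary table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sales (
            id INTEGER PRIMARY KEY,
            region TEXT NOT NULL,
            amount INTEGER NOT NULL
        );
        -- Hypothetical stand-in for the indexed view: one row per region.
        CREATE TABLE sales_summary (
            region TEXT PRIMARY KEY,
            total INTEGER NOT NULL,
            n INTEGER NOT NULL
        );

        CREATE TRIGGER sales_ai AFTER INSERT ON sales BEGIN
            INSERT OR IGNORE INTO sales_summary (region, total, n)
                VALUES (NEW.region, 0, 0);
            UPDATE sales_summary SET total = total + NEW.amount, n = n + 1
                WHERE region = NEW.region;
        END;

        CREATE TRIGGER sales_ad AFTER DELETE ON sales BEGIN
            UPDATE sales_summary SET total = total - OLD.amount, n = n - 1
                WHERE region = OLD.region;
        END;

        CREATE TRIGGER sales_au AFTER UPDATE ON sales BEGIN
            UPDATE sales_summary SET total = total - OLD.amount, n = n - 1
                WHERE region = OLD.region;
            INSERT OR IGNORE INTO sales_summary (region, total, n)
                VALUES (NEW.region, 0, 0);
            UPDATE sales_summary SET total = total + NEW.amount, n = n + 1
                WHERE region = NEW.region;
        END;
    """)
    return conn


def read_summary(conn):
    """Cheap read: the totals are already computed."""
    return conn.execute(
        "SELECT region, total FROM sales_summary WHERE n > 0 ORDER BY region"
    ).fetchall()


def recompute(conn):
    """Expensive read: aggregate the base table from scratch."""
    return conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    ).fetchall()


conn = build_db()
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("east", 10), ("west", 5), ("east", 7)],
)
conn.execute("UPDATE sales SET amount = 20 WHERE region = 'west'")
conn.execute("DELETE FROM sales WHERE amount = 7")
print(read_summary(conn))
```

Every write now does extra work inside the triggers, which is the slowdown mentioned above, while the dashboard query only reads a handful of pre-computed rows. A real indexed view gives you the same effect without writing the maintenance logic yourself.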


2020 update: Check out BigQuery BI Engine, the built-in accelerator of queries for dashboards:

  • https://cloud.google.com/bi-engine/docs/overview

If you need answers in less than a second, you need to think about indexing.

Typical story:

  1. MySQL (or any other database proposed here) is fast, until...
  2. One day some of your aggregation queries start running slow. Minutes, hours, days, etc.
  3. The typical solution for step 2 is indexing and pre-aggregating. If you want answers in less than a second for a certain type of question, you'll need to invest time and optimization cycles to answer just that type of question.
  4. BigQuery's beauty is that you can skip step 3. Bring those minutes/hours/days to seconds, with minimal investment - any query, at any time.

BigQuery is awesome because it gives you 4. But you are asking for 3. MySQL is fine for that, and Elasticsearch is fine too: any indexed database will bring you results in less than a second, as long as you invest time in optimizing your system for a certain type of question. To get answers to any arbitrary question without investing any optimization time, use BigQuery.

BigQuery: Will answer arbitrary questions in seconds, no preparation needed.

MySQL and alternatives: Will answer certain types of questions in less than a second, but it will take development time to get there.
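
As a rough illustration of what "optimizing for a certain type of question" means, here is a small Python sketch using SQLite (chosen only because it ships with Python; the same idea applies to MySQL). The `events` table, its columns and the index name are assumptions for the example. The index is tailored to one specific query, and the query plan is inspected before and after creating it.

```python
import sqlite3

# The one "type of question" this setup is tuned for (hypothetical schema).
QUERY = "SELECT SUM(amount) FROM events WHERE country = ?"


def query_plan(conn, sql, params=()):
    """Return SQLite's plan description for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the fourth column of each plan row.
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, country TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO events (country, amount) VALUES (?, ?)",
    [("US" if i % 2 else "DE", i) for i in range(1000)],
)

# Without an index the only option is to scan the whole table.
plan_before = query_plan(conn, QUERY, ("US",))

# A covering index holds every column the query touches, so the
# table itself never has to be read for this question.
conn.execute("CREATE INDEX idx_events_country_amount ON events (country, amount)")
plan_after = query_plan(conn, QUERY, ("US",))

us_total = conn.execute(QUERY, ("US",)).fetchone()[0]
print(plan_before)
print(plan_after)
```

The index only helps questions shaped like `QUERY`. A filter on some other column would go back to scanning, and that is the development cost described above: each new type of question needs its own round of tuning.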


If you don't need concurrency (multiple users connecting simultaneously) and your data fits in a single disk file, then SQLite might be appropriate.

As they say, SQLite does not compete with client/server databases. SQLite competes with fopen().

http://www.sqlite.org/whentouse.html
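
For a sense of how little setup that involves, here is a minimal sketch with Python's built-in `sqlite3` module. The file name `analytics.db` and the `visits` table are placeholders, and a temporary directory is used only so the example cleans up after itself.

```python
import os
import sqlite3
import tempfile


def load_and_aggregate(path, rows):
    """Append rows to a single-file database and return totals per page."""
    conn = sqlite3.connect(path)  # creates the file if it does not exist
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS visits (page TEXT, hits INTEGER)"
            )
            conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
        return conn.execute(
            "SELECT page, SUM(hits) FROM visits GROUP BY page ORDER BY page"
        ).fetchall()
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "analytics.db")
    totals = load_and_aggregate(db_path, [("home", 3), ("docs", 2), ("home", 4)])
    file_exists = os.path.exists(db_path)

print(totals)
```

There is no server to install or configure: the whole database is that one file, which is exactly why it stops being a good fit once several users need to write at the same time.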


Here are a few alternatives to consider for data of this size:

  1. Single Redshift small SSD node
    • No setup. Easily returns answers on this much data in under 1s. 
  2. Greenplum on a small T2 instance
    • Postgres-like. Similar perf to Redshift. Not paying for storage you won't need. Start with their single node "sandbox" AMI.
  3. MariaDB Columnstore
    • MySQL-like. Used to be called InfiniDB. Very good performance. Supported by MariaDB (the company).
  4. Apache Drill
    • Drill has a very similar philosophy to BigQuery but can be used anywhere (it's just a jar). Queries will be fast on data of this size.

If low admin / quick start is critical, go with Redshift. If money / flexibility is critical, start with Drill. If you prefer MySQL, start with MariaDB Columnstore.