Which database could handle storage of billions/trillions of records?

At the company I work for, we are dealing with a similar amount of data (around 10 TB of real-time searchable data). We solve this with Cassandra, and I would like to mention a couple of ideas that will allow you to do O(1) searches on a multi-TB database. This is not specific to Cassandra, though; you can use these ideas with other databases as well.

Theory

  • Shard your data. There is no way a single server will reliably and realistically hold such a volume of data.
  • Be ready for hardware faults and whole-node failures; replicate the data.
  • Start using many back-end servers from the beginning.
  • Use many cheaper commodity servers rather than a few top-end, high-performance ones.
  • Make sure data is equally distributed across shards.
  • Spend a lot of time planning your queries. Derive the API from the queries and then carefully design the tables. This is the most important and most prolonged task.
  • In Cassandra, you can design a composite column key and get access to records by that key in O(1). Spend time working on these keys; they are what you will use to access searchable records instead of a secondary index (see the CQL sketch after this list).
  • Make use of wide rows. They are useful for storing time-stamped events.
  • Never perform a full scan, or in fact any operation more expensive than O(log N), on such a volume. If you require anything more than O(log N), offload such operations to MapReduce algorithms.
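
As an illustration of the composite-key and wide-row points above, here is a minimal CQL sketch (the table and column names are invented for this example, and it assumes CQL3): the partition key gives O(1) access to a source's partition, and the clustering column turns that partition into a wide row of time-stamped events.

CREATE TABLE events_by_source (
    src_addr    text,       /* partition key: locates the partition in O(1) via its hash */
    event_time  timestamp,  /* clustering column: orders events within the wide row */
    payload     blob,
    PRIMARY KEY ((src_addr), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

/* All events for one source within a time window, served from a single partition */
SELECT * FROM events_by_source
WHERE src_addr = ? AND event_time >= ? AND event_time < ?;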

Practice

  • Don't spend time building OS images or installing servers on physical machines. Use cloud-based providers for quick prototyping. I have worked with Amazon EC2 and can highly recommend it for its simplicity, reliability, and speed of prototyping.
  • Windows machines tend to be slower to boot and consume considerably more resources while idle. Consider using a Unix-based OS. Personally, I have found Ubuntu Server to be a reliable OS, and moreover there is a pretty good community at askubuntu.
  • Think about networking: nodes should ideally be close to each other to allow fast gossiping and metadata exchange.
  • Do not go to extremes: really wide rows or exceptionally long column families (tables). The best performance is achieved within sane boundaries; just because the database supports N rows by design does not mean it performs well at that scale.
  • Our search takes about 3-5 seconds, much of which is due to the intermediate nodes between the UI and the database. Consider how to bring requests closer to the database.
  • Use a network load balancer, and choose an established one. We use HAProxy, which is simple but dead fast; we have never had problems with it.
  • Prefer simplicity to complex solutions.
  • Look for free open-source solutions unless you are backed by a corporation-sized budget. Once you go beyond several servers, infrastructure costs can go sky-high.

I do not work for Amazon and have no relationship with the HAProxy or Ubuntu teams. This is a personal opinion rather than any sort of promotion.


If I were going to put this into SQL Server, I would suggest a table something like:

CREATE TABLE tcp_traffic
(
    tcp_traffic_id bigint constraint PK_tcp_traffic primary key clustered IDENTITY(1,1)
    , tcp_flags smallint    /* at most 9 bits in TCP, so use SMALLINT */
    , src_as int        /* Since there are less than 2 billion A.S.'s possible, use INT */
    , netxhop bigint    /* use a big integer for the IP address instead of storing
                             it as dotted-decimal */
    , unix_secs bigint  
    , src_mask int      /* an assumption */
    , tos tinyint       /* values are 0-255, see RFC 791 */
    , prot tinyint      /* values are 0-255, see RFC 790 */
    , input int         /* an assumption */
    , doctets int       /* an assumption */
    , engine_type int   /* an assumption */
    , exaddr bigint     /* use a big integer for the IP address instead of storing
                             it as dotted-decimal */
    , engine_id int     /* an assumption */
    , srcaddr bigint    /* use a big integer for the IP address instead of storing
                             it as dotted-decimal */
    , dst_as int        /* Since there are less than 2 billion A.S.'s possible, use INT */
    , unix_nsecs bigint /* an assumption */
    , sysuptime bigint  /* an assumption */
    , dst_mask int      /* an assumption */
    , dstport smallint  /* ports are 0-65535; a signed smallint holds 0-32767, so offset larger values or widen to int */
    , [last] bigint     /* an assumption */
    , srcport smallint  /* ports are 0-65535; a signed smallint holds 0-32767, so offset larger values or widen to int */
    , dpkts int         /* an assumption */
    , output int        /* an assumption */
    , dstaddr bigint    /* use a big integer for the IP address instead of storing
                            it as dotted-decimal */
    , [first] bigint    /* an assumption */
);

This results in a total estimated storage requirement for the single table, with no further indexes, of 5.5 TB for 43.2 beeellion records (your specified requirement). This is calculated as 130 bytes for the data itself, plus 7 bytes of per-row overhead, plus 96 bytes of per-page overhead. SQL Server stores data in 8 KB pages, allowing for 59 rows per page. This equates to 732,203,390 pages for a single month of data.

SQL Server likes writing to disk in 8-page chunks (64 KB), which equates to 472 rows per physical I/O. With 16,203 flow records being generated every second, you will need a minimum I/O rate of 34 IOps, guaranteed each and every second. Although this by itself is not a huge amount, other I/O in the system (SQL Server and otherwise) must never infringe on this necessary rate of IOps. Therefore you would need to design a system capable of at least an order of magnitude more IOps, or 340 sustained IOps; I would tend to estimate that you need two orders of magnitude more sustainable IOps to guarantee throughput.
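
As a rough sanity check, the arithmetic in the last two paragraphs can be reproduced in a few lines of T-SQL. Every constant below is simply one of the estimates quoted above, so treat this as a back-of-the-envelope sketch rather than an exact sizing:

/* Rough sizing check; all constants are the estimates quoted above */
SELECT
      (8192 - 96) / (130 + 7)                                  AS rows_per_page      /* 59 rows per 8 KB page */
    , CEILING(43.2E9 / 59)                                     AS pages_per_month    /* ~732,203,390 pages */
    , CEILING(43.2E9 / 59) * 8192 / 1024 / 1024 / 1024 / 1024  AS storage_tb         /* ~5.5 TB */
    , 59 * 8                                                   AS rows_per_64kb_io   /* 472 rows per 64 KB write */
    , 16203.0 / (59 * 8)                                       AS min_sustained_iops /* ~34 IOps */
;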

You will notice I am not storing the IP addresses in their dotted-decimal form. This saves a huge amount on storage (7 bytes per address), and also makes indexing, retrieving, sorting, and comparing IP addresses far, far more efficient. The downside is that you need to convert the dotted-decimal IPs into 8-byte integers before storing them, and back to dotted-decimal for display. The code to do so is trivial; however, at your row rate it will add a substantial amount of processing overhead to each flow row being processed, so you may want to do this conversion on a physically different machine from SQL Server.
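
For illustration, that conversion might look something like this in T-SQL (the variable names and sample address are only for this example): PARSENAME splits the dotted-decimal string into octets, and plain integer arithmetic turns the number back into a string.

/* Dotted-decimal -> bigint, and back again for display */
DECLARE @ip varchar(15) = '192.168.10.25';

DECLARE @addr bigint =
      CAST(PARSENAME(@ip, 4) AS bigint) * 16777216  /* 256^3 */
    + CAST(PARSENAME(@ip, 3) AS bigint) * 65536     /* 256^2 */
    + CAST(PARSENAME(@ip, 2) AS bigint) * 256
    + CAST(PARSENAME(@ip, 1) AS bigint);

SELECT @addr AS ip_as_bigint
     , CAST(@addr / 16777216 % 256 AS varchar(3)) + '.'
     + CAST(@addr / 65536    % 256 AS varchar(3)) + '.'
     + CAST(@addr / 256      % 256 AS varchar(3)) + '.'
     + CAST(@addr            % 256 AS varchar(3)) AS ip_dotted;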

Discussing the indexes you require is a totally separate matter, since you have not listed any specific requirements. The design of this table will store flow rows in the physical order they are received by SQL Server; the tcp_traffic_id field is unique for each record and allows sorting rows by the order in which they were recorded (in this case, most likely relating one-to-one to the time of the flow event).


I would recommend HBase. You can store all the raw data in one or more HBase tables, depending on what you need to query. HBase can handle large data-sets and does auto-sharding through region splits.

In addition, if you design row keys well, you can get extremely fast, even O(1), queries. Note that if you are retrieving a large data set, that is still going to be slow, since reading back n rows is inherently an O(n) operation.

Since you want to query across each field, I would recommend creating a separate table for each of them. As an example, for the src_address data, have a table whose row keys look like this:

1.2.3.4_timestamp1 : { data }
1.2.3.4_timestamp2 : { data }

So if you want to query for all data for 1.2.3.4 from Mar 27 12:00 AM to Mar 27 12:01 AM, you can do a range scan with the start and stop rows specified.

IMHO, the row key design is the most critical part of using HBase - if you design it well, you will be able to do fast queries AND store large volumes of data.