Autoincrement vs composite primary key for a big table on InnoDB

There is nothing wrong with a composite key. However, you have to take into account how InnoDB stores data.

Quoting the above linked documentation:

The data in each InnoDB table is divided into pages. The pages that make up each table are arranged in a tree data structure called a B-tree index. Table data and secondary indexes both use this type of structure. The B-tree index that represents an entire table is known as the clustered index, which is organized according to the primary key columns. The nodes of the index data structure contain the values of all the columns in that row (for the clustered index) or the index columns and the primary key columns (for secondary indexes).

That is, InnoDB will store the data in PRIMARY KEY order. If the rows you insert have an ever-increasing PK, new rows always go at the end and page fragmentation does not occur. That is always the case with an AUTO_INCREMENT. If you're inserting the data in chronological order (i.e. gDateTime is always monotonically increasing), changing the order of the columns that make up your PK to:

PRIMARY KEY (`gDateTime`, `alarmTypeID`, `vehicleID`)

... will have the same advantages, with regard to not having to "fit a new row in the middle of others" (which means, the B-tree isn't fragmented for every insert).
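As a rough illustration of the point above, here is a minimal Python sketch (not InnoDB itself) that inserts keys into a sorted list and counts how many inserts land at the very end. The event data, the ID ranges, and the helper name `count_appends` are assumptions made up for the example.

```python
import bisect
import random

def count_appends(keys):
    """Insert keys one by one into a sorted list; count inserts that land at the end."""
    ordered = []
    appends = 0
    for key in keys:
        pos = bisect.bisect_left(ordered, key)
        if pos == len(ordered):
            appends += 1
        ordered.insert(pos, key)
    return appends

random.seed(0)
# Hypothetical events arriving in chronological order: (gDateTime, alarmTypeID, vehicleID)
events = [(t, random.randrange(5), random.randrange(100)) for t in range(2000)]

# Same rows, keyed in the two column orders under discussion
time_first = [(t, a, v) for (t, a, v) in events]
alarm_first = [(a, v, t) for (t, a, v) in events]

appends_time_first = count_appends(time_first)
appends_alarm_first = count_appends(alarm_first)
print(appends_time_first, appends_alarm_first)
```

With the time-leading key every insert is an append; with the alarm-leading key most inserts fall somewhere in the middle of the existing keys, which is what causes page splits in a real B-tree.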

However, if you reference this table from other (related) tables, the referencing table always has to store the full PK (gDateTime, alarmTypeID, vehicleID). This means you spend an extra 7 or 8 bytes of storage per referencing row: the composite PK uses 2 + 1 + 8 = 11 bytes (probably 12 bytes due to alignment), whereas with an INT UNSIGNED AUTO_INCREMENT you use only 4 bytes in the referencing table. With INT UNSIGNED you're limited to 2^32 different values for your PK. If you need more than that, you'll need BIGINT UNSIGNED AUTO_INCREMENT, which gives you 2^64 (and I haven't yet found a practical case where this isn't big enough).

Whether this makes sense or not depends a lot on your particular scenario.


joanolo has some good points, and some points I will disagree with...

  • As of 5.6.4, DATETIME and TIMESTAMP, without fractional seconds, each take 5 bytes. (So the PK in question is a total of 9 bytes.)
  • Fragmentation in the data is not that bad. And, if it allows for significant improvement in other actions, it may be worth it. (See below.) A BTree inherently settles down to about 69% full. (A block split turns a 100% full block into two 50% full blocks, then they both gradually refill.)
  • Using a DATETIME (or TIMESTAMP) in a PRIMARY or UNIQUE key is dangerous -- what if two entries happen at exactly the same time? (This question is application dependent; for example, measuring the location of a truck does not need two gps readings within a second.)
  • The link about PKs talks about "fat" PKs. The PK in question is only 9 bytes--not really fat. So the link is only mildly relevant. Furthermore, fatness only applies when you have at least 2 secondary indexes that do not include the fat columns.
  • If the table is threatening to overflow a 4-byte INT, the next choice for AUTO_INCREMENT is an 8-byte BIGINT, which is not much different from 9 bytes.
  • MEDIUMINT is 3 bytes (vehicleID).
  • I'm pretty sure there is no "alignment" of fields in InnoDB structures. InnoDB is designed such that the files are compatible across all hardware architectures.
  • MySQL requires that the PK be unique. If dropping out alarmTypeID removes uniqueness, do not do it!
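The "about 69% full" figure in the fragmentation bullet above can be sanity-checked with a small simulation. This is only a sketch of leaf pages under random inserts with 50/50 splits, not a model of InnoDB's actual page format; the page capacity, row count, and function name are assumptions.

```python
import bisect
import random

PAGE_CAPACITY = 100  # hypothetical number of rows that fit in one leaf page

def average_fill(num_rows, seed=0):
    """Insert random keys into leaf pages, splitting overfull pages in half.

    Returns the fraction of total page capacity that ends up occupied.
    """
    rng = random.Random(seed)
    pages = [[]]                   # each page is a sorted list of keys
    first_keys = [float("-inf")]   # lowest key routed to each page
    for _ in range(num_rows):
        key = rng.random()
        i = bisect.bisect_right(first_keys, key) - 1  # page that owns this key
        page = pages[i]
        bisect.insort(page, key)
        if len(page) > PAGE_CAPACITY:
            # Block split: move the upper half into a new page to the right
            mid = len(page) // 2
            right = page[mid:]
            del page[mid:]
            pages.insert(i + 1, right)
            first_keys.insert(i + 1, right[0])
    return num_rows / (len(pages) * PAGE_CAPACITY)

fill = average_fill(100_000)
print(f"average fill: {fill:.2f}")
```

The result should come out in the neighborhood of ln 2 (roughly 0.69), the classic steady-state estimate for B-trees under random inserts, though the exact value varies with the seed and row count.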

Specifics...

ADD PRIMARY KEY (`alarmTypeID`,`vehicleID`,`gDateTime`), -- 1+3+5 = 9, counted as 0 extra bytes
ADD KEY `gDateTime` (`gDateTime`),                       -- 5 + 1+3 = 9
ADD KEY `fleetID` (`fleetID`,`vehicleID`,`gDateTime`);   -- 2+3+5 + 1 = 11

I say 0 bytes for the PK because it is included with the rest of the columns. The numbers for secondary keys are the sizes of the secondary key columns + extra PK columns. (There is, of course, significant overhead in an index, so these numbers can't be used to compute the ultimate size of the BTree. You might need a fudge-factor of 3x.)
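The per-entry arithmetic above can be written out as a short Python sketch. The column sizes are the ones quoted in this answer (fleetID's 2 bytes is taken from the comment on the fleetID key); the helper name `secondary_entry_bytes` is made up for the illustration, and the result ignores all index overhead.

```python
# Column sizes in bytes, as quoted in the answer above
SIZES = {"alarmTypeID": 1, "vehicleID": 3, "gDateTime": 5, "fleetID": 2}
PRIMARY_KEY = ("alarmTypeID", "vehicleID", "gDateTime")

def secondary_entry_bytes(index_columns):
    """Column bytes per secondary-index entry: the index columns plus
    any PK columns that are not already part of the index."""
    extra_pk = [c for c in PRIMARY_KEY if c not in index_columns]
    return sum(SIZES[c] for c in list(index_columns) + extra_pk)

by_time = secondary_entry_bytes(("gDateTime",))
by_fleet = secondary_entry_bytes(("fleetID", "vehicleID", "gDateTime"))
print(by_time, by_fleet)  # 9 11
```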

A SELECT with

WHERE alarmTypeID = constant
  AND vehicleID = constant
  AND gDateTime ... (some range)

is much better handled by (alarmTypeID,vehicleID,gDateTime) than by (gDateTime, alarmTypeID, vehicleID). If this is a common query, I contend that it outweighs the desire to avoid fragmentation.

PRIMARY KEY(alarmTypeID,vehicleID,gDateTime) avoids bouncing between the secondary key and the data.

PRIMARY KEY(gDateTime, alarmTypeID, vehicleID) cannot use alarmTypeID or vehicleID to narrow the search, so it would have to step over all the alarms and vehicles in the time range that are not of interest. The alternative is a secondary key, which leads to bouncing back and forth between the index and the data. In either case, it is much slower. (Rule of Thumb: 10x slower for spinning disks when the data is not cached.)
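To make the difference concrete, here is a minimal Python sketch that treats each candidate clustered index as a sorted list and counts how many rows a range scan would touch for the WHERE clause shown earlier. The row data, ID ranges, and query constants are all hypothetical, and real InnoDB costs depend on pages and caching rather than row counts.

```python
import bisect
import random

random.seed(1)
# Hypothetical rows: (gDateTime, alarmTypeID, vehicleID), one per time tick
rows = [(t, random.randrange(5), random.randrange(50)) for t in range(10_000)]

# Hypothetical query constants: one alarm type, one vehicle, T_LO <= gDateTime < T_HI
ALARM, VEHICLE, T_LO, T_HI = 2, 7, 2_000, 6_000

# Clustered on (alarmTypeID, vehicleID, gDateTime): matching rows are contiguous
by_alarm = sorted((a, v, t) for (t, a, v) in rows)
lo = bisect.bisect_left(by_alarm, (ALARM, VEHICLE, T_LO))
hi = bisect.bisect_left(by_alarm, (ALARM, VEHICLE, T_HI))
touched_alarm_first = hi - lo

# Clustered on (gDateTime, alarmTypeID, vehicleID): only the time range narrows the scan
by_time = sorted(rows)
lo_t = bisect.bisect_left(by_time, (T_LO,))
hi_t = bisect.bisect_left(by_time, (T_HI,))
touched_time_first = hi_t - lo_t

# Rows in the time range that actually satisfy the alarm and vehicle conditions
matches = sum(1 for (t, a, v) in by_time[lo_t:hi_t] if a == ALARM and v == VEHICLE)

print(touched_alarm_first, touched_time_first, matches)
```

With the alarm-leading order the scan touches only the matching rows; with the time-leading order it touches every row in the time range and discards most of them.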