Why is InnoDB table size much larger than expected?

Here is some more info you might find useful.

InnoDB allocates data in terms of 16KB pages, so 'SHOW TABLE STATUS' will give inflated numbers for row size if you only have a few rows and the table is < 16K total. (For example, with 4 rows the average row size comes back as 4096.)
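
For example, on a nearly empty copy of the table created below, 'SHOW TABLE STATUS' makes the arithmetic visible (the output values sketched here are for the 4-row, single-page case):

SHOW TABLE STATUS LIKE 'mytest2';
-- Data_length:    16384  (one 16KB page)
-- Rows:           4
-- Avg_row_length: 4096   (= 16384 / 4; the page size dominates, not the real row size)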

The extra 6 bytes per row for the "invisible" primary key is a crucial point when space is a big consideration. If your table is only one column, that's the ideal column to make the primary key, assuming the values in it are unique:

CREATE TABLE `mytest2`
       (`num1` int(10) unsigned NOT NULL PRIMARY KEY)
ENGINE=InnoDB DEFAULT CHARSET=latin1;

Using a PRIMARY KEY like this means:

  1. No INDEX or KEY clause is needed, because you don't have a secondary index. The index-organized format of InnoDB tables gives you fast lookup based on the primary key value for free.
  2. You don't wind up with another copy of the NUM1 column data, which is what happens when that column is indexed explicitly (see the sketch after this list).
  3. You don't wind up with another copy of the 6-byte invisible primary key values. The primary key values are duplicated in each secondary index. (That's also the reason why you probably don't want 10 indexes on a table with 10 columns, and you probably don't want a primary key that combines several different columns or is a long string column.)
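
For contrast with point 2, here is a sketch of the same table with the column indexed explicitly instead of made the primary key (the table and index names are made up for illustration):

CREATE TABLE `mytest3`
       (`num1` int(10) unsigned NOT NULL,
        KEY `idx_num1` (`num1`))
ENGINE=InnoDB DEFAULT CHARSET=latin1;
-- num1 is now stored twice: once in the clustered index (ordered by the
-- invisible 6-byte primary key) and once in idx_num1, each of whose
-- entries also carries a copy of that invisible key.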

So overall, sticking with just a primary key means less data associated with the table + indexes. To get a sense of overall data size, I like to run with

set global innodb_file_per_table = 1;

and examine the size of the data/database/*table*.ibd files. Each .ibd file contains the data for an InnoDB table and all its associated indexes. (Note that the setting only applies to tables created after it is enabled.)
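
If you'd rather check from inside MySQL than on the filesystem, the same numbers are available from information_schema (the schema name 'test' here is a placeholder):

SELECT table_name,
       data_length,    -- the clustered index, i.e. the table itself, in bytes
       index_length    -- all secondary indexes, in bytes
FROM information_schema.TABLES
WHERE table_schema = 'test' AND table_name = 'mytest2';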

To quickly build up a big table for testing, I usually run a statement like so:

insert into mytest
select * from mytest;

This doubles the amount of data each time. For the single-column table with a primary key, the values had to stay unique, so I used a variation to keep them from colliding with each other:

insert into mytest2
select num1 + (select count(*) from mytest2) from mytest2;
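
Seeded with a single row, the progression looks like this:

insert into mytest2 values (1);
-- table holds: 1
insert into mytest2
select num1 + (select count(*) from mytest2) from mytest2;
-- count(*) was 1, so this inserts 2; table holds: 1, 2
insert into mytest2
select num1 + (select count(*) from mytest2) from mytest2;
-- count(*) was 2, so this inserts 3 and 4; table holds: 1, 2, 3, 4
-- ...and so on, doubling each time while keeping num1 unique.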

This way, I was able to get the average row size down to 25 bytes. The space overhead comes from the underlying assumption that you want fast lookup of individual rows through a pointer-style mechanism, and most tables have a column whose values serve as those pointers (i.e. the primary key) in addition to the columns of real data that gets summed, averaged, and displayed.
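
A rough accounting of where those 25 bytes go, assuming the COMPACT row format (the exact numbers vary by version and row format):

   4 bytes  num1 (INT)
 + 5 bytes  record header
 + 6 bytes  DB_TRX_ID (transaction ID, stored in every clustered-index row)
 + 7 bytes  DB_ROLL_PTR (rollback pointer, likewise)
 = 22 bytes per row, plus the page directory, non-leaf pages, and unused
   space within each 16KB page, which pushes the average up to roughly 25.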


InnoDB tables are clustered, which means that all data is contained in a B-Tree, with the PRIMARY KEY as the key and all other columns as the payload.

Since you don't define an explicit PRIMARY KEY, InnoDB uses a hidden 6-byte column to sort the records on.

This, plus the overhead of the B-Tree organization (with extra non-leaf-level blocks), requires more space than sizeof(int) * num_rows.
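
As a back-of-envelope illustration, for a million INT rows:

   naive estimate:   4 bytes * 1,000,000 rows ≈  4 MB
   with overhead:  ~25 bytes * 1,000,000 rows ≈ 25 MB   (using the per-row figure measured in the answer above)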


In addition to Quassnoi's very fine answer, you should probably try it out using a significant data set.

What I'd do is load 1M rows of simulated production data in, then measure the table size and use that as a guide.
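
One quick way to get there is the doubling trick from the first answer; twenty rounds starting from a single row already exceeds a million (2^20 = 1,048,576), though a real test should use representative column values:

insert into mytest select * from mytest;  -- repeat ~20 times
SHOW TABLE STATUS LIKE 'mytest';          -- then check Data_length and Avg_row_length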

That's what I've done in the past, anyway.
