How do MySQL indexes work?

Basically an index is a map of all your keys that is sorted in order. With a list in order, then instead of checking every key, it can do something like this:

1: Go to middle of list - is higher or lower than what I'm looking for?

2: If higher, go to halfway point between middle and bottom, if lower, middle and top

3: Is higher or lower? Jump to middle point again, etc.

Using that logic, you can find an element in a sorted list in about 7 steps, instead of checking every item.

Obviously there are complexities, but that gives you the basic idea.


Basically an index on a table works like an index in a book (that's where the name came from):

Let's say you have a book about databases and you want to find some information about, say, storage. Without an index (assuming no other aid, such as a table of contents) you'd have to go through the pages one by one, until you found the topic (that's a full table scan). On the other hand, an index has a list of keywords, so you'd consult the index and see that storage is mentioned on pages 113-120,231 and 354. Then you could flip to those pages directly, without searching (that's a search with an index, somewhat faster).

Of course, how useful the index will be, depends on many things - a few examples, using the simile above:

  • if you had a book on databases and indexed the word "database", you'd see that it's mentioned on pages 1-59,61-290, and 292 to 400. In such case, the index is not much help and it might be faster to go through the pages one by one (in a database, this is "poor selectivity").
  • For a 10-page book, it makes no sense to make an index, as you may end up with a 10-page book prefixed by a 5-page index, which is just silly - just scan the 10 pages and be done with it.
  • The index also needs to be useful - there's generally no point to index e.g. the frequency of the letter "L" per page.

The first thing you must know is that indexes are a way to avoid scanning the full table to obtain the result that you're looking for.

There are different kinds of indexes and they're implemented in the storage layer, so there's no standard between them and they also depend on the storage engine that you're using.

InnoDB and the B+Tree index

For InnoDB, the most common index type is the B+Tree based index, that stores the elements in a sorted order. Also, you don't have to access the real table to get the indexed values, which makes your query return way faster.

The "problem" about this index type is that you have to query for the leftmost value to use the index. So, if your index has two columns, say last_name and first_name, the order that you query these fields matters a lot.

So, given the following table:

CREATE TABLE person (
    last_name VARCHAR(50) NOT NULL,
    first_name VARCHAR(50) NOT NULL,
    INDEX (last_name, first_name)
);

This query would take advantage of the index:

SELECT last_name, first_name FROM person
WHERE last_name = "John" AND first_name LIKE "J%"

But the following one would not

SELECT last_name, first_name FROM person WHERE first_name = "Constantine"

Because you're querying the first_name column first and it's not the leftmost column in the index.

This last example is even worse:

SELECT last_name, first_name FROM person WHERE first_name LIKE "%Constantine"

Because now, you're comparing the rightmost part of the rightmost field in the index.

The hash index

This is a different index type that unfortunately, only the memory backend supports. It's lightning fast but only useful for full lookups, which means that you can't use it for operations like >, < or LIKE.

Since it only works for the memory backend, you probably won't use it very often. The main case I can think of right now is the one that you create a temporary table in the memory with a set of results from another select and perform a lot of other selects in this temporary table using hash indexes.

If you have a big VARCHAR field, you can "emulate" the use of a hash index when using a B-Tree, by creating another column and saving a hash of the big value on it. Let's say you're storing a url in a field and the values are quite big. You could also create an integer field called url_hash and use a hash function like CRC32 or any other hash function to hash the url when inserting it. And then, when you need to query for this value, you can do something like this:

SELECT url FROM url_table WHERE url_hash=CRC32("http://gnu.org");

The problem with the above example is that since the CRC32 function generates a quite small hash, you'll end up with a lot of collisions in the hashed values. If you need exact values, you can fix this problem by doing the following:

SELECT url FROM url_table 
WHERE url_hash=CRC32("http://gnu.org") AND url="http://gnu.org";

It's still worth to hash things even if the collision number is high cause you'll only perform the second comparison (the string one) against the repeated hashes.

Unfortunately, using this technique, you still need to hit the table to compare the url field.

Wrap up

Some facts that you may consider every time you want to talk about optimization:

  1. Integer comparison is way faster than string comparison. It can be illustrated with the example about the emulation of the hash index in InnoDB.

  2. Maybe, adding additional steps in a process makes it faster, not slower. It can be illustrated by the fact that you can optimize a SELECT by splitting it into two steps, making the first one store values in a newly created in-memory table, and then execute the heavier queries on this second table.

MySQL has other indexes too, but I think the B+Tree one is the most used ever and the hash one is a good thing to know, but you can find the other ones in the MySQL documentation.

I highly recommend you to read the "High Performance MySQL" book, the answer above was definitely based on its chapter about indexes.

Tags:

Mysql

Indexing