Possible INDEX on a VARCHAR field in MySql

SUGGESTION #1 : Standard Indexing

CREATE TABLE mytable
(
    id int not null auto_increment,
    myfield varchar(255) not null,
    primary key (id),
    key (myfield)
);

If you index like this, you can either look for the whole string or do left-oriented LIKE searches

SUGGESTION #2 : FULLTEXT Indexing

CREATE TABLE mytable
(
    id int not null auto_increment,
    myfield varchar(255) not null,
    primary key (id),
    fulltext (myfield)
);

You can effectively use searches for individual keywords as well as whole phrases. You will need to define a custom stopword list because MySQL will not index 543 words.

Here are my other posts from the past two years on FULLTEXT indexes

May 23, 2011 : Optimizing mysql fulltext search (StackOverflow)
Oct 25, 2011 : FULLTEXT index ignored in BOOLEAN MODE with 'number of words' conditional
Jan 26, 2012 : Mysql fulltext search my.cnf optimization
May 07, 2012 : MySQL EXPLAIN doesn't show 'use index' for FULLTEXT

SUGGESTION #3 : Hash Indexing

CREATE TABLE mytable
(
    id int not null auto_increment,
    myfield varchar(255) not null,
    hashmyfield char(32) not null,
    primary key (id),
    key (hashmyfield)
);

If you are looking for one specific value and those values could be lengths well beyond 32 characters, you could store the hash value:

INSERT INTO mytable (myfield,hashmyfield)
VALUES ('whatever',MD5('whatever'));

That way, you just search for hash values to retrieve results

SELECT * FROM mytable WHERE hashmyfield = MD5('whatever');

Give it a Try !!!

MySQL enables you to define prefixed index which means you define first N characters from original string to be indexed, and the trick is to choose a number N that’s long enough to give good selectivity, but short enough to save space. The prefix should be long enough to make the index nearly as useful as it would be if you’d indexed the whole column.

Before we go further let us define some important terms. Index selectivity is ratio of the total distinct indexed values and total number of rows. Here is one example for test table:

+-----+-----------+
| id  | value     |
+-----+-----------+
| 1   | abc       |
| 2   | abd       |
| 3   | adg       |
+-----+-----------+

If we index only the first character (N=1), then index table will look like the following table:

+---------------+-----------+
| indexedValue  | rows      |
+---------------+-----------+
| a             | 1,2,3     |
+---------------+-----------+

In this case, index selectivity is equal to IS=1/3 = 0.33.

Let us now see what will happen if we increase number of indexed characters to two (N=2).

+---------------+-----------+
| indexedValue  | rows      |
+---------------+-----------+
| ab             | 1,2      |
| ad             | 3        |
+---------------+-----------+

In this scenario IS=2/3=0.66 which means we increased index selectivity, but we have also increased the size of index. Trick is to find the minimal number N which will result to maximal index selectivity.

There are two approaches you can do calculations for your database table. I will make demonstration on the this database dump.

Let's say we want to add column last_name in table employees to the index, and we want to define the smallest number N which will produce the best index selectivity.

First let us identify the most frequent last names:

select count(*) as cnt, last_name from employees group by employees.last_name order by cnt

+-----+-------------+
| cnt | last_name   |
+-----+-------------+
| 226 | Baba        |
| 223 | Coorg       |
| 223 | Gelosh      |
| 222 | Farris      |
| 222 | Sudbeck     |
| 221 | Adachi      |
| 220 | Osgood      |
| 218 | Neiman      |
| 218 | Mandell     |
| 218 | Masada      |
| 217 | Boudaillier |
| 217 | Wendorf     |
| 216 | Pettis      |
| 216 | Solares     |
| 216 | Mahnke      |
+-----+-------------+
15 rows in set (0.64 sec)

As you can see, the last name Baba is the most frequent one. Now we are going to find the most frequently occurring last_name prefixes, beginning with five-letter prefixes.

+-----+--------+
| cnt | prefix |
+-----+--------+
| 794 | Schaa  |
| 758 | Mande  |
| 711 | Schwa  |
| 562 | Angel  |
| 561 | Gecse  |
| 555 | Delgr  |
| 550 | Berna  |
| 547 | Peter  |
| 543 | Cappe  |
| 539 | Stran  |
| 534 | Canna  |
| 485 | Georg  |
| 417 | Neima  |
| 398 | Petti  |
| 398 | Duclo  |
+-----+--------+
15 rows in set (0.55 sec)

There are much more occurences of every prefix, which means we have to increase number N until the values are almost the same as in the previous example.

Here are results for N=9

select count(*) as cnt, left(last_name,9) as prefix from employees group by prefix order by cnt desc limit 0,15;

+-----+-----------+
| cnt | prefix    |
+-----+-----------+
| 336 | Schwartzb |
| 226 | Baba      |
| 223 | Coorg     |
| 223 | Gelosh    |
| 222 | Sudbeck   |
| 222 | Farris    |
| 221 | Adachi    |
| 220 | Osgood    |
| 218 | Mandell   |
| 218 | Neiman    |
| 218 | Masada    |
| 217 | Wendorf   |
| 217 | Boudailli |
| 216 | Cummings  |
| 216 | Pettis    |
+-----+-----------+

Here are results for N=10.

+-----+------------+
| cnt | prefix     |
+-----+------------+
| 226 | Baba       |
| 223 | Coorg      |
| 223 | Gelosh     |
| 222 | Sudbeck    |
| 222 | Farris     |
| 221 | Adachi     |
| 220 | Osgood     |
| 218 | Mandell    |
| 218 | Neiman     |
| 218 | Masada     |
| 217 | Wendorf    |
| 217 | Boudaillie |
| 216 | Cummings   |
| 216 | Pettis     |
| 216 | Solares    |
+-----+------------+
15 rows in set (0.56 sec)

This are very good results. This means that we can make index on column last_name with indexing only first 10 characters. In table definition column last_name is defined as VARCHAR(16), and this means we have saved 6 bytes (or more if there are UTF8 characters in the last name) per entry. In this table there are 1637 distinct values multiplied by 6 bytes is about 9KB, and imagine how this number would grow if our table contains million of rows.

You can read other ways of calculating number of N in my post Prefixed indexes in MySQL.

Using MD5 and SHA1 functions in order to generate values which should be indexed is also not good approach. Why? Read it in post How to choose right data type for a primary key in MySQL database

Possible INDEX on a VARCHAR field in MySql

SUGGESTION #1 : Standard Indexing

SUGGESTION #2 : FULLTEXT Indexing

SUGGESTION #3 : Hash Indexing

Tags:

Mysql

Varchar

Full Text Search

Index

Related

Recent Posts