How to optimize very slow SELECT with LEFT JOINs over big tables

Pick a few attributes to include as real columns in person. Index them in a few combinations -- use composite indexes, not single-column indexes.

That is essentially the only way out of EAV-sucks-at-performance, which is where you are.
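
For instance, a minimal sketch assuming the four attributes from the question are promoted to real columns on person (the column types here are guesses; pick whatever fits your data, and choose index combinations to match the WHERE clauses you actually run):

ALTER TABLE person
  ADD COLUMN bornyear SMALLINT UNSIGNED,
  ADD COLUMN gender TINYINT UNSIGNED,
  ADD COLUMN eyecolor TINYINT UNSIGNED,
  ADD COLUMN location INT UNSIGNED,
  ADD INDEX gender_bornyear (gender, bornyear),
  ADD INDEX bornyear_eyecolor (bornyear, eyecolor);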

Here is more discussion: http://mysql.rjweb.org/doc.php/eav including a suggestion of using JSON instead of the key-value table.


I hope I have found a sufficient solution; it's inspired by this article.

Short answer:

  1. I've created one table with all the attributes: one column per attribute, plus a primary key column.
  2. Attribute values are stored in text cells (for full-text searching) as space-separated lists.
  3. Created full-text indexes. Before that, it's important to set ft_min_word_len=1 (for MyISAM) and innodb_ft_min_token_size=1 (for InnoDB) in the [mysqld] section of my.cnf and restart the MySQL service -- otherwise values shorter than the default minimum token length (4 for MyISAM, 3 for InnoDB) would not be indexed at all. See the snippet after this list.
  4. Searching example: SELECT * FROM person_attribute_ft WHERE MATCH(attr_1) AGAINST("123 456 789" IN BOOLEAN MODE) LIMIT 1000, where 123, 456 and 789 are IDs that matching persons should have in attr_1. This query took under 1 second.
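
A minimal my.cnf sketch for point 3 above (set both, then restart MySQL; the settings must be in place before the full-text indexes are created):

[mysqld]
ft_min_word_len = 1
innodb_ft_min_token_size = 1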

Detailed answer:

Step 1. Creating the table with full-text indexes. InnoDB has supported full-text indexes since MySQL 5.6, so if you use 5.5, you should use MyISAM. MyISAM is sometimes even faster for full-text searching than InnoDB.

CREATE TABLE `person_attribute_ft` (
  `person_id` int(11) NOT NULL,
  `attr_1` text,
  `attr_2` text,
  `attr_3` text,
  `attr_4` text,
  PRIMARY KEY (`person_id`),
  FULLTEXT KEY `attr_1` (`attr_1`),
  FULLTEXT KEY `attr_2` (`attr_2`),
  FULLTEXT KEY `attr_3` (`attr_3`),
  FULLTEXT KEY `attr_4` (`attr_4`),
  FULLTEXT KEY `attr_12` (`attr_1`,`attr_2`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
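
Note: if you lower ft_min_word_len after this MyISAM table already holds data, its full-text indexes must be rebuilt so the one-character tokens get indexed, e.g.:

REPAIR TABLE person_attribute_ft QUICK;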

Step 2. Insert data from the EAV (entity-attribute-value) table. For the example stated in the question, it can be done with one simple SQL statement:

INSERT IGNORE INTO `person_attribute_ft`
SELECT
    p.person_id,
    (SELECT GROUP_CONCAT(a.attribute_value SEPARATOR ' ') FROM attribute a WHERE a.attribute_type_id = 1 AND a.person_id = p.person_id LIMIT 10) attr_1,
    (SELECT GROUP_CONCAT(a.attribute_value SEPARATOR ' ') FROM attribute a WHERE a.attribute_type_id = 2 AND a.person_id = p.person_id LIMIT 10) attr_2,
    (SELECT GROUP_CONCAT(a.attribute_value SEPARATOR ' ') FROM attribute a WHERE a.attribute_type_id = 3 AND a.person_id = p.person_id LIMIT 10) attr_3,
    (SELECT GROUP_CONCAT(a.attribute_value SEPARATOR ' ') FROM attribute a WHERE a.attribute_type_id = 4 AND a.person_id = p.person_id LIMIT 10) attr_4
FROM person p;

Result should be something like this:

mysql> select * from person_attribute_ft limit 10;
+-----------+--------+--------+--------+--------+
| person_id | attr_1 | attr_2 | attr_3 | attr_4 |
+-----------+--------+--------+--------+--------+
|         1 | 541    | 2      | 1927   | 3      |
|         2 | 2862   | 2      | 1939   | 4      |
|         3 | 6573   | 2      | 1904   | 2      |
|         4 | 2432   | 1      | 2005   | 2      |
|         5 | 2208   | 1      | 1995   | 4      |
|         6 | 8388   | 2      | 1973   | 1      |
|         7 | 107    | 2      | 1909   | 4      |
|         8 | 5161   | 1      | 2005   | 1      |
|         9 | 8022   | 2      | 1953   | 4      |
|        10 | 4801   | 2      | 1900   | 3      |
+-----------+--------+--------+--------+--------+
10 rows in set (0.00 sec)

Step 3. Select from the table with a query like this:

mysql> SELECT SQL_NO_CACHE *
    -> FROM `person_attribute_ft`
    -> WHERE 1 AND MATCH(attr_1) AGAINST ("3000 3001 3002 3003 3004 3005 3006 3007" IN BOOLEAN MODE)
    -> AND MATCH(attr_2) AGAINST ("1" IN BOOLEAN MODE)
    -> AND MATCH(attr_3) AGAINST ("1980 1981 1982 1983 1984" IN BOOLEAN MODE)
    -> AND MATCH(attr_4) AGAINST ("2,3" IN BOOLEAN MODE)
    -> LIMIT 10000;
+-----------+--------+--------+--------+--------+
| person_id | attr_1 | attr_2 | attr_3 | attr_4 |
+-----------+--------+--------+--------+--------+
|     12131 | 3002   | 1      | 1982   | 2      |
|     51315 | 3007   | 1      | 1984   | 2      |
|    147283 | 3001   | 1      | 1984   | 2      |
|    350086 | 3005   | 1      | 1982   | 3      |
|    423907 | 3004   | 1      | 1982   | 3      |
... many rows ...
|   9423907 | 3004   | 1      | 1982   | 3      |
|   9461892 | 3007   | 1      | 1982   | 2      |
|   9516361 | 3006   | 1      | 1980   | 2      |
|   9813933 | 3005   | 1      | 1982   | 2      |
|   9986892 | 3003   | 1      | 1981   | 2      |
+-----------+--------+--------+--------+--------+
90 rows in set (0.17 sec)

The query selects all the rows:

  • matching at least one of these IDs in attr_1: 3000, 3001, 3002, 3003, 3004, 3005, 3006 or 3007
  • AND at the same time matching 1 in attr_2 (this column represents gender; if this solution were customized, it should be a smallint with a simple index, etc.)
  • AND at the same time matching at least one of 1980, 1981, 1982, 1983 or 1984 in attr_3
  • AND at the same time matching 2 or 3 in attr_4

Conclusion:

I know this solution is not perfect or ideal for many situations, but it can be used as a good alternative to the EAV table design.

I hope it'll help someone.


Add indexes to attribute for:

  • (person_id, attribute_type_id, attribute_value) and
  • (attribute_type_id, attribute_value, person_id)
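
A sketch of the corresponding DDL, using the index names that appear in the EXPLAIN output below (this assumes attribute_value is short enough to index in full; a long text column would need a prefix length):

ALTER TABLE attribute
  ADD INDEX person_type_value (person_id, attribute_type_id, attribute_value),
  ADD INDEX type_value_person (attribute_type_id, attribute_value, person_id);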

Explanation

With your current design, EXPLAIN expects your query to examine 1,265,229 * 4 * 4 * 4 = 80,974,656 rows in attribute. You can reduce this number by adding a composite index on attribute for (person_id, attribute_type_id). Using this index, your query will examine only 1 row instead of 4 for each of location, eyecolor and gender.

You could extend that index to include attribute_value as well: (person_id, attribute_type_id, attribute_value). This would turn it into a covering index for this query, which should improve performance further.

Furthermore, adding an index on (attribute_type_id, attribute_value, person_id) (again a covering index, by including person_id) should improve performance over an index on attribute_value alone, where more rows would have to be examined. This index speeds up the first step in your EXPLAIN: selecting a range from bornyear.

Using those two indexes lowered the execution time of your query on my system from ~2.0 s to ~0.2 s, with the EXPLAIN output looking like this:

+----+-------------+----------+--------+-------------------------------------+-------------------+---------+--------------------------------+---------+----------+--------------------------+
| id | select_type | table    | type   | possible_keys                       | key               | key_len | ref                            |    rows | filtered | Extra                    |
+----+-------------+----------+--------+-------------------------------------+-------------------+---------+--------------------------------+---------+----------+--------------------------+
|  1 | SIMPLE      | bornyear | range  | person_type_value,type_value_person | type_value_person |       9 |                                | 1861881 |   100.00 | Using where; Using index |
|  1 | SIMPLE      | location | ref    | person_type_value,type_value_person | person_type_value |       8 | bornyear.person_id,const       |       1 |   100.00 | Using where; Using index |
|  1 | SIMPLE      | eyecolor | ref    | person_type_value,type_value_person | person_type_value |       8 | bornyear.person_id,const       |       1 |   100.00 | Using where; Using index |
|  1 | SIMPLE      | gender   | ref    | person_type_value,type_value_person | person_type_value |      13 | bornyear.person_id,const,const |       1 |   100.00 | Using index              |
|  1 | SIMPLE      | person   | eq_ref | PRIMARY                             | PRIMARY           |       4 | bornyear.person_id             |       1 |   100.00 | Using index              |
+----+-------------+----------+--------+-------------------------------------+-------------------+---------+--------------------------------+---------+----------+--------------------------+