How to update 10 million+ rows in a single MySQL table as fast as possible?

Based on my experience, I would use LOAD DATA INFILE to import your CSV file.

The LOAD DATA INFILE statement reads rows from a text file into a table at a very high speed.

Here is an example I found on the internet: Load Data example. I tested it on my box and it worked fine.

Example Table

CREATE TABLE example (
  `Id` int(11) NOT NULL AUTO_INCREMENT,
  `Column2` varchar(14) NOT NULL,
  `Column3` varchar(14) NOT NULL,
  `Column4` varchar(14) NOT NULL,
  `Column5` DATE NOT NULL,
  PRIMARY KEY (`Id`)
) ENGINE=InnoDB

Example CSV File

# more /tmp/example.csv
Column1,Column2,Column3,Column4,Column5
1,A,Foo,sdsdsd,4/13/2013
2,B,Bar,sdsa,4/12/2013
3,C,Foo,wewqe,3/12/2013
4,D,Bar,asdsad,2/1/2013
5,E,FOObar,wewqe,5/1/2013

Import Statement to be run from MySQL console

LOAD DATA LOCAL INFILE '/tmp/example.csv'
    -> INTO TABLE example
    -> FIELDS TERMINATED BY ','
    -> LINES TERMINATED BY '\n'
    -> IGNORE 1 LINES
    -> (Id, @dummy, Column3, Column4, @Column5)
    -> SET
    -> Column5 = STR_TO_DATE(@Column5, '%m/%d/%Y');

Result

MySQL [testcsv]> select * from example;
+----+---------+---------+---------+------------+
| Id | Column2 | Column3 | Column4 | Column5    |
+----+---------+---------+---------+------------+
|  1 |         | Foo     | sdsdsd  | 2013-04-13 |
|  2 |         | Bar     | sdsa    | 2013-04-12 |
|  3 |         | Foo     | wewqe   | 2013-03-12 |
|  4 |         | Bar     | asdsad  | 2013-02-01 |
|  5 |         | FOObar  | wewqe   | 2013-05-01 |
+----+---------+---------+---------+------------+

IGNORE 1 LINES simply skips the first line, which contains the column headers.

After IGNORE 1 LINES, we specify the columns to import, skipping Column2 by reading that CSV field into the throwaway user variable @dummy, which matches one of the criteria in your question.

Here is another example directly from Oracle : LOAD DATA INFILE example

This should be enough to get you started.
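
From there, applying the loaded data to your big table is a single multi-table UPDATE rather than millions of single-row statements. A minimal sketch, assuming the `example` table loaded above shares its `Id` values with a hypothetical target table called `real_table` (not from your question):

-- real_table is a placeholder for your 10M+ row table;
-- `example` is the staging table populated by LOAD DATA INFILE above.
UPDATE real_table r
JOIN example e ON r.Id = e.Id
SET r.Column3 = e.Column3,
    r.Column4 = e.Column4,
    r.Column5 = e.Column5;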


In light of all the things mentioned, it looks like the bottleneck is the join itself.

ASPECT #1 : Join Buffer Size

In all likelihood, your join_buffer_size is too low.

According to the MySQL Documentation on How MySQL Uses the Join Buffer Cache

We only store the used columns in the join buffer, not the whole rows.

This being the case, make sure the join keys can stay entirely in RAM in the join buffer.

You have 10 million rows times 4 bytes for each key; that's about 40M.

Try bumping it up in the session to 42M (a little bigger than 40M):

SET join_buffer_size = 1024 * 1024 * 42;
UPDATE table1 a
JOIN table2 b ON a.field1 = b.field1
SET a.field2 = b.field2,
    a.field3 = b.field3,
    a.field4 = b.field4;

If this does the trick, proceed to add this to my.cnf

[mysqld]
join_buffer_size = 42M

Restarting mysqld is not required; to make the new value visible to new connections, just run

mysql> SET GLOBAL join_buffer_size = 1024 * 1024 * 42;

ASPECT #2 : Join Operation

You could change the style of the join operation by tweaking the optimizer.

According to MySQL Documentation on Block Nested-Loop and Batched Key Access Joins

When BKA is used, the value of join_buffer_size defines how large the batch of keys is in each request to the storage engine. The larger the buffer, the more sequential access will be to the right hand table of a join operation, which can significantly improve performance.

For BKA to be used, the batched_key_access flag of the optimizer_switch system variable must be set to on. BKA uses MRR, so the mrr flag must also be on. Currently, the cost estimation for MRR is too pessimistic. Hence, it is also necessary for mrr_cost_based to be off for BKA to be used.

This same page recommends doing this:

mysql> SET optimizer_switch='mrr=on,mrr_cost_based=off,batched_key_access=on';
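
To check whether BKA is actually being chosen, you can look at the plan for the equivalent SELECT join; a quick sketch, using the same placeholder table and column names as the UPDATE above:

SET optimizer_switch='mrr=on,mrr_cost_based=off,batched_key_access=on';

-- If BKA kicks in, the Extra column should show
-- "Using join buffer (Batched Key Access)".
EXPLAIN
SELECT a.field1
FROM table1 a JOIN table2 b ON a.field1 = b.field1;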

ASPECT #3 : Writing Updates to Disk (OPTIONAL)

Most people forget to increase innodb_write_io_threads so that dirty pages are written out of the buffer pool faster.

[mysqld]
innodb_write_io_threads = 16

You will have to restart MySQL for this change to take effect.
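
After the restart, a quick sanity check (innodb_write_io_threads is read-only at runtime, so it only changes after the restart):

SHOW VARIABLES LIKE 'innodb_write_io_threads';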

GIVE IT A TRY !!!


  1. CREATE TABLE that matches the CSV
  2. LOAD DATA into that table
  3. UPDATE real_table JOIN csv_table ON ... SET ..., ..., ...;
  4. DROP TABLE csv_table;
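
A minimal sketch of those four steps; csv_table, real_table, and the column names are placeholders, and the actual ON/SET lists depend on your schema:

-- 1. Staging table that mirrors the CSV layout (hypothetical columns).
CREATE TABLE csv_table (
  id INT NOT NULL PRIMARY KEY,
  field2 VARCHAR(50) NOT NULL,
  field3 VARCHAR(50) NOT NULL
) ENGINE=InnoDB;

-- 2. Bulk-load the CSV.
LOAD DATA LOCAL INFILE '/tmp/data.csv'
INTO TABLE csv_table
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
IGNORE 1 LINES;

-- 3. One multi-table UPDATE instead of 10 million single-row UPDATEs.
UPDATE real_table r
JOIN csv_table c ON r.id = c.id
SET r.field2 = c.field2,
    r.field3 = c.field3;

-- 4. Clean up.
DROP TABLE csv_table;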

Step 3 will be a lot faster than row-by-row updates, but it will still lock all the rows in the table for a non-trivial amount of time. If that lock time matters more than how long the entire process takes, then...

If nothing else is writing to the table, then...

  1. CREATE TABLE that matches the CSV; no indexes except what is needed for the JOIN in the UPDATE. If the join column is unique, make it the PRIMARY KEY.
  2. LOAD DATA into that table
  3. copy the real_table to new_table (CREATE ... SELECT)
  4. UPDATE new_table JOIN csv_table ON ... SET ..., ..., ...;
  5. RENAME TABLE real_table TO old, new_table TO real_table;
  6. DROP TABLE csv_table, old;

Step 3 is faster than the UPDATE (step 4), especially if unnecessary indexes are left off.
Step 5 is "instantaneous".
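
For this no-other-writers variant, steps 3 through 6 might look like the sketch below, reusing the hypothetical csv_table and real_table names from the sketch above. Note that CREATE ... SELECT copies rows but not indexes, so re-add whatever real_table needs before the swap.

-- 3. Copy the live table off to the side (CREATE ... SELECT).
CREATE TABLE new_table AS SELECT * FROM real_table;

-- 4. Apply the CSV to the copy; real_table is untouched in the meantime.
UPDATE new_table n
JOIN csv_table c ON n.id = c.id
SET n.field2 = c.field2,
    n.field3 = c.field3;

-- Re-create the primary key / indexes that real_table had, e.g.:
ALTER TABLE new_table ADD PRIMARY KEY (id);

-- 5. Swap the tables in one atomic rename, then 6. clean up.
RENAME TABLE real_table TO old, new_table TO real_table;
DROP TABLE csv_table, old;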