MySQL on AWS RDS at 100% CPU on some queries

It was the swap! I ended up replicating the database onto the same hardware and wrote some scripts to emulate live traffic against it. I also ran some big queries to help fill up the buffer pool, then checked that my replica's metrics approximately matched those of my production database. When I ran large queries against the replica, it locked up, even with indexes applied. I can now reproduce the issue without taking down the production service, so I can break things as much as I want.

I noticed that running the same large queries earlier in the life of the replica database hadn't caused an issue, so I tracked down the point where the lock-ups begin. It happens almost immediately after the buffer pool grows large enough to push some data (the OS, or otherwise) to swap on the t2.micro instance. Here's an image from CloudWatch of the swap growing after freeable memory drops below ~50MB or so:

Image of swap being utilized as freeable memory drops.

Any large queries (with or without an index) start to lock up the database after the red circle. You can see the aggregate 5-minute CPU usage when I locked up the database for almost a minute performing a DELETE.

With this theory in mind, I tried two solutions:

1) I changed innodb_buffer_pool_size to 375M (instead of the AWS default of 3/4 of the instance's RAM). This caps the size of the buffer pool and ensures that the database's memory footprint can't grow large enough to push the OS (or anything else) into swap. This worked!

2) I tried running the database on a larger instance (2GiB of RAM). This also worked!

Both of these solutions work, and the bonus with (1) is that I don't have to spend any extra money. I'm now tuning innodb_buffer_pool_size so that it is as large as possible without causing swapping. With that change, the same DELETE query ran in 1.2s and the database kept responding. The screenshot below wasn't even possible against the production database, since it would stop responding during these long queries; the dashboard would never update and eventually lost its connection.

Screenshot of the MySQL dashboard while running a DELETE query
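
While tuning, a quick way to compare the configured pool size with how much of it is actually occupied is the following (a minimal sketch; the 16384 multiplier assumes the default 16KB InnoDB page size):

-- Configured buffer pool size, in MB
SELECT @@innodb_buffer_pool_size / 1024 / 1024 AS buffer_pool_mb;

-- How much of the pool currently holds data pages
-- (POOL_SIZE, DATABASE_PAGES and FREE_BUFFERS are counted in pages)
SELECT POOL_SIZE      * 16384 / 1024 / 1024 AS pool_mb,
       DATABASE_PAGES * 16384 / 1024 / 1024 AS data_mb,
       FREE_BUFFERS   * 16384 / 1024 / 1024 AS free_mb
  FROM information_schema.INNODB_BUFFER_POOL_STATS;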


For starters, have better indexes. For the SELECT:

INDEX(ItemID, ItemStatus,   -- in either order
      `Time`)               -- then the range

For the DELETE:

INDEX(ItemStatus, `Time`)   -- in this order

(No, there is no single index that is optimal for both queries.)
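
If it helps, the corresponding DDL would look something like the following; MyTable is a placeholder for the real table name:

ALTER TABLE MyTable
    ADD INDEX idx_item_status_time (ItemID, ItemStatus, `Time`),
    ADD INDEX idx_status_time      (ItemStatus, `Time`);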

This should help with performance (CPU, IOPS, "lock up") and may (or may not) help with other issues, such as the page_cleaners.

About how many rows in the table? (140K now?) In the resultset from that SELECT?

More...

innodb_file_per_table is a disk layout issue, not relevant here. The main tunable is the in-RAM cache: innodb_buffer_pool_size. But with only 1GB of RAM, that cache must be quite small, which would explain why a 190K-row table needs IOPS at all.

Here's the scenario... Queries are only performed in the buffer_pool. That is, to read (or write or delete) a row, the 16KB block it lives in (or will live in) is first brought into the buffer_pool. The table mentioned might be about 50MB. I would guess (without knowing) that the buffer_pool_size is set to about 100MB. The SELECT and DELETE under discussion needed a full table scan, hence 50MB would be pulled into the buffer_pool, pushing out blocks of other table(s) if necessary. And vice versa when something big happens with those tables.
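
To replace that guess with a measurement, information_schema can report the table's actual data and index footprint (schema and table names below are placeholders):

SELECT TABLE_NAME,
       ROUND(DATA_LENGTH  / 1024 / 1024, 1) AS data_mb,
       ROUND(INDEX_LENGTH / 1024 / 1024, 1) AS index_mb
  FROM information_schema.TABLES
 WHERE TABLE_SCHEMA = 'your_db'      -- placeholder
   AND TABLE_NAME   = 'MyTable';     -- placeholder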

Plan A: Add the recommended indexes. There may be other ways to shrink the disk footprint and improve cacheability. For example, BIGINT is 8 bytes; INT is only 4 bytes but is limited to a few billion values. Even MEDIUMINT (3 bytes, a few million) may suffice.
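
For example, if ItemID happens to be a BIGINT that will never exceed a few million distinct values, shrinking it reduces every row and every index entry that contains it (the column and type here are illustrative; verify the actual value range first):

ALTER TABLE MyTable
    MODIFY COLUMN ItemID MEDIUMINT UNSIGNED NOT NULL;  -- 3 bytes instead of 8

Note that changing a column's data type rebuilds the table, so schedule it for a quiet period on a busy instance.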

Plan B: Get a bigger RDS setup.