EC2 - Hardware Failure

In some cases, Amazon will notice that the underlying hardware is in a degraded state and tell you to move off it (stop and start your instance) by a certain date, or the instance will be stopped automatically.

In other cases, there will be no warning and the instance will simply stop, or it won't enter the stopped state at all and will just become unreachable. It may or may not come back after Amazon takes care of the problem. Sometimes there will be an apology email after the fact.

I have yet to have an EBS volume fail on me (I've had many instances go weird, but never a volume), but you should still plan for it. I don't know what that failure looks like.

Setting a CloudWatch alarm that fires when the reachability status check fails is your best bet.
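
A minimal sketch of such an alarm using boto3; the region, instance ID, SNS topic ARN, and thresholds are placeholders you'd adjust for your setup:

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # adjust region

cloudwatch.put_metric_alarm(
    AlarmName="ec2-reachability-failed",
    Namespace="AWS/EC2",
    MetricName="StatusCheckFailed",          # combined system + instance reachability checks
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=2,                     # two consecutive failed minutes
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # notify an SNS topic
)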


AWS is unlikely to restart your instance. They give you all the tools to monitor and restart instances, so they leave it to you. They may email you if you need to do something. If you stop and then start your instance it will move to new hardware, but a restart (reboot) will not move it to new hardware. Restarts of my Amazon Linux instance typically take a minute or so.
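
If you want to script the stop/start cycle that moves an instance to new hardware, a rough boto3 sketch (the region and instance ID are placeholders) could look like this:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # adjust region
instance_id = "i-0123456789abcdef0"

# Stop, wait for the stopped state, then start again. A stop/start cycle
# (unlike a reboot) normally lands the instance on different hardware.
# Note the public IP will change unless you use an Elastic IP.
ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
ec2.start_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])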

You shouldn't lose data from your EBS disk if EC2 hardware fails, as EBS volumes are stored redundantly within a single availability zone. EBS snapshots are stored in S3, which stores data across at least three availability zones within a region, so they're significantly more robust. Snapshots can be automated to run hourly, daily, weekly, etc., using a variety of tools. The first snapshot is full-sized; subsequent snapshots are incremental and said to use relatively little space. In my experience, snapshots taken close together use little space, but over time they do add up in both size and cost, so I regularly delete snapshots I don't need.
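
As one way to automate this, here is a small boto3 sketch that could run from cron; the region and volume ID are placeholders, and AWS Data Lifecycle Manager or third-party tools can do the same job:

import boto3
from datetime import datetime, timezone

ec2 = boto3.client("ec2", region_name="us-east-1")  # adjust region

# Snapshot a single volume and tag it so old snapshots can be found and pruned later.
timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%SZ")
ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",
    Description=f"automated backup {timestamp}",
    TagSpecifications=[{
        "ResourceType": "snapshot",
        "Tags": [
            {"Key": "Name", "Value": f"backup-{timestamp}"},
            {"Key": "CreatedBy", "Value": "snapshot-cron"},
        ],
    }],
)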

As well as snapshots, you should also take application-level backups using a tool like Borg Backup, Restic, or a commercial product.

You can create an alarm in CloudWatch that reboots your instance if StatusCheckFailed is raised. The documentation with step-by-step instructions is here.
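
A rough boto3 equivalent of those instructions is below; the region and instance ID are placeholders, and the reboot action ARN format is worth checking against the current CloudWatch documentation for your region:

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # adjust region

cloudwatch.put_metric_alarm(
    AlarmName="ec2-reboot-on-status-check-failure",
    Namespace="AWS/EC2",
    MetricName="StatusCheckFailed_Instance",   # instance (not system) reachability check
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=3,                        # three consecutive failed minutes before rebooting
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    # EC2 action that reboots the instance when the alarm fires.
    AlarmActions=["arn:aws:automate:us-east-1:ec2:reboot"],
)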


I just had an EBS volume fail, which brought down both EC2 instances running a supposedly fault-tolerant service on Elastic Beanstalk.

The symptoms were that HTTP GET requests still worked, but POSTs failed. This meant our GET-based health checks didn't detect any problems, but users could not log in because the login process used a POST.

Looking through the logs, we found many messages in /var/log/messages about I/O errors.

EXT4-fs warning (device dm-3): ext4_end_bio:314: I/O error -28 writing to inode 5905292 (offset 3198976 size 4096 starting block 1475341)
Buffer I/O error on device dm-3, logical block 1475341
EXT4-fs warning (device dm-3): ext4_end_bio:314: I/O error -28 writing to inode 5905292 (offset 0 size 0 starting block 1475340)
Buffer I/O error on device dm-3, logical block 1475340
JBD2: Detected IO errors while flushing file data on dm-3-8
EXT4-fs warning (device dm-3): ext4_end_bio:314: I/O error -28 writing to inode 5905292 (offset 0 size 0 starting block 1475341)
Buffer I/O error on device dm-3, logical block 1475341

There were messages in the nginx logs complaining that POST requests failed due to the resulting read-only state of the filesystem.

open() "/var/lib/nginx/body/0000522584" failed (30: Read-only file system)
open() "/var/lib/nginx/body/0000522585" failed (30: Read-only file system)
open() "/var/lib/nginx/body/0000522586" failed (30: Read-only file system)

What seems to have happened is the usual Linux behaviour: a failed disk causes the filesystem to be remounted read-only (commonly via the errors=remount-ro mount option), which then prevented nginx from creating the temporary files it uses to buffer POST data. GETs worked fine because reading from the filesystem was still OK.

Interestingly, because the health checks reported that everything was fine, Elastic Beanstalk didn't terminate and recreate any EC2 instances, even though approximately 35% of requests were failing with HTTP 500 errors.

Lessons learned? Make sure your health check URLs attempt to write to the same filesystem used by the other processes on your EC2 instance, so that a failed disk also causes the health check to fail. Otherwise the problem may not be detected automatically and could require manual intervention.
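
A minimal sketch of such a health check, assuming a Flask application; the HEALTH_CHECK_DIR path and port are placeholders, and the directory should live on the same filesystem your application writes to:

import os
import time
import uuid

from flask import Flask

app = Flask(__name__)

# Directory on the same filesystem the application writes to; adjust for your setup.
HEALTH_CHECK_DIR = os.environ.get("HEALTH_CHECK_DIR", "/var/app/healthcheck")

@app.route("/health")
def health():
    try:
        os.makedirs(HEALTH_CHECK_DIR, exist_ok=True)
        probe = os.path.join(HEALTH_CHECK_DIR, f"probe-{uuid.uuid4().hex}")
        with open(probe, "w") as f:
            f.write(str(time.time()))
            f.flush()
            os.fsync(f.fileno())   # force the write through to the block device
        os.remove(probe)
        return "OK", 200
    except OSError as exc:
        # A read-only or failing filesystem surfaces here, so the load balancer
        # sees an unhealthy instance instead of a passing GET.
        return f"disk write failed: {exc}", 500

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)

Because the check actually writes and fsyncs a file, a filesystem that has been remounted read-only fails the check, and Elastic Beanstalk or your load balancer can replace the instance automatically.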