Monday morning mistake: sudo rm -rf --no-preserve-root /

Solution 1:

The fact is, at this point there's no simple, automatic fix for this. Data recovery is a science, and even the basic, common tools need someone to sit down and verify that the data is there. If you're expecting to recover from this without massive amounts of downtime, you're going to be disappointed.

I'd suggest using testdisk or some file-system-specific recovery tool. Try it on one system, see if it works, then move on to the next. There's no real way to automate the process, but you can probably do it carefully in batches.

That said, there are a few very scary things in the question and comments that ought to be part of your after-action report.

Firstly, you ran the command everywhere without checking it first. Run a command on one box. Then a few, then more. Basically, if something goes wrong, it's better to have it affect a few of your systems rather than all of them.
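A rough sketch of that kind of staged rollout, in plain sh. The function name run_staged, the RUNNER variable, the hosts file and the 1, 5, 25 batch sizes are all made up for illustration. The only real point is that a few hosts go first and everything stops at the first failure.

```shell
#!/bin/sh
# Staged rollout sketch: run a command on 1 host, then 5, then 25, ...
# and stop at the first failure. All names here are hypothetical.

# RUNNER is the program used to reach a host; it is called as:
#   $RUNNER HOST COMMAND
RUNNER="${RUNNER:-ssh}"

# run_staged HOSTS_FILE COMMAND
# HOSTS_FILE holds one hostname per line.
run_staged() {
    hosts_file=$1
    cmd=$2
    batch=1      # size of the current batch
    in_batch=0   # hosts completed in the current batch

    while IFS= read -r host; do
        [ -n "$host" ] || continue   # skip empty lines
        # </dev/null keeps ssh from swallowing the rest of the hosts file
        if ! $RUNNER "$host" "$cmd" </dev/null; then
            echo "FAILED on $host, stopping rollout" >&2
            return 1
        fi
        in_batch=$((in_batch + 1))
        if [ "$in_batch" -ge "$batch" ]; then
            echo "batch of $batch done"
            batch=$((batch * 5))
            in_batch=0
        fi
    done < "$hosts_file"
}
```

Something like `run_staged hosts.txt 'uptime'` would then touch one machine, then five, and so on. In real life you'd probably also want a pause and a human look at the first box before the next batch starts.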

Secondly, this comment:

"@Tim how to do a backup without mounting a remote drive on the server?"

scares me. File-level, one-way backups are a solved problem. rsync can preserve permissions and copy files one way to a backup site. Accidentally something? Reinstall (preferably automatically), rsync the data back, and things work. In future, you might use file-system-level snapshots with btrfs or zfs and ship those off for system-level backups. I'd also toy with separating application servers, databases and storage, and introducing the principle of least privilege, so that you split up the risk of something like this.
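As a minimal sketch of the one-way idea: let the backup host pull from the server over ssh, so the server never has the backup storage mounted and a stray rm on the server can't reach it. The function name pull_backup, the host name, the paths and the DRY_RUN switch are assumptions for illustration, and it assumes the backup host is allowed to ssh in as root. Note there is deliberately no --delete here. Each run lands in its own dated directory, with unchanged files hard-linked against the previous run via --link-dest.

```shell
#!/bin/sh
# One-way "pull" backup sketch, meant to run ON THE BACKUP HOST.
# Host names and paths are hypothetical; adjust before use.

# pull_backup SRC_HOST DEST_ROOT
# Creates DEST_ROOT/SRC_HOST/YYYY-MM-DD and points "latest" at it.
pull_backup() {
    src_host=$1
    dest_root=$2          # should be an absolute path
    today=$(date +%Y-%m-%d)
    dest="$dest_root/$src_host/$today"
    latest="$dest_root/$src_host/latest"

    # -a keeps permissions, owners, times and symlinks;
    # --one-file-system skips /proc, /sys and other mounts;
    # --link-dest hard-links files unchanged since the last snapshot.
    set -- rsync -a --numeric-ids --one-file-system \
        --link-dest="$latest" "root@$src_host:/" "$dest/"

    if [ -n "${DRY_RUN:-}" ]; then
        echo "$@"         # only show what would run
        return 0
    fi

    # ln -sfn needs GNU or BSD ln; it repoints "latest" at today's snapshot
    mkdir -p "$dest" && "$@" && ln -sfn "$today" "$latest"
}
```

Running `DRY_RUN=1 pull_backup web1 /srv/backups` just prints the rsync command, which is a cheap way to check it before pointing it at real data.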

"I know there isn't anything I can do. I now need to think how to protect myself"

After something has happened is the worst time to consider this.

What can we learn from this?

  1. Backups save data. Possibly careers.
  2. If you have a tool and aren't aware of what it can do, it's dangerous. A jedi can do amazing things with a lightsaber. A roomful of chimps with lightsabers... would get messy.
  3. Never run a command everywhere at once. Separate out test and production machines, and preferably do production machines in stages. It's better to fix 1 or 10 machines rather than 100 or 1000.
  4. Double and triple check commands. There's no shame in asking a coworker to double check: "hey, I'm about to dd a drive, could you sanity check this so I don't end up wiping the wrong one?". A wrapper might help as well, but nothing beats a less tired set of eyes.
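For the wrapper mentioned in point 4, here's one possible shape, purely as a sketch. confirm_run is a made-up name. It prints the machine you're on and the exact command, and only runs it if you type the machine's name back. It won't save you from yourself, but it forces a pause and a second read.

```shell
#!/bin/sh
# confirm_run: hypothetical wrapper that shows where and what you are
# about to run, and proceeds only if you type this machine's name back.

confirm_run() {
    this_host=$(uname -n)
    printf 'Host:    %s\nCommand: %s\n' "$this_host" "$*" >&2
    printf 'Type the hostname to proceed: ' >&2
    # treat EOF or an unreadable stdin as "no"
    IFS= read -r answer || answer=""
    if [ "$answer" != "$this_host" ]; then
        echo "Aborted." >&2
        return 1
    fi
    "$@"
}
```

You'd call it as `confirm_run dd if=... of=...`, with the real arguments filled in. Typing the hostname rather than "y" is the design choice here: it's much harder to do on autopilot.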

What can you do now? Get an email out to customers. Let them know there's downtime and that there have been catastrophic failures. Talk to your higher-ups, legal, sales and such, and see how you can mitigate the damage. Start planning for recovery: at best, you're going to have to hire extra hands; at worst, plan on spending a lot of money on recovery. At this stage, you're going to be working on mitigating the fallout as well as on technical fixes.

Solution 2:

Boot into the rescue system provided by Hetzner and check what damage you have done.
Transfer out any files to a safe location and redeploy the server afterwards.

I'm afraid that is the best solution in your case.


Solution 3:

When you delete stuff with rm -rf --no-preserve-root, it's nigh impossible to recover. It's very likely you've lost all the important files.

As @faker said in his answer, the best course of action is to transfer the files to a safe location and redeploy the server afterwards.

To avoid similar situations in future, I'd suggest you:

  • Take backups weekly, or at least fortnightly. This would help you in getting the affected service back up with the least possible MTTR.

  • Don't work as root when not needed. And always think twice before doing anything. I'd suggest you also install safe-rm.

  • Don't type options that you don't intend to invoke, such as --no-preserve-root or --permission-to-kill-kittens-explicitly-granted, for that matter.


Solution 4:

I've had the same issue, though only while testing with a hard drive, and I lost everything. I don't know if it'll be useful, but: don't install anything and don't overwrite your data. You need to mount your hard drives and launch some forensics tools such as Autopsy, PhotoRec or TestDisk.

I strongly recommend TestDisk; with a few basic commands you can recover your data, as long as you haven't overwritten it.


Solution 5:

The best way to fix a problem like this is to not have it in the first place.

Do not manually enter an "rm -rf" command that has a slash in the argument list. (Putting such commands in a shell script with really good validation/sanity routines to protect you from doing something stupid is different.)

Just don't do it.
Ever. If you think you need to do it, you aren't thinking hard enough.

Instead, change your working directory to the parent of the directory from which you intend to start the removal, so that the target of the rm command does not require a slash:

cd /mnt

sudo rm -rf hetznerbackup
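If you'd rather have the shell enforce that rule than rely on discipline, a small guard along these lines might do. rm_here is a hypothetical name, not an existing tool. It checks every argument first and refuses the whole call if any target is empty, is "." or "..", or contains a slash.

```shell
#!/bin/sh
# rm_here: hypothetical guard around "rm -rf" that only accepts plain
# names in the current directory. Nothing is deleted unless every
# argument passes the check.

rm_here() {
    if [ "$#" -eq 0 ]; then
        echo "rm_here: nothing to remove" >&2
        return 1
    fi
    for target in "$@"; do
        case "$target" in
            ""|.|..|*/*)
                echo "rm_here: refusing '$target'" >&2
                return 1 ;;
        esac
    done
    # "--" stops option parsing, so a name like "-rf" is treated as a file
    rm -rf -- "$@"
}
```

With that, `cd /mnt` followed by `rm_here hetznerbackup` works, while anything with a slash in it is rejected. Keep in mind that sudo doesn't see shell functions, so to use it with sudo you'd have to put the same logic in a script on your PATH.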