Delete 10M+ files from ZFS, effectively

Solution 1:

Deletes in ZFS are expensive. Even more so if you have deduplication enabled on the filesystem (since dereferencing deduped files is expensive). Snapshots could complicate matters too.

You may be better off deleting the /tmp directory instead of the data contained within.

If /tmp is a ZFS filesystem, delete it and create again.

Solution 2:

How is it possible that resilvering the whole array takes an hour, but deleting from the disk takes 4 days?

Consider an office building.

Removing all of the computers and furniture and fixings from all the offices on all the floors takes a long time, but leaves the offices immediately usable by another client.

Demolishing the whole building with RDX is a whole lot quicker, but the next client is quite likely to complain about how drafty the place is.

Solution 3:

There's a number of things going on here.

First, all modern disk technologies are optimised for bulk transfers. If you need to move 100MB of data, they'll do it much faster if they're in one contiguous block instead of scattered all over the place. SSDs help a lot here, but even they prefer data in contiguous blocks.

Second, resilvering is pretty optimal as far as disk operations goes. You read a massive contiguous chunk of data from one disk, do some fast CPU ops on it, then rewrite it in another big contiguous chunk to another disk. If power fails partway through, no big deal - you'll just ignore any data with bad checksums and carry on as per normal.

Third, deleting a file is really slow. ZFS is particularly bad, but practically all filesystems are slow to delete. They must modify a large number of different chunks of data on the disk and time it correctly (i.e. wait) so the filesystem is not damaged if power fails.

How is it possible that resilvering the whole array takes an hour, but deleting from the disk takes 4 days?

Resilvering is something that disks are really fast at, and deletion is something that disks are slow at. Per megabyte of disk, you only have to do a little bit of resilvering. You might have a thousand files in that space which need to be deleted.

70 deletions/second seems very very bad performance

It depends. I would not be surprised by this. You haven't mentioned what type of SSD you're using. Modern Intel and Samsung SSDs are pretty good at this sort of operation (read-modify-write) and will perform better. Cheaper/older SSDs (e.g. Corsair) will be slow. The number of I/O operations per second (IOPS) is the determining factor here.

ZFS is particularly slow to delete things. Normally, it will perform deletions in the background so you don't see the delay. If you're doing a huge number of them it can't hide it and must delay you.

Appendix: why are deletions slow?

  • Deleting a file requires a several steps. The file metadata must be marked as 'deleted', and eventually it must be reclaimed so the space can be reused. ZFS is a 'log structured filesystem' which performs best if you only ever create things, never delete them. The log structure means that if you delete something, there's a gap in the log and so other data must be rearranged (defragmented) to fill the gap. This is invisible to the user but generally slow.
  • The changes must be made in such a way that if power were to fail partway through, the filesystem remains consistent. Often, this means waiting until the disk confirms that data really is on the media; for an SSD, that can take a long time (hundreds of milliseconds). The net effect of this is that there is a lot more bookkeeping (i.e. disk I/O operations).
  • All of the changes are small. Instead of reading, writing and erasing whole flash blocks (or cylinders for a magnetic disk) you need to modify a little bit of one. To do this, the hardware must read in a whole block or cylinder, modify it in memory, then write it out to the media again. This takes a long time.

Solution 4:

Ian Howson gives a good answer on why it is slow.

If you delete files in parallel you may see an increase in speed due to the deletion may use the same blocks and thus can save rewriting the same block many times.

So try:

find /tmp -print0 | parallel -j100 -0 -n100 rm

and see if that performs better than your 70 deletes per second.

Solution 5:

How is it possible that resilvering the whole array takes an hour, but deleting from the disk takes 4 days?

It is possible because the two operations work on different layers of the file system stack. Resilvering can run low-level and does not actually need to look at individual files, copying large chunks of data at a time.

Why do I have so bad performance? 70 deletions/second seems very very bad performance.

It does have to do a lot of bookkeeping...

I could delete the inode for /tmp2 manually, but that will not free up the space, right?

I don't know for ZFS, but if it could automatically recover from that, it would likely, in the end, do the same operations you are already doing, in the background.

Could this be a problem with zfs, or the hard drives or what?

Does zfs scrub say anything?