ZFS performance: do I need to keep free space in a pool or a file system?

Solution 1:

The performance degradation occurs when your zpool is either very full or very fragmented. The reason for this is the mechanism of free block discovery employed with ZFS. Opposed to other file systems like NTFS or ext3, there is no block bitmap showing which blocks are occupied and which are free. Instead, ZFS divides your zvol into (usually 200) larger areas called "metaslabs" and stores AVL-trees1 of free block information (space map) in each metaslab. The balanced AVL tree allows for an efficient search for a block fitting the size of the request.

While this mechanism has been chosen for reasons of scale, unfortunately it also turned out to be a major pain when a high level of fragmentation and/or space utilization occurs. As soon as all metaslabs carry a significant amount of data, you get a large number of small areas of free blocks as opposed to a small numbers of large areas when the pool is empty. If ZFS then needs to allocate 2 MB of space, it starts reading and evaluating all metaslabs' space maps to either find a suitable block or a way to break up the 2 MB into smaller blocks. This of course takes some time. What is worse is the fact that it will cost a whole lot of I/O operations as ZFS would indeed read all space maps off the physical disks. For any of your writes.

The drop in performance might be significant. If you fancy pretty pictures, take a look at the blog post over at Delphix which has some numbers taken off an (oversimplified but yet valid) zfs pool. I am shamelessly stealing one of the graphs - look at the blue, red, yellow, and green lines in this graph which are (respectively) representing pools at 10%, 50%, 75%, and 93% capacity drawn against write throughput in KB/s while becoming fragmented over time: zpool performance degradation

A quick & dirty fix to this has traditionally been the metaslab debugging mode (just issue echo metaslab_debug/W1 | mdb -kw at run-time for instantly changing the setting). In this case, all space maps would be kept in the OS RAM, removing the requirement for excessive and expensive I/O on each write operation. Ultimately, this also means you need more memory, especially for large pools, so it is kind of a RAM for storage horse-trade. Your 10 TB pool probably will cost you 2-4 GB of memory2, but you will be able to drive it to 95% of utilization without much hassle.

1 it is a bit more complicated, if you are interested, look at Bonwick's post on space maps for details

2 if you need a way to calculate an upper limit for the memory, use zdb -mm <pool> to retrieve the number of segments currently in use in each metaslab, divide it by two to model a worst-case scenario (each occupied segment would be followed by a free one), multiply it by the record size for an AVL node (two memory pointers and a value, given the 128-bit nature of zfs and 64-bit addressing would sum up to 32 bytes, although people seem to generally assume 64 bytes for some reason).

zdb -mm tank | awk '/segments/ {s+=$2}END {s*=32/2; printf("Space map size sum = %d\n",s)}'

Reference: the basic outline is contained in this posting by Markus Kovero on the zfs-discuss mailing list, although I believe he made some mistakes in his calculation which I hope to have corrected in mine.

Solution 2:

Yes. You need to keep free space in your pool. It's mainly for copy-on-write actions and snapshots. Performance declines at about 85% utilization. You can go higher, but there's a definite impact.

Don't mess with reservations. Especially with NFS. It's not necessary. Maybe for a zvol, but not NFS.

I don't see the confusion, though. If you have 10T, Don't use more than 85% of it. Size your shares appropriately, using quotas to cap their use. Or don't use any quotas and monitor your overall pool usage.