ZFS hot spares versus more parity

About hot spares

Hot spares are assigned to a specific pool, but can be attached automatically to any vdev of that pool in case of a failure. If you only have a single vdev consisting of all your disks, you are better off incorporating the disks directly (unless you already have RAIDZ3 and still have disks to spare).

Additionally, resilvering takes time and happens while the pool is in a vulnerable (RAIDZ1, 2-way mirrors) or performance-reduced (RAIDZ2, RAIDZ3, 3-way mirrors) state, which would not have occurred if the device had been part of the vdev from the start.

Basically, hot spares are a thing for large arrays. If you have 27 disks split into 3 vdevs of 9 disks in RAIDZ3, you can add 3 hot spares to reduce the "It's 2 AM and 3 disks have crashed, now I have to get up and fix this mess" moments (assuming a 32-drive-bay system). Smaller systems usually don't have enough disks to even get into the "2+ vdevs and Z2/Z3" situation. An exception would be mirrors (e.g. 6 x 2), where crashes are much closer to being fatal for the pool (and you don't have enough disks to make them 6 x 3).
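For reference, here is a sketch of the mechanics (pool name `tank` and device names are placeholders): spares are added at the pool level, and automatic replacement is controlled by a pool property. On Linux, the ZFS Event Daemon (zed) must also be running for a spare to kick in on its own.

```shell
# Add two disks to the pool as shared hot spares (device names are examples)
zpool add tank spare sdx sdy

# Allow ZFS to pull in a spare automatically when a member device faults
zpool set autoreplace=on tank

# Verify: the spares appear in their own section of the status output
zpool status tank
```

Test that a spare actually takes over (e.g. by pulling a disk) before relying on it; see the warning further down about spares not always kicking in automatically.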


Optimal pool layout

Some advice from Nex7's blog regarding pool layout:

  • Do not use raidz1 for disks 1TB or greater in size.
  • For raidz1, do not use less than 3 disks, nor more than 7 disks in each vdev (and again, they should be under 1 TB in size, preferably under 750 GB in size) (5 is a typical average).
  • For raidz2, do not use less than 6 disks, nor more than 10 disks in each vdev (8 is a typical average).
  • For raidz3, do not use less than 7 disks, nor more than 15 disks in each vdev (13 and 15 are typical averages).
  • Mirrors trump raidz almost every time. Far higher IOPS potential from a mirror pool than any raidz pool, given equal number of drives. Only downside is redundancy - raidz2/3 are safer, but much slower. Only way that doesn't trade off performance for safety is 3-way mirrors, but it sacrifices a ton of space (but I have seen customers do this - if your environment demands it, the cost may be worth it).
  • For >= 3TB size disks, 3-way mirrors begin to become more and more compelling.
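These rules of thumb are mechanical enough to encode as a quick sanity check. A minimal sketch (the function name and the hard 1 TB raidz1 cutoff are my transcription of the bullets above):

```python
# Nex7's rules of thumb as a vdev sanity check.
RULES = {
    # raid level: (min disks, max disks per vdev)
    "raidz1": (3, 7),
    "raidz2": (6, 10),
    "raidz3": (7, 15),
}

def vdev_ok(level: str, disks: int, disk_tb: float) -> bool:
    """Return True if a proposed raidz vdev fits the rules of thumb."""
    if level == "raidz1" and disk_tb >= 1.0:
        return False  # no raidz1 with disks of 1 TB or larger
    lo, hi = RULES[level]
    return lo <= disks <= hi

print(vdev_ok("raidz2", 6, 4.0))   # 4+2 raidz2 with 4 TB disks -> True
print(vdev_ok("raidz1", 5, 2.0))   # 2 TB disks in raidz1 -> False
print(vdev_ok("raidz3", 16, 4.0))  # too wide for raidz3 -> False
```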

This means in your case you would have the following options:

  1. 9 disks usable: (Z3 with 9+3)
  2. 8 disks usable: (Z2 with 4+2) ++ (Z2 with 4+2)
  3. 5 disks usable: (2-mirrors) * 5 ++ (hot spare) * 2
  4. 4 disks usable: (3-mirrors) * 4

I would rank them (descending) as:

  • In terms of usable space: 1, 2, 3, 4
  • In terms of safety: 1, 2/4, 3
  • In terms of speed: 4, 3, 2, 1
  • In terms of ability to extend/add drives: 3, 4, 2, 1
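The usable-space ranking can be checked with quick arithmetic. A sketch (the labels are mine; parity disks and hot spares contribute no capacity):

```python
# Usable data disks for each 12-disk layout from the options above.
layouts = {
    "1: 1 x RAIDZ3 (9d+3p)":          9,
    "2: 2 x RAIDZ2 (4d+2p)":          2 * 4,
    "3: 5 x 2-way mirror + 2 spares": 5,
    "4: 4 x 3-way mirror":            4,
}

for name, data_disks in layouts.items():
    print(f"{name}: {data_disks}/12 disks usable ({data_disks / 12:.0%})")
```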

I would not use RAIDZ1 regardless of disk size, because you might later want to replace the disks with larger ones, and that is when the problems will show (meaning you would not want to upgrade this way and might not be able to grow the storage space without adding additional disks).
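For the record, the larger-disk upgrade path itself is straightforward (pool and device names are placeholders); the catch described above is the reduced redundancy you run with while each member resilvers:

```shell
# Swap disks for larger ones, one at a time; capacity grows only after
# every member of the vdev has been replaced.
zpool set autoexpand=on tank
zpool replace tank sda sdx   # resilver runs; repeat for each remaining disk
zpool list tank              # extra space shows up after the last replace
```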


I've just been benchmarking a test ZFS setup to answer that very question with regard to performance (on a pair of old dusty servers revived from their ashes).

My setup is:

  • 2x Intel Xeon L5640 CPU @ 2.27GHz (total: 12 cores; HT disabled)

  • 96GiB DDR3 RAM @ 1333MHz

  • Adaptec 5805Z controller, exporting all disks as JBODs (with write-cache enabled, thanks to the controller's battery-backed NVRAM)

  • 12x 15kRPM 146GB SAS disks (Seagate ST3146356SS)

  • per-disk DRBD replication (protocol C) via IP-over-Infiniband (20Gb/s Mellanox MT25204)

  • ZFS 0.7.6 on Debian/Stretch

  • zpool create -o ashift=12 ... /dev/drbd{...} (Note: DRBD works with a replication "unit" size of 4KiB)

  • zfs create -o recordsize=128k -o atime=off -o compression=off -o primarycache=metadata ... (the last two for benchmarking purposes only)

Below are the bonnie++ results for all the interesting combinations of RAIDz2 and RAIDz3 (averaged across 5 runs of 12 synchronized bonnie++ processes):

TEST: # data bandwidth
      bonnie++ -p <threads>
      for n in $(seq 1 <threads>); do
        bonnie++ -r 256 -f -s 1024:1024k -n 0 -q -x 1 -y s &
      done
      # create/stat/delete operations
      bonnie++ -p <threads>
      for n in $(seq 1 <threads>); do
        bonnie++ -r 256 -f -s 0 -n 128:0:0:16 -q -x 1 -y s &
      done

CASE: 1*RAIDz2(10d+2p+0s), ashift:12(4k), recordsize:128k, threads:12, runs:5(data)/3(ops)
 KiB/s: WR=278273, RW=150845, RD=487315
 ops/s: SCr=132681, SDl=71022, RCr=133677, RDl=71723

CASE: 1*RAIDz3(9d+3p+0s), ashift:12(4k), recordsize:128k, threads:12, runs:5(data)/3(ops)
 KiB/s: WR=276121, RW=158854, RD=480744
 ops/s: SCr=132864, SDl=71848, RCr=127962, RDl=71616

CASE: 1*RAIDz2(9d+2p+1s), ashift:12(4k), recordsize:128k, threads:12, runs:5(data)/3(ops)
 KiB/s: WR=260164, RW=151531, RD=541470
 ops/s: SCr=137148, SDl=71804, RCr=137616, RDl=71360

CASE: 1*RAIDz3(8d+3p+1s), ashift:12(4k), recordsize:128k, threads:12, runs:5(data)/3(ops)
 KiB/s: WR=269130, RW=184821, RD=672185
 ops/s: SCr=134619, SDl=75716, RCr=127364, RDl=74545

CASE: 1*RAIDz2(8d+2p+2s), ashift:12(4k), recordsize:128k, threads:12, runs:5(data)/3(ops)
 KiB/s: WR=255257, RW=135808, RD=509976
 ops/s: SCr=136218, SDl=74684, RCr=130325, RDl=73915

CASE: 2*RAIDz2(4d+2p+0s), ashift:12(4k), recordsize:128k, threads:12, runs:5(data)/3(ops)
 KiB/s: WR=379814, RW=225399, RD=586771
 ops/s: SCr=120843, SDl=69416, RCr=122889, RDl=65736

DATA: WR  = Sequential Write
      RW  = Sequential Rewrite
      RD  = Sequential Read
      SCr = Sequential Create
      SDl = Sequential Delete
      RCr = Random Create
      RDl = Random Delete

As far as performance is concerned:

  • 2*RAIDz2(4d+2p+0s) is the winner for balanced read/write performance

  • 1*RAIDz3(8d+3p+1s) wins for maximum read performance (quite strangely)

As to how to interpret/explain those results, my two cents:

  • 8 data disks divide the 128k recordsize exactly, which might explain (?) why they always outperform 9 or 10 data disks (given the test is run with a 1024k chunk size, which aligns exactly on all disks)

  • I would expect RAIDz3 to perform worse than RAIDz2, yet the 1*RAIDz3(8d+3p+1s) case very strangely contradicts this

  • the significantly smaller VDEV size of the 2*RAIDz2(4d+2p+0s) case might explain (?) why it performs significantly better for writes
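The divisibility point in the first bullet is just arithmetic; a quick sketch (assuming a record is striped evenly across the data disks in ashift-sized blocks):

```python
# How a 128 KiB record divides across the data disks with ashift=12 (4 KiB blocks).
RECORDSIZE = 128 * 1024   # bytes
BLOCK = 4 * 1024          # ashift=12

for data_disks in (8, 9, 10):
    per_disk = RECORDSIZE / data_disks
    print(f"{data_disks} data disks: {per_disk / 1024:.2f} KiB per disk, "
          f"4 KiB-aligned: {per_disk % BLOCK == 0}")
```

Only the 8-data-disk cases split the record into whole 4 KiB blocks per disk; 9 and 10 data disks do not.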

EDIT 1

In response to @AndrewHenle's comment, below are additional benchmarks with varying "chunk" sizes. Unfortunately, bonnie++ does not allow chunk sizes other than powers of 2, so I reverted to (5 averaged runs of) dd. PS: remember, the ZFS read cache (ARC) is disabled.

TEST: # WR: Sequential Write
      rm /zfs/.../dd.*
      for n in $(seq 1 <threads>); do
        dd if=/dev/zero of=/zfs/.../dd.${n} bs=<chunk> count=<count> &
      done
      # RD: Sequential Read
      for n in $(seq 1 <threads>); do
        dd of=/dev/null if=/zfs/.../dd.${n} bs=<chunk> count=<count> &
      done

CASE: 1*RAIDz2(10d+2p+0s), ashift:12(4k), recordsize:128k, threads:12, runs:5
 chunk:      1280k   1152k   1024k    128k      4k
 count:       1024   (n/a)    1024   10240  327680(32768 for RD)
 MiB/s: WR  418.64   (n/a)  434.56  404.44  361.76
        RD  413.24   (n/a)  469.70  266.58   15.44

CASE: 1*RAIDz3(9d+3p+0s), ashift:12(4k), recordsize:128k, threads:12, runs:5
 chunk:      1280k   1152k   1024k    128k      4k
 count:       1024    1024    1024   10240  327680(32768 for RD)
 MiB/s: WR  428.44  421.78  440.76  421.60  362.48
        RD  425.76  394.48  486.64  264.74   16.50

CASE: 1*RAIDz2(9d+2p+1s), ashift:12(4k), recordsize:128k, threads:12, runs:5
 chunk:      1280k   1152k   1024k    128k      4k
 count:       1024    1024    1024   10240  327680(32768 for RD)
 MiB/s: WR  422.56  431.82  462.14  437.90  399.94
        RD  420.66  406.38  476.34  259.04   16.48

CASE: 1*RAIDz3(8d+3p+1s), ashift:12(4k), recordsize:128k, threads:12, runs:5
 chunk:      1280k   1152k   1024k    128k      4k
 count:       1024   (n/a)    1024   10240  327680(32768 for RD)
 MiB/s: WR  470.42   (n/a)  508.96  476.34  426.08
        RD  523.88   (n/a)  586.10  370.58   17.02

CASE: 1*RAIDz2(8d+2p+2s), ashift:12(4k), recordsize:128k, threads:12, runs:5
 chunk:      1280k   1152k   1024k    128k      4k
 count:       1024   (n/a)    1024   10240  327680(32768 for RD)
 MiB/s: WR  411.42   (n/a)  450.54  425.38  378.38
        RD  399.42   (n/a)  444.24  267.26   16.92

CASE: 2*RAIDz2(4d+2p+0s), ashift:12(4k), recordsize:128k, threads:12, runs:5
 chunk:      1280k   1152k   1024k    128k      4k
 count:       1024   (n/a)    1024   10240  327680(32768 for RD)
 MiB/s: WR  603.64   (n/a)  643.96  604.02  564.64
        RD  481.40   (n/a)  534.80  349.50   18.52

As for my two cents:

  • ZFS obviously optimizes writes intelligently enough (even for chunk sizes below the record size), and/or (?) the Adaptec controller's (non-volatile, 512MiB) cache significantly helps in this regard

  • Obviously again, disabling the ZFS read cache (ARC) is very detrimental for chunk sizes close to or below the record size; and it seems the Adaptec controller cache is (surprisingly?) not used for reads. Bottom line: disabling the ARC for benchmarking purposes provides insight into "raw, low-level" performance, but is ill-advised for production use (apart from specific cases, like a seldom-used library of large files)

  • Adjusting the chunk size to match the VDEVs size appears not to play a positive role [WRONG ASSUMPTION; see EDIT 2]

EDIT 2

About RAIDz and block size (ashift) and record size (ZFS filesystem):

  • RAIDz fills the underlying devices with data/parity blocks whose size is dictated by the ashift size

  • records (not blocks) are the "base unit" of checksum and Copy-on-Write operations

  • ideally, the ZFS filesystem record size should be divisible by the quantity (D) of data disks in the VDEV (but since it must be a power of two, that may be difficult to achieve)

  • https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSRaidzHowWritesWork

  • https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSRaidzHowWritesWorkII

  • https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSLogicalVsPhysicalBlockSizes
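Following the linked posts, the write geometry can be sketched as follows. This is a simplification (it ignores allocation padding and partial-stripe subtleties; the function is mine): a record is split into ashift-sized data blocks, laid out in rows across the D data disks, and each row gets P parity blocks.

```python
import math

def raidz_blocks(record_kib: int, data_disks: int, parity: int, ashift_kib: int = 4):
    """Split a record into ashift-sized data blocks laid out in rows across
    the data disks; each row gets `parity` parity blocks."""
    data_blocks = record_kib // ashift_kib
    rows = math.ceil(data_blocks / data_disks)
    return data_blocks, rows * parity

# 128 KiB record, ashift=12:
print(raidz_blocks(128, 8, 3))    # 8d+3p: 4 full rows -> (32, 12)
print(raidz_blocks(128, 10, 2))   # 10d+2p: 3 full rows + 1 partial -> (32, 8)
```

This also shows why non-dividing widths (9 or 10 data disks) end a record with a partial row, as discussed in the posts above.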

And a WARNING:

  • hot spares may not work unless very carefully configured and the functionality tested

  • https://www.reddit.com/r/zfs/comments/4z19zt/automatic_use_of_hot_spares_does_not_seem_to_work/

BOTTOM LINE (confirming what has already been said in other responses)

  • (Striping across) smaller VDEVs - with fewer data disks - performs better than one large VDEV; computing/verifying the parity is obviously a costly operation, which grows worse than linearly with the quantity of data disks (cf. the 8d <-> 2*4d cases)

  • Same-size VDEVs with more parity disks perform better than VDEVs with fewer parity disks plus hot spares, and provide better data protection

  • Use hot spare(s) to address "don't wake me up in the middle of the night" concerns, if you still have disk(s) to spare after favoring parity disk(s) [WARNING! see EDIT 2]

POST SCRIPTUM

Since my eventual use case is hosting a (long-term) time-series database (steady medium-sized writes and potentially very large sporadic reads), for which I have very little detailed documentation on the I/O patterns (save for an "optimized for SSD" recommendation), I will personally go for 1*RAIDz3(8d+3p+1s): maximum security, a little less capacity, and (2nd-)best performance.


My recommendation is:

2 x 5-disk RAIDZ1 + two spares

or

3 x 3-disk RAIDZ1 + spares

or

10-disk RAID mirrors

or 2 x RAIDZ2 of 5 or 6 disks with or without spares

This depends on the disk type in use. If you're using 7200 RPM drives over 2TB, go towards RAIDZ2. If the drives are 2TB and under, RAIDZ1 is fine.
