BBWC: in theory a good idea but has one ever saved your data?

Solution 1:

Sure. I've had battery-backed cache (BBWC) and later flash-backed write cache (FBWC) protect in-flight data following crashes and sudden power loss.

On HP ProLiant servers, the typical message is:

POST Error: 1792-Drive Array Reports Valid Data Found in Array Accelerator

Which means, "Hey, there's data in the write cache that survived the reboot/power-loss!! I'm going to write that back to disk now!!"

An interesting case was my post-mortem of a system that lost power during a tornado. The array's POST sequence was:

POST Error: 1793-Drive Array - Array Accelerator Battery Depleted - Data Loss
POST Error: 1779-Drive Array Controller Detects Replacement Drives
POST Error: 1792-Drive Array Reports Valid Data Found in Array Accelerator

The 1793 POST error is the unusual one. Power was interrupted while the system was in use and data was sitting in the Array Accelerator memory. Because this was a tornado, power was not restored for four days, so the array batteries were depleted and the cached data was lost. The server had two RAID controllers. The other controller had an FBWC unit, which retains data far longer than a battery. That array recovered properly. Some data corruption resulted on the array backed by the depleted battery.


Despite plenty of battery runtime at the facility, four days without power and hazardous conditions made it impossible for anyone to shut the servers down safely.
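
When reviewing a boot log after an outage, it helps to separate the "cache saved you" message from the "cache lost data" message at a glance. Here is a minimal sketch that does so for the three codes quoted above. The severity labels (`recovered`, `data_loss`, `info`) are my own shorthand, not HP terminology, and the table only covers the codes mentioned in this answer.

```python
import re

# Codes and meanings taken from the POST messages quoted in this answer.
# The severity labels are illustrative assumptions, not vendor terms.
POST_CODES = {
    "1792": ("recovered", "valid data found in the write cache and flushed to disk"),
    "1793": ("data_loss", "cache battery depleted, cached writes lost"),
    "1779": ("info", "controller detected replacement drives"),
}

POST_LINE = re.compile(r"POST Error:\s*(\d{4})-")


def classify_post_log(text):
    """Return a list of (code, severity) for each POST error line in text."""
    results = []
    for line in text.splitlines():
        match = POST_LINE.search(line)
        if match:
            code = match.group(1)
            severity, _meaning = POST_CODES.get(code, ("unknown", "not in table"))
            results.append((code, severity))
    return results


TORNADO_LOG = """\
POST Error: 1793-Drive Array - Array Accelerator Battery Depleted - Data Loss
POST Error: 1779-Drive Array Controller Detects Replacement Drives
POST Error: 1792-Drive Array Reports Valid Data Found in Array Accelerator
"""

for code, severity in classify_post_log(TORNADO_LOG):
    print(code, severity, "-", POST_CODES[code][1])
```

Any line carrying a `data_loss` code is the one to chase first, since it means the affected array needs a consistency check.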

Solution 2:

Yes, I've had that happen.

Server "without UPS" in a data center (with the data center having a UPS). PDU failure - system crashed hard. No data loss.

And that basically is it. The good thing about a BBWC is that it is inside the machine, whereas a UPS is external. Have a UPS anyway, but believe me, sometimes someone does something stupid (like pulling the wrong cable). Oh, THAT cable ;)


Solution 3:

I've had two cases, at two separate companies, where the battery-backed cache in a HW RAID controller failed completely.

BBWC relies on the unsurprising idea that the battery works. The catch is that at some point the battery in the controller fails, and what's devastating is that in many HW RAID controllers it fails silently. We thought we had a cache protected against power loss, but we did not.

On power loss, the damage to the RAID array was so extensive that the entire disk contents were unrecoverable. Everything was lost. One of the cases involved a machine dedicated entirely to testing, but still.

After that I said "never again", switched to software-based disk mirroring (mdadm) on Linux plus a journaling filesystem with decent resilience against power loss (ext4), and never looked back. Granted, I've used it on servers that did not have extremely high IO usage.
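
For anyone who stays on hardware RAID, the silent battery failure described above is something a scheduled check can catch. Below is a minimal sketch of one, assuming the controller's CLI prints a plain "Key: Value" status report. The field names `Cache Status` and `Battery/Capacitor Status` and the sample text are assumptions for illustration, so adjust them to whatever your controller actually prints.

```python
def cache_protection_problems(
    status_text,
    required=("Cache Status", "Battery/Capacitor Status"),
):
    """Return a list of problems found in a 'Key: Value' status report.

    A required field that is absent counts as a problem too, because a
    battery that silently vanished from the report is exactly the failure
    mode we want to notice.
    """
    fields = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()

    problems = []
    for name in required:
        value = fields.get(name)
        if value is None:
            problems.append(f"{name}: missing from report")
        elif value.upper() != "OK":
            problems.append(f"{name}: {value}")
    return problems


# Hypothetical report text, not captured from a real controller.
SAMPLE_REPORT = """\
Controller Status: OK
Cache Status: Temporarily Disabled
Battery/Capacitor Status: Failed (Replace Batteries)
"""

for problem in cache_protection_problems(SAMPLE_REPORT):
    print("ALERT:", problem)
```

Run from cron or a monitoring agent, a non-empty result is the signal that the write cache is no longer protected.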


Solution 4:

This seems to necessitate a second answer to the question...

I just had a standalone VMware ESXi host lose a drive in a RAID 5 array. The degraded array impacted performance at the VM and application level.

Smart Array P410i in Slot 0 (Embedded)    (sn: 5001438011138950)

   array A (SAS, Unused Space: 0  MB)

      logicaldrive 1 (1.6 TB, RAID 5, Recovering, 42% complete)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 300 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 300 GB, Rebuilding)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 300 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 300 GB, OK)
      physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SAS, 300 GB, OK)
      physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SAS, 300 GB, OK)
      physicaldrive 2I:1:7 (port 2I:box 1:bay 7, SAS, 300 GB, OK)
      physicaldrive 2I:1:8 (port 2I:box 1:bay 8, SAS, 300 GB, OK, spare)

The IT person at this firm was not aware that a drive had failed and hard-reset the server (to make it all better?).

The interesting effect of doing this to a degraded array with busy virtual machines running on top was this:

Cache Status Details: The current array controller had valid data stored in its battery/capacitor backed write cache the last time it was reset or was powered up. This indicates that the system may not have been shut down gracefully. The array controller has automatically written, or has attempted to write, this data to the drives. This message will continue to be displayed until the next reset or power-cycle of the array controller.

So even though the system was halted abruptly, the in-flight data was protected by the BBWC. The virtual machines all recovered properly and the system is in good shape now.
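
The drive failure in this story went unnoticed until someone pressed the reset button. Here is a minimal sketch of catching it earlier, assuming a configuration listing shaped like the one earlier in this answer. The position of the status field within the parentheses is inferred from that listing only and may differ on other controllers or firmware versions.

```python
import re

# Index of the status field inside the parentheses, inferred from the
# listing in this answer (an assumption, not a documented format).
STATUS_FIELD = {"physicaldrive": 3, "logicaldrive": 2}

DRIVE_LINE = re.compile(r"(physicaldrive|logicaldrive)\s+(\S+)\s+\((.*)\)")


def non_ok_drives(listing):
    """Return (kind, id, status) for every drive whose status is not OK."""
    problems = []
    for line in listing.splitlines():
        match = DRIVE_LINE.search(line)
        if not match:
            continue
        kind, drive_id, details = match.groups()
        fields = [field.strip() for field in details.split(",")]
        index = STATUS_FIELD[kind]
        if index >= len(fields):
            continue  # unexpected layout, skip rather than guess
        status = fields[index]
        if status != "OK":
            problems.append((kind, drive_id, status))
    return problems


LISTING = """\
   array A (SAS, Unused Space: 0  MB)
      logicaldrive 1 (1.6 TB, RAID 5, Recovering, 42% complete)
      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 300 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 300 GB, Rebuilding)
      physicaldrive 2I:1:8 (port 2I:box 1:bay 8, SAS, 300 GB, OK, spare)
"""

for kind, drive_id, status in non_ok_drives(LISTING):
    print(f"{kind} {drive_id}: {status}")
```

Feeding the result into whatever alerting is already in place would have told the IT person about the failed drive before the reset.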


Solution 5:

In addition to "saving your data", they are good for other things. They buffer writes in the cache, which improves the performance of the IO subsystem by keeping the disk write queue short. This is particularly important for servers where interactive performance is paramount - for example, Citrix XenApp or Windows Terminal Services.

This is less important for a web server or a file server, where you might not notice a little lag, or might even be used to it. However, when you click on an icon in an Office application, you expect responsiveness. And so does your CEO.
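
To put rough numbers on the buffering effect, here is a toy model of how long each write waits before it is acknowledged, with and without a write-back cache. It is a deliberate simplification: one disk, a fixed service time per write, and a cache measured in whole write slots. All figures are made up for illustration.

```python
def ack_latencies(arrivals_ms, disk_service_ms, cache_slots):
    """Return the acknowledgement latency (ms) of each write.

    cache_slots == 0 models write-through: a write is acknowledged only
    once the disk has finished it. With cache_slots > 0 a write is
    acknowledged as soon as a cache slot is free, and the disk drains
    the cache in the background.
    """
    disk_free_at = 0   # time at which the disk finishes its backlog
    completions = []   # disk completion time of each cached write
    latencies = []
    for i, arrival in enumerate(arrivals_ms):
        if cache_slots == 0:
            start = max(arrival, disk_free_at)
            disk_free_at = start + disk_service_ms
            latencies.append(disk_free_at - arrival)
            continue
        # A slot frees up when the write cache_slots positions earlier
        # has reached the disk.
        if i >= cache_slots:
            ack = max(arrival, completions[i - cache_slots])
        else:
            ack = arrival
        start = max(ack, disk_free_at)
        disk_free_at = start + disk_service_ms
        completions.append(disk_free_at)
        latencies.append(ack - arrival)
    return latencies


burst = [0, 0, 0, 0]  # four writes arriving at once
print("no cache:    ", ack_latencies(burst, disk_service_ms=5, cache_slots=0))
print("2-slot cache:", ack_latencies(burst, disk_service_ms=5, cache_slots=2))
print("4-slot cache:", ack_latencies(burst, disk_service_ms=5, cache_slots=4))
```

In this model a burst that fits in the cache is acknowledged immediately, which is the responsiveness an interactive user notices. Once the cache fills, latency climbs back toward the disk's own pace.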