The importance of ECC memory

Solution 1:

Data published by CERN IT staff (their Data Integrity study) suggests that the rate of errors originating in RAM is quite low. You still have to weigh the value of your data against the cost of the hardware.

You can read a bit more about this at StorageMojo.

Solution 2:

ECC RAM detects and corrects errors that occur when reading from and writing to RAM. The chance of such an error actually occurring is quite small, but non-zero. I would say that if you aren't doing mission-critical work you can get away without ECC RAM - like I said, the chances of encountering an error that ECC would prevent are really, really small.
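
To make the correction mechanism concrete, here is a toy Hamming(7,4) encoder/decoder in Python. It is only an illustrative sketch of the single-error-correction principle; real ECC DIMMs apply a SECDED code to each 64-bit word (stored as 72 bits) in the memory controller, not in software.

```python
# Toy Hamming(7,4) code: the same single-error-correcting idea that ECC
# memory applies in hardware. Illustrative sketch only.

def encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]   # positions 1..7

def decode(c):
    """Return the 4 data bits, correcting a single flipped bit if present."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3       # 0 = no error, else 1-based position
    if syndrome:
        c[syndrome - 1] ^= 1              # correct the flipped bit
    return [c[2], c[4], c[5], c[6]]

data = [1, 0, 1, 1]
word = encode(data)
word[5] ^= 1                              # simulate a single bit flip in "RAM"
assert decode(word) == data               # the error is corrected transparently
```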


Solution 3:

What is a non-critical server? One that can fail?

ECC RAM is fundamental wherever memory reliability is fundamental.

Two things grow as memory sizes grow:

  • the reliance of software on memory, especially server software (take caching, for example)
  • the probability of a memory error (roughly p ≈ num_bits * p_bit_failure for a small per-bit failure rate; see the sketch after this list)
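
A minimal sketch of that scaling, using a made-up per-bit failure probability purely to show how the chance of at least one error grows with memory size (the linear formula above is the small-p approximation of the exact expression):

```python
# Back-of-the-envelope scaling of memory error probability with capacity.
# p_bit is a hypothetical per-bit failure probability for some time window,
# chosen only to illustrate the trend, not a real figure.

p_bit = 1e-15                                  # hypothetical per-bit probability

for gib in (4, 16, 64, 256):
    num_bits = gib * 2**30 * 8
    approx = num_bits * p_bit                  # the linear approximation above
    exact = 1 - (1 - p_bit) ** num_bits        # P(at least one bit fails)
    print(f"{gib:>4} GiB: approx {approx:.2e}, exact {exact:.2e}")
```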

This Intel presentation on ECC reports these facts:

  • Average rate of memory errors for a server with 4 GB of memory running 24x7 is 150 per year
  • ~4000 correctable errors per memory module per year
  • Overclocking and system age greatly increase failure rates
  • Recurrent failures are common and happen quickly (97% occur within 10 days of first failure) => avalanche effect
  • For an ECC server with a lifespan of 3 to 5 years, the chance of a system failure due to an uncorrectable memory error is less than 0.001%
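
To make those rates more tangible, here is a small back-of-the-envelope conversion in Python; the inputs are simply the figures quoted above from the presentation, not measurements of any particular system:

```python
# Rough arithmetic on the figures quoted above.
errors_per_year_4gb_server = 150
errors_per_module_per_year = 4000

hours_per_year = 24 * 365
print(f"4 GB server: one error roughly every "
      f"{hours_per_year / errors_per_year_4gb_server:.0f} hours")
print(f"per module:  roughly {errors_per_module_per_year / 365:.0f} "
      f"correctable errors per day")
```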

Recent research from the University of Wisconsin (WISC) shows ECC to be essential for ZFS systems as well:

ZFS has no precautions for memory corruptions: bad data blocks are returned to the user or written to disk, file system operations fail, and many times the whole system crashes.

It is important to note that other filesystems are just as sensitive to this form of data corruption as ZFS is.
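
Here is a small sketch of why on-disk checksumming alone does not help: if a bit flips in a buffer without ECC, the filesystem happily checksums and stores the already-corrupted data. This is a generic illustration of the failure mode, not ZFS's actual code path, and the example data is made up.

```python
# Why filesystem checksums don't catch in-memory corruption: a bit that flips
# in RAM before the checksum is computed (or after it was verified) travels
# to disk with a perfectly valid checksum.

import zlib

def write_block(data: bytes):
    checksum = zlib.crc32(data)       # the filesystem checksums the buffer it sees
    return data, checksum             # both go to disk together

def read_block(data: bytes, checksum: int) -> bytes:
    assert zlib.crc32(data) == checksum, "on-disk corruption detected"
    return data

buf = bytearray(b"important payroll record")
buf[3] ^= 0x08                        # undetected bit flip in RAM (no ECC)

stored = write_block(bytes(buf))      # corrupted data is checksummed as-is
print(read_block(*stored))            # reads back "successfully", but wrong
```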

ECC is what saves you from these problems when it can, and in disastrous cases, what warns you that they are happening before it's too late.