Is it necessary to burn-in RAM for server-class hardware?

Solution 1:


The goal of burning in hardware is to stress it to the point of catalyzing a failure in a component.

Doing this with mechanical hard drives will get some results, but it's just not going to do a lot for RAM. The nature of the component is such that environmental factors and age are far more likely to be the cause of failures than reading and writing to the RAM (even at its maximum bandwidth for a few hours or days) would ever be.

Assuming your RAM is high enough quality that the solder won't melt the first time you really start to use it, a burn-in process won't help you find defects.

Solution 2:

I found a document by Kingston detailing how they work with Server Memory, I believe that this process would, normally, be the same for most known manufacturers. Memory chips, as well as all semiconductor devices, follow a particular reliability/failure pattern that is known as the Bathtub Curve:

enter image description here

Time is represented on the horizontal axis, beginning with the factory shipment and continuing through three distinct time periods:

  • Early Life Failures: Most failures occur during the early usage period. However, as time goes on, the number of failures diminishes quickly. The Early Life Failure period, shown in yellow, is approximately 3 months.

  • Useful Life: During this period, failures are extremely rare. The useful life period is shown in blue and is estimated to be 20+ years.

  • End-of-Life Failures: Eventually, semiconductor products wear out and fail. The End-of-Life period is shown in green

Now because Kingston noted that high fail-rates would occur the first three months (after these three months the unit is considered good until it's EOL about 15 - 20 years later). They designed a test using a unit called the KT2400 which brutally tests the server memory modules for 24 hours at 100 degrees celsius at high voltage, by which all cells of every DRAM chip is continuously exercised; this high level of stress testing has the effect of aging the modules by at least three months (as noted before the critical period where most modules show failures).

The results were:

In March 2004, Kingston began a six-month trial in which 100 percent of its server memory was tested in the KT2400. Results were closely monitored to measure the change in failures. In September 2004, after all the test data was compiled and analyzed, results showed that failures were reduced by 90 percent. These results exceeded expectations and represent a significant improvement for a product line that was already at the top of its class.

So why is burning in memory not useful for server memory? Simply, because it's already done by your manufacturer!

Solution 3:

We buy blades and we generally buy in reasonably large block of them at a time, as such we get them in and install them over DAYS before our network ports are ready/secure. So we use that time to use memtest for around 24hrs, sometimes longer if it goes over a weekend - once that's done we spray down the basic ESXi and IP is ready for its host profile to be applied once the network's up. So yeah we test it, more out of opportunity than necessity but it's caught a few DOA DIMMs before now, and it's not me physically doing it so it takes me no effort. I'm for it.

Solution 4:

Well I guess it depends on exactly what your processes is. I ALWAYS run MemTest86 on memory before I put it in a system (server or otherwise). After you have a system up and running, problems caused by faulty memory can be hard to troubleshoot.

As for actually "stress-testing" the memory; I have yet to even see why this would be useful unless you are testing for overclocking purposes.

Solution 5:

I don't, but I've seen people who do. I never saw them gain anything from it though, I think it might be a hangover or superstition perhaps.

Personally, i'm like you in that the ECC error rates are more useful to me - assuming the RAM isn't DOA but then you'd know that anyhow.