Mean Time Between Failures -- SSD

Solution 1:

Drive manufacturers specify the reliability of their products in terms of two related metrics: the annualized failure rate (AFR), which is the percentage of disk drives in a population that fail in a test scaled to a per year estimation; and the mean time to failure (MTTF).

The AFR of a new product is typically estimated based on accelerated life and stress tests or based on field data from earlier products. The MTTF is estimated as the number of power on hours per year divided by the AFR. A common assumption for drives in servers is that they are powered on 100% of the time.

http://www.cs.cmu.edu/~bianca/fast/

MTTF of 1.5 million hours sounds somewhat plausible.

That roughly corresponds to a test with 1000 drives running for 6 months in which 3 drives fail.
Scaling the 6 months up to a full year, the AFR would be (2 * 3 failures)/(1000 drives) = 0.6% annually, and the MTTF = 1 yr/0.6% = 1,460,967 hours, or 167 years.

Another way to look at that number: if you run 167 drives for a year, the manufacturer claims that on average you'll see one drive fail.
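
The arithmetic above can be sketched in a few lines of Python. The function names are made up for illustration, and the 8766 power-on hours per year (365.25 days, powered on 100% of the time) is an assumption, so the result only matches the figure quoted above to within rounding.

```python
HOURS_PER_YEAR = 365.25 * 24  # assumption: drive is powered on all year

def annualized_failure_rate(failures, drives, test_months):
    """Scale the failure fraction seen in a test to a per-year estimate."""
    return failures / drives * (12 / test_months)

def mttf_hours(afr, power_on_hours_per_year=HOURS_PER_YEAR):
    """MTTF = power-on hours per year divided by the AFR."""
    return power_on_hours_per_year / afr

# The example from the text: 1000 drives, 6 months, 3 failures.
afr = annualized_failure_rate(failures=3, drives=1000, test_months=6)
mttf = mttf_hours(afr)

print(f"AFR:  {afr:.2%}")
print(f"MTTF: {mttf:,.0f} hours (about {mttf / HOURS_PER_YEAR:.0f} years)")
print(f"Drives needed to expect one failure per year: {1 / afr:.0f}")
```

Note that the "167 years" and the "167 drives" are the same number, 1/AFR, read in two different ways.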

But I expect that is simply the constant "random" mechanical/electronic failure rate.

Assuming that failure rates follow the bathtub curve, as mentioned in the comments, the manufacturer's marketing team can massage the reliability numbers a bit, for instance by not including DOAs (dead on arrival: units that passed quality control but fail when the end user installs them) and by stretching the DOA definition to also exclude failures in the early failure spike. And because testing isn't performed for long enough, you won't see age effects either.

I think the warranty period is a better indication of how long a manufacturer really expects an SSD to last!
That definitely won't be measured in decades or centuries...

Related to the MTBF is the reliability limit imposed by the finite number of write cycles that NAND cells can support. A common metric is the total write capacity, usually in TB. Alongside other performance requirements, that is one big limiter.

To allow a more convenient comparison between different makes and differently sized drives, the write endurance is often converted to a daily write capacity, expressed as a fraction of the disk capacity.

Assuming that a drive is rated to live as long as it's under warranty,
a 100 GB SSD may have a 3 year warranty and a write capacity of 50 TB:
        50 TB
---------------------  = 0.46 drive per day write capacity.
3 * 365 days * 100 GB

The higher that number, the more suited the disk is for write intensive IO.
At the moment (end of 2014) value server line SSDs have a value of 0.3-0.8 drive/day, mid-range is increasing steadily from 1-5, and high-end seems to sky-rocket, with write endurance levels of up to 25 times the drive capacity per day for 3-5 years.
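
The drive-per-day figure is easy to recompute for any datasheet. A minimal sketch, assuming decimal units (1 TB = 1000 GB) and that the rated endurance is spread evenly over the warranty period; the function name is made up for illustration:

```python
def drive_writes_per_day(endurance_tb, capacity_gb, warranty_years):
    """Daily write capacity as a fraction of the drive capacity."""
    endurance_gb = endurance_tb * 1000  # assumption: decimal TB, as vendors use
    return endurance_gb / (warranty_years * 365 * capacity_gb)

# The example from the text: 100 GB drive, 3 year warranty, 50 TB endurance.
dwpd = drive_writes_per_day(endurance_tb=50, capacity_gb=100, warranty_years=3)
print(f"{dwpd:.2f} drive writes per day")
```

For the example drive this gives the same 0.46 as the calculation above.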

Some real world tests show that sometimes the vendor claims can be massively exceeded, but driving equipment way past the vendor limits isn't always an enterprise consideration... Instead buy correctly spec'd drives for your purposes.

Solution 2:

Unfortunately the MTBF isn't what most people think...

It is not how long an individual drive will last.

Manufacturers expect their drives to last as long as the warranty; after that it really isn't their problem. Older electromagnetic platter hard drives will seize up after 10 or so years. Integrated circuits last an extremely long time, but other components (notably capacitors) wear out after a somewhat predictable number of cycles.

It is how many of these drives you would need to expect 1 drive to fail every hour.

As others have pointed out, manufacturers run various tests over a reasonable period of time and determine a failure rate. There's a fair amount of variance in these sorts of tests, and marketing often has "input" as to what the final number should be. Regardless, they make a best-effort guess as to how many drives would be needed to average one failure per hour.

For situations with fewer drives you can infer a statistical probability of failure based on the MTBF, but keep in mind that failures in well designed products should follow a "bathtub" curve: higher failure rates when devices are initially put into service and after their warranty period has expired, with lower failure rates in between.
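
One common way to turn an MTBF into a probability is to assume a constant failure rate (an exponential lifetime model). That is an assumption made here for illustration: it describes only the flat bottom of the bathtub curve and ignores early failures and wear-out. The function names and the 1.5 million hour figure are likewise just illustrative inputs.

```python
import math

def failure_probability(hours, mtbf_hours):
    """P(a single drive fails within `hours`), assuming a constant failure rate."""
    return 1 - math.exp(-hours / mtbf_hours)

def expected_failures(drives, hours, mtbf_hours):
    """Expected number of failures in a fleet, under the same assumption."""
    return drives * hours / mtbf_hours

MTBF = 1_500_000      # hours, illustrative vendor claim
YEAR = 365.25 * 24    # hours, assuming the drive is always powered on

p_year = failure_probability(YEAR, MTBF)
print(f"One drive, one year: {p_year:.2%} chance of failure")

# The "one failure per hour" reading: a fleet as large as the MTBF in hours.
per_hour = expected_failures(drives=MTBF, hours=1, mtbf_hours=MTBF)
print(f"{MTBF:,} drives for one hour: {per_hour:.1f} expected failures")
```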

Solution 3:

They come from a statistical evaluation based on a small sample size and a short amount of time. There's really no universally agreed upon method or process so it's really just silly 'marketing'.

This article may explain it a bit more. And Wikipedia has some formulas which might be what you're looking for?

Essentially, for nearly everything (including general household machines such as a dishwasher), several products are run for X amount of time. The number of failures during this period is used to calculate the MTBF.

It's of course not feasible to run products through an entire lifecycle, e.g. SSDs, which will last a long time. They are mostly limited by the amount of writes rather than by mechanical failure (which is what MTBF is for).

Solution 4:

The bad news about MTBF is that common evaluation methods assume an evenly distributed write load across all NAND cells. But cells are grouped into clusters, and when a single cell fails, the whole cluster is marked as dead and replaced with a new one from the reserve. The reserve is usually about 20% of the SSD's capacity. When the reserve is exhausted, the whole SSD is marked as dead.

In real life an SSD holds persistent data as well as volatile data. Imagine that 90% of the SSD is filled with static data and the remaining 10% is under heavy write load. The SSD controller spreads the load among the available free clusters. That 10% exhausts its lifespan 10 times faster than you estimated. Its clusters will be replaced from the reserve again and again until the end.

In the really bad case where the ratio of persistent to volatile data is 30:1 or greater (for example, a pile of photos and a relatively small database for a popular website), your SSD will die in a year.
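
The effect described above can be put into numbers with a deliberately crude model. It assumes that the rated endurance presumes an even spread of writes, that static data is never relocated (a controller that also moves static data around would behave differently), and it ignores the reserve. The drive figures and function name are hypothetical.

```python
def effective_lifetime_years(endurance_tb, daily_writes_gb, written_fraction=1.0):
    """Years until the cells that actually take the writes are worn out.

    written_fraction is the share of the drive absorbing the write load
    (assumption: the rest holds static data that is never moved).
    """
    effective_endurance_gb = endurance_tb * 1000 * written_fraction
    return effective_endurance_gb / (daily_writes_gb * 365)

# Hypothetical drive: 50 TB rated endurance, 20 GB written per day.
even = effective_lifetime_years(50, 20)
hot_tenth = effective_lifetime_years(50, 20, written_fraction=0.10)
hot_30_to_1 = effective_lifetime_years(50, 20, written_fraction=1 / 31)

print(f"Writes spread evenly:       {even:.1f} years")
print(f"90% static, 10% written:    {hot_tenth:.1f} years")
print(f"30:1 static to written:     {hot_30_to_1:.1f} years")
```

Under these assumptions the lifetime shrinks in direct proportion to the share of the drive that takes the writes, which is the factor of 10 (or 31) described above.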

One of my customers was very impressed with SSD characteristics and insisted on equipping his DBMS server with a pair of them. Over the next 12 months we replaced both of them twice.

But according to the marketing materials, the lifespan of an SSD is 170 years. Sure.
