Hard Disk MTBF: Flap or Farce?


Data sheets for hard drives have always included a specification for reliability expressed in hours, commonly known as MTBF (mean time between failures), or sometimes as MTTF (mean time to failure). Same difference: One way assumes that a drive will be fixed, and the other, replaced. Nowadays, this number is around a million hours for an “enterprise” hard drive. Some drives are rated at 1.5 million hours.

Now, that’s a good stretch of time. After all, a year is only 8,760 hours. One million hours comes to a bit more than 114 years. Some may be scratching their heads, since the hard drive itself has only been around for 50 years, starting with IBM’s giant 350 Disk Storage Unit for its RAMAC computer. This can be confusing.

The MTBF is not a measured lifespan. Instead, it is a statistical measure based on a calculation extrapolated from much shorter test periods. Taken at face value, it means that drives are very reliable, with a failure rate well under 1 percent per year. Go Team Storage!
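To make the arithmetic concrete, here is a rough sketch of how an MTBF rating turns into years and into an annual failure rate. It assumes a drive that runs around the clock and fails at a constant rate (the simple exponential model), which is an assumption of this sketch and not something the data sheets promise. The function names are made up for illustration.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days x 24 hours, the figure used in the text


def mtbf_in_years(mtbf_hours: float) -> float:
    """Express an MTBF rating in years of continuous operation."""
    return mtbf_hours / HOURS_PER_YEAR


def annual_failure_rate(mtbf_hours: float) -> float:
    """Chance that a drive fails within one year of 24x7 operation.

    Assumption: failures arrive at a constant rate (exponential model),
    so P(fail within t hours) = 1 - exp(-t / MTBF).
    """
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


if __name__ == "__main__":
    # The two ratings mentioned in the text.
    for rating in (1_000_000, 1_500_000):
        print(f"{rating:>9,} h = {mtbf_in_years(rating):6.1f} years, "
              f"annual failure rate = {annual_failure_rate(rating):.2%}")
```

Under these assumptions, both ratings work out to less than 1 percent per year, which is where the "well under 1 percent" reading comes from.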

However, several papers covering large-scale storage presented at FAST ’07, the USENIX conference on File and Storage Technologies, held recently in San Jose, Calif., are kicking up a stir online about MTBF.

The Best Paper award was handed to “Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?” by Bianca Schroeder and Garth Gibson of Carnegie Mellon University in Pittsburgh.

Their study tracked a whopping set of drives used at large-scale storage sites, including high-performance computing and Web servers. The data suggests that several pieces of conventional wisdom about disk reliability are wrong.

For example, they found that annual disk replacement rates were more in the range of 2 to 4 percent and were as high as 13 percent for some sites. Yikes.
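One way to see how far those numbers sit from the data sheet is to run the calculation backwards and ask what MTBF each observed rate would imply. The sketch below is only illustrative: it treats every replacement as a failure and assumes a constant failure rate with drives powered on all year, neither of which the study claims. The helper names and the 1,000,000-hour baseline constant are chosen here for the example.

```python
import math

HOURS_PER_YEAR = 8760
DATASHEET_MTBF_HOURS = 1_000_000  # the "enterprise" rating quoted in the text


def implied_mtbf_hours(annual_rate: float) -> float:
    """MTBF that would produce the given annual rate.

    Assumption: constant failure rate (exponential model), 24x7 operation.
    """
    if not 0.0 < annual_rate < 1.0:
        raise ValueError("annual_rate must be strictly between 0 and 1")
    return -HOURS_PER_YEAR / math.log(1.0 - annual_rate)


def datasheet_annual_rate(mtbf_hours: float) -> float:
    """Annual failure rate the rating implies under the same model."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


if __name__ == "__main__":
    baseline = datasheet_annual_rate(DATASHEET_MTBF_HOURS)
    # Observed annual replacement rates reported in the text.
    for observed in (0.02, 0.04, 0.13):
        print(f"observed {observed:.0%}: implied MTBF "
              f"{implied_mtbf_hours(observed):,.0f} h, "
              f"{observed / baseline:.1f}x the datasheet rate")
```

With these assumptions, even the low end of the range, 2 percent a year, corresponds to an MTBF of a bit over 430,000 hours, well short of the million on the data sheet.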

I found this fascinating article about MTBF and disk failures yesterday. I have known for some time that you must take the MTBF figures with a grain of salt. Disk drives appear to fail more often than the MTBF figures would lead you to believe. “Enterprise” disk drives and “retail” disk drives appear to be indistinguishable in the real world. Yet as IT professionals we will always recommend the component with the higher perceived quality, even though we have misgivings about the statistics. For most businesses, the cost of downtime due to a disk failure is much higher than the additional cost of quality. Although we hate to admit it, there is a significant subjective component to our component recommendations.