Failure Trends in a Large Disk Drive Population

This paper attempts to understand the factors that affect the reliability of high-capacity magnetic disk drives. Such understanding is useful for guiding the design of storage systems and data centers. The study uses a much larger population than previous studies and presents a comprehensive analysis of the correlation between failures and several parameters that are believed to affect disk lifetime.

The study looked at factors such as utilization, temperature, and SMART (Self-Monitoring, Analysis and Reporting Technology) signals.

The authors chose to measure utilization in terms of weekly averages of read/write bandwidth per drive. Apart from the infant mortality period, the data indicate a weaker correlation between utilization levels and failures than previous studies reported. It would be interesting to see whether there is any correlation between failure rates and the number of read/write operations, since a drive that makes long continuous reads may be under less stress than one that makes many small accesses at the same bandwidth.
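As an illustration of the distinction drawn above, the following is a minimal sketch of how a weekly average bandwidth and a weekly operation count could be computed side by side for each drive. It is not the authors' pipeline: the function name `weekly_utilization`, the sample layout (drive id, timestamp in seconds, bytes transferred), and the assumption that each sample is one I/O operation are all hypothetical.

```python
from collections import defaultdict

SECONDS_PER_WEEK = 7 * 24 * 3600


def weekly_utilization(samples):
    """Summarize per-drive I/O by week.

    samples: iterable of (drive_id, timestamp_seconds, bytes_transferred),
             where each tuple is assumed to be one read or write operation
             (a hypothetical log format, not one taken from the paper).
    Returns {(drive_id, week_index): (avg_bytes_per_second, operation_count)}.
    """
    total_bytes = defaultdict(int)
    op_count = defaultdict(int)
    for drive_id, timestamp, nbytes in samples:
        week = int(timestamp // SECONDS_PER_WEEK)
        total_bytes[(drive_id, week)] += nbytes
        op_count[(drive_id, week)] += 1
    return {
        key: (total_bytes[key] / SECONDS_PER_WEEK, op_count[key])
        for key in total_bytes
    }
```

Two drives can show the same weekly average bandwidth while differing widely in operation count, which is the case the bandwidth-only metric cannot distinguish.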

The authors reported temperature effects only at the high end of the temperature range (above 45 degrees C), and especially for older drives. This result could indicate that datacenter or server designers have more freedom than previously thought when setting operating temperatures for equipment that contains disk drives. Datacenters can allow higher temperatures to lower cooling costs with minimal effect on drive failures. In a system that can accept such failures, this tradeoff may be cost effective. Also, since hard drives have a high infant mortality rate under low temperatures, preshipment testing should be done under low temperatures.

The authors were able to see that most age-related results are affected by drive manufacturer and model. However, they do not show a breakdown of drives by manufacturer, model, or vintage “due to the proprietary nature of these data.” The only reason this information is “proprietary” is that Google wishes to maintain a competitive advantage over other companies purchasing large numbers of hard drives. This is the most unfortunate decision made by the authors of the paper. Making this information public would allow manufacturers to see which conditions correlate with failures and to relate them to the design or manufacturing decisions made for a particular drive model.

The long-term value of this study will lie in the periodic publication of results, allowing companies to use this knowledge to reduce costs and hard drive manufacturers to attempt design, manufacturing, and testing changes that decrease failures in real-world conditions.

