Sunday, February 1, 2009

Failure Stories

The problem examined by these articles is how to cope with data center failures and how to prevent them.

The articles analyze why various data centers failed and how similar failures will be prevented in the future. The difficulty with such an analysis is summed up by the Shumate article: “Every data center is unique - there is no other like it. Every design is a custom solution based on the experience of the engineer and the facility executive.” The same article continues with an overview of design principles for a reliable data center. The fundamental lesson is that no one can guarantee that a data center will never fail; while the manager should build in enough redundancy to minimize the chance of a failure, the design should also aim to “fail small,” as the Miller article advises.

As we discussed in the previous articles, even the most fault-tolerant hardware is not infallible. There is now an increased focus on providing fault tolerance through software, and this model should be extended across data centers: even if an entire data center fails, the service should be degraded only by that data center's fraction of the total resources available across all data centers.
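As a rough illustration of the idea, here is a minimal sketch of a client that treats each data center as an interchangeable replica of the same service; the data center names and functions are hypothetical, not taken from the articles. A request fails only if every data center is down, and losing one site costs only that site's share of capacity.

```python
import random

# Hypothetical sketch: the service runs in several data centers, and any
# one of them can answer any request. A request fails only if all fail.

DATA_CENTERS = ["dc-sf", "dc-la", "dc-ny", "dc-chi", "dc-sea"]  # assumed names

def serve(request, failed=frozenset()):
    """Try data centers in random order; return the first live response."""
    for dc in random.sample(DATA_CENTERS, len(DATA_CENTERS)):
        if dc not in failed:
            return f"response to {request!r} from {dc}"
    raise RuntimeError("all data centers are down")

# With one entire data center dark, every request is still answered;
# only aggregate capacity drops, by 1/5 in this example.
print(serve("GET /"))
print(serve("GET /", failed={"dc-sf"}))
```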

The case study of 365 Main’s outage illustrates what an approach that did not depend on the reliability of the power grid, or on whether backup systems would respond effectively, could have achieved. The power outage affected “certain customers” (CraigsList, Technorati, LiveJournal, TypePad, AdBrite, the 1Up gaming network, Second Life and Yelp, among others) in 3 of 8 colocation rooms in 365 Main’s San Francisco data center. If 365 Main’s data center had operated on a distributed model that allowed any machine to handle any request, the data center would still have been able to provide 5/8 of its full capacity to all of its customers. While one may argue that this would have been worse for the customers who were not affected, those customers should realize that they were merely lucky and that they cannot rely upon 365 Main’s assurances of uninterrupted service. In fact, with replication and distributed storage and service across data centers, the outage would likely not have affected customers’ service at all, since 365 Main operates 7 data centers.
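The arithmetic behind that claim can be made concrete. A minimal sketch, assuming only the simplified numbers above (8 rooms, 3 without power) and ignoring real-world load-balancing overhead:

```python
# Simplified comparison of the two models in the 365 Main scenario:
# 8 colocation rooms, 3 of which lose power.
ROOMS, FAILED = 8, 3

def pinned_capacity(room_failed):
    """Colocation model: a customer pinned to a failed room gets nothing."""
    return 0.0 if room_failed else 1.0

def distributed_capacity():
    """Distributed model: everyone keeps the surviving fraction of capacity."""
    return (ROOMS - FAILED) / ROOMS

print(pinned_capacity(room_failed=True))   # 0.0   -- total outage
print(pinned_capacity(room_failed=False))  # 1.0   -- merely lucky
print(distributed_capacity())              # 0.625 -- 5/8 for every customer
```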

Many companies have already realized that an architecture of clusters of thousands of commodity-class PCs, with software providing fault tolerance, can deliver throughput at a fraction of the cost of a system built from fewer, more reliable, more expensive, high-end servers. Replication across these clusters also addresses the inherent unreliability of each machine: any failures merely degrade overall performance by the fraction of the machines that have failed. In the coming years, data center managers will realize that the resources devoted to “guaranteeing” that a data center will not fail would be better spent on a greater number of computers at many different locations.
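One common way software provides this kind of fault tolerance is to place each piece of data on several machines, so a failure leaves other replicas to serve it. The sketch below uses rendezvous hashing for replica placement; the cluster size, replica count, and names are illustrative assumptions, not any particular company's design.

```python
import hashlib

# Hypothetical sketch: each key lives on R machines, chosen by rendezvous
# (highest-random-weight) hashing, so one failure never loses data and
# throughput falls only by the fraction of machines that are down.

MACHINES = [f"node{i:03d}" for i in range(100)]  # assumed cluster of 100 PCs
R = 3  # assumed replicas per key

def replicas(key, machines=MACHINES, r=R):
    """Return the r machines responsible for key, highest hash score first."""
    score = lambda m: hashlib.sha1(f"{key}:{m}".encode()).hexdigest()
    return sorted(machines, key=score, reverse=True)[:r]

def read(key, failed=frozenset()):
    """Serve a read from any surviving replica of key."""
    for node in replicas(key):
        if node not in failed:
            return f"{key} served by {node}"
    raise RuntimeError("all replicas of this key are down")

# Losing the primary replica only shifts the read to another copy.
primary = replicas("user:42")[0]
print(read("user:42"))
print(read("user:42", failed={primary}))
```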
