Wednesday, January 28, 2009

An Architecture for Modular Data Centers

Not every company is as large as Google, with trained teams on hand to manage hardware failures in its systems. Expanding capacity can be difficult without enough trained personnel. Even for a data center that is already up and running, on-site hardware service is expensive because skilled service personnel must be available at each site. Staffing a data center with full-time service technicians is seldom cost-effective unless the facility is very large. Most services contract this work out and can spend 25% of the system’s price on it over a 3-year service life. Human administrative error can also increase system outages.

The proposed solution is to stop building and shipping single systems or even racks of systems. Instead, the supplier ships macro-modules consisting of a thousand or more systems. Each module is built in a standard 20-foot shipping container, configured, burned in, and delivered as a fully operational unit with full power and networking in a ready-to-run, no-service-required package. All that needs to be done upon delivery is to provide power, networking, and chilled water.

The tradeoff with a “no-service-required” package is that parts are not replaced as they fail. The modules are self-contained, with enough redundancy that surviving systems continue to support the load as parts fail. The components are never serviced, and the entire module slowly degrades over time as more and more systems suffer non-recoverable hardware errors. Such a module requires software applications to implement enough redundancy that individual node failures don’t reduce overall service availability (my initial thoughts on this subject are in the Google Cluster summary). Since the model of running on commodity-class PCs and tolerating hardware failures in software is already gaining adoption, these modules do not pose significant additional challenges in that regard.
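To make the "slowly degrades over time" idea concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a constant annual failure rate and independent failures, neither of which comes from the paper, and the function names and numbers are hypothetical, chosen only for illustration. It estimates how long a sealed module can carry its load before the expected number of surviving systems drops below what the service needs.

```python
import math


def surviving_fraction(annual_failure_rate: float, years: float) -> float:
    """Expected fraction of systems still working after `years` with no repairs.

    Assumes each system fails independently at a constant annual rate
    (an illustrative assumption, not a figure from the paper).
    """
    return (1.0 - annual_failure_rate) ** years


def service_life_years(total_systems: int,
                       required_systems: int,
                       annual_failure_rate: float) -> float:
    """Years until expected survivors fall to `required_systems`.

    Solves total * (1 - rate)^t = required for t.
    """
    if not 0.0 < annual_failure_rate < 1.0:
        raise ValueError("annual_failure_rate must be between 0 and 1")
    if not 0 < required_systems <= total_systems:
        raise ValueError("required_systems must be in (0, total_systems]")
    return (math.log(required_systems / total_systems)
            / math.log(1.0 - annual_failure_rate))


if __name__ == "__main__":
    # Hypothetical module: 1,000 systems, of which 800 are needed for the load,
    # with an assumed 5% of systems failing per year.
    life = service_life_years(1000, 800, 0.05)
    print(f"Estimated service life: {life:.1f} years")
    print(f"Expected survivors after 3 years: "
          f"{1000 * surviving_fraction(0.05, 3):.0f}")
```

The point of the sketch is that the spare capacity built into a module is what buys its service life: the more headroom it ships with, or the lower the failure rate, the longer it can run untouched before the whole container is swapped out.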

Such work by Microsoft, Rackable Systems, Sun Microsystems, and others will likely become even more important over the next 10 years to companies that deliver services over the web and want control of their data center clusters, with the ability to scale up quickly while still minimizing maintenance costs.
