Tuesday, January 27, 2009

Web Search for a Planet: The Google Cluster Architecture

Google is attempting to solve the problem of how to deliver its bread-and-butter service, web search: each query must be answered quickly while keeping the cost per query low.

Google’s main idea is to use clusters of thousands of commodity-class PCs and to tolerate hardware failures in software. Such an architecture delivers superior throughput at a fraction of the cost of a system built from fewer, more expensive, high-end servers. As a result, Google can afford to use more computational resources per search query.
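
As a rough illustration of what "tolerating hardware failures in software" can look like, here is a minimal Python sketch in which a query is simply retried against another replica of the same index shard when a machine fails; the replica names and the query_replica() helper are hypothetical stand-ins, not Google's actual code.

import random

# Hypothetical sketch: handle machine failures in software by retrying a query
# against replicas of the same index shard, rather than relying on
# fault-tolerant hardware. Replica names and query_replica() are invented.

REPLICAS = ["shard0-a", "shard0-b", "shard0-c"]  # machines serving the same shard

class MachineDown(Exception):
    pass

def query_replica(replica, query):
    # Stand-in for an RPC to one machine; it fails randomly to mimic
    # commodity hardware that is expected to break now and then.
    if random.random() < 0.2:
        raise MachineDown(replica)
    return [f"{replica}: doc matching {query!r}"]

def query_shard(query):
    # Try replicas in a random order; a dead machine only costs one retry.
    for replica in random.sample(REPLICAS, len(REPLICAS)):
        try:
            return query_replica(replica, query)
        except MachineDown:
            continue  # failure handled in software: fall through to the next replica
    raise RuntimeError("all replicas of this shard are down")

print(query_shard("cluster architecture"))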

Google also aggressively exploits the large amount of inherent parallelism in web search. For example, Google transforms the lookup of matching documents in one large index into many lookups over a set of smaller index shards, followed by a relatively inexpensive merging step. By parallelizing the search over many machines, Google divides the total computation across more CPUs and disks and reduces the average latency needed to answer a query. As a result, Google purchases the CPUs that currently give the best performance per unit price, not the CPUs that give the best absolute performance. Each index shard is also served by a pool of replicated machines, and this replication addresses the inherent unreliability of each machine: any failure merely degrades the overall performance by the fraction of the machines that have failed.
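
To make the shard-and-merge idea concrete, here is a small Python sketch that searches several shards in parallel and combines their top results in a cheap final merge; the in-memory shards and their scores are toy data invented for illustration, not a real index.

from concurrent.futures import ThreadPoolExecutor
import heapq

# Hypothetical sketch of shard-and-merge: the big index is split into small
# shards, each shard is searched independently (here in threads, in the real
# system on separate machines), and the per-shard hit lists are merged.

SHARDS = [
    {"doc1": 0.9, "doc2": 0.4},
    {"doc3": 0.7, "doc4": 0.1},
    {"doc5": 0.8},
]

def search_shard(shard, query, k):
    # Each shard returns its own top-k (score, doc) pairs; the toy scores
    # ignore the actual query text.
    hits = [(score, doc) for doc, score in shard.items()]
    return heapq.nlargest(k, hits)

def search(query, k=3):
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        per_shard = list(pool.map(lambda s: search_shard(s, query, k), SHARDS))
    # Inexpensive merge step: combine the small per-shard result lists.
    return heapq.nlargest(k, (hit for hits in per_shard for hit in hits))

print(search("example query"))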

Google’s main assumption is that accesses to the index and the other data structures involved in answering a query are read-only, which avoids the consistency issues that arise with a general-purpose database. While this assumption holds for web search, many applications require updates to their data. Separating the computation task from the storage task and using a completely distributed storage system may be a solution, assuming we are willing to accept eventual consistency among the replicas of our data.
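
As a sketch of what accepting eventual consistency could mean, the toy Python example below acknowledges a write after updating a single replica and propagates it to the others in the background, so a reader may briefly see stale data; the Replica class is invented for illustration and does not correspond to any particular storage system.

import threading, time

# Hypothetical sketch of eventual consistency: a write is acknowledged after
# updating one replica and copied to the others asynchronously, so reads from
# other replicas may be stale for a short time.

class Replica:
    def __init__(self):
        self.data = {}

REPLICAS = [Replica(), Replica(), Replica()]

def write(key, value):
    REPLICAS[0].data[key] = value          # acknowledge after the first replica
    def propagate():
        time.sleep(0.1)                    # simulated replication lag
        for r in REPLICAS[1:]:
            r.data[key] = value
    threading.Thread(target=propagate).start()

def read(key, replica_index):
    return REPLICAS[replica_index].data.get(key)  # may return stale data

write("doc42", "new contents")
print(read("doc42", 2))   # likely None right after the write
time.sleep(0.2)
print(read("doc42", 2))   # eventually "new contents"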

This paper was written five years ago, but the problem it addresses is still relevant. Today the problem is even more prevalent (Google alone is estimated to run 500,000 systems in 30 data centers worldwide), and it will continue to grow as more companies build large data centers. A variety of companies now focus on providing fault tolerance through software, because even the most fault-tolerant hardware is not infallible and commodity hardware can offer a more cost-effective solution.
