Facebook must handle large amounts of data and do data analysis on that data and how it is used. Data analysis is hard to do on data for the site because users want to pull the data and pulling data for analysis would slow down the site.
This is a general talk about data analysis involving very large amounts of data.
Facebook has built numerous tools to analyze data involving statistical tools, visual graphs, varying timescales, etc. Building these tools was necessary to automate the analysis, so a small number of people could do the actual analysis.
Large scale log file analysis is easier than doing analysis on a database that is updated frequently. You do not need to keep historical info in the database because it is never retrieved by the site.
Facebook is looking at distributed databases. Commodity hardware is used for the data center.
Data analysis must provide answers to questions from finance, marketing, etc. Don't collect data without a purpose; the amount of data collected can become overwhelming. Focus on what you can learn from your data.
This presentation gave a lot of general principles for data analysis. This will become more important in the coming years and more companies need to deal with such large amounts of data and attempt to analyze the data and learn something from it.
Wednesday, January 28, 2009
An Architecture for Modular Data Centers
Not everyone is as big as Google with trained teams running around managing hardware failures in their systems. Expanding capacity can be difficult, especially if one does not have sufficient trained personnel. Even with an up and running data center, on-site hardware service is expensive in that skilled service personnel must be available at each data center. Staffing a data center with full time service technicians is seldom cost-effective unless the facility is very large. Most services contract this work out and can spend 25% of the system’s price over a 3-year service life. Also, human administrative error can increase system outages.
The proposed solution is to no longer build and ship single systems or even racks of systems. Instead, the supplier ships macro-modules consisting of a thousand or more systems. Each module is built in a 20-foot standard shipping container, configured, burned in, and delivered as a fully operational module with full power and networking in a ready to run no-service-required package. All that needs to be done upon delivery is provide power, networking, and chilled water.
The tradeoff with a “no-service-required” package is that parts are not replaced as they fail. The modules are self-contained with enough redundancy that, as parts fail, surviving systems continue to support the load. The components are never serviced and the entire module just slowly degrades over time as more and more systems suffer non-recoverable hardware errors. Such a module requires that software applications implement enough redundancy so that individual node failures don’t negatively impact overall service availability (my initial thoughts on this subject are in the Google Cluster summary). Since the model of using commodity class PCs and tolerating hardware failures in software is already growing, these modules do not pose additional significant challenges in that regard.
Such work by Microsoft, Rackable Systems, Sun Microsystems, and others will likely become even more important in 10 years to companies delivering services over the web that want to have control of their datacenter clusters with the ability to scale up quickly, while still minimizing maintenance costs.
The proposed solution is to no longer build and ship single systems or even racks of systems. Instead, the supplier ships macro-modules consisting of a thousand or more systems. Each module is built in a 20-foot standard shipping container, configured, burned in, and delivered as a fully operational module with full power and networking in a ready to run no-service-required package. All that needs to be done upon delivery is provide power, networking, and chilled water.
The tradeoff with a “no-service-required” package is that parts are not replaced as they fail. The modules are self-contained with enough redundancy that, as parts fail, surviving systems continue to support the load. The components are never serviced and the entire module just slowly degrades over time as more and more systems suffer non-recoverable hardware errors. Such a module requires that software applications implement enough redundancy so that individual node failures don’t negatively impact overall service availability (my initial thoughts on this subject are in the Google Cluster summary). Since the model of using commodity class PCs and tolerating hardware failures in software is already growing, these modules do not pose additional significant challenges in that regard.
Such work by Microsoft, Rackable Systems, Sun Microsystems, and others will likely become even more important in 10 years to companies delivering services over the web that want to have control of their datacenter clusters with the ability to scale up quickly, while still minimizing maintenance costs.
Tuesday, January 27, 2009
Web Search for a Planet: The Google Cluster Architecture
Google is attempting to solve the problem of how to deliver its bread and butter service: web search. Google must deliver a search quickly while still maintaining a low cost.
Google’s main idea is to use clusters of thousands of commodity class PCs and tolerating hardware failures in software. Such an architecture delivers superior throughput performance at a fraction of the cost of a system built from fewer, more expensive, high end servers. As a result, Google can afford to use more computational resources per search query.
Google also aggressively exploits the very large amounts of inherent parallelism in web search. For example, Google transforms the lookup of matching documents in a large index into many lookups for matching documents in a set of smaller indices, followed by a relatively inexpensive merging step. By parallelizing the search over many machines, Google reduces the average latency necessary to answer a query, dividing the total computation across more CPUs and disks. As a result, Google purchases the CPU that currently gives the best performance per unit price, not the CPUs that give the best absolute performance. This replication also addresses the inherent unreliability of each machine. Any failures will merely degrade the overall performance by the faction of the machines that have failed.
Google’s main assumption in a search is that accesses to the index and other data structures involved in answering a query are read-only. This avoids the consistency issues that arise in using a general-purpose database. While this is valid for web search, many applications will require updates to the data. Separating the computation task from the storage task and using a completely distributed storage system may be a solution to the problem, assuming we are willing to accept eventual consistency for the replicas of our data.
This paper was written five years ago, but the problem addressed by the paper is still valid. Today, the problem is even more prevalent (Google alone is estimated to have 500,000 systems in 30 data centers world-wide) and will continue in the future with the rise of many large data centers from different companies. There is an increased focus by a variety of companies on providing fault tolerance through software because even the most fault tolerant hardware is not infallible and using commodity hardware can provide a more cost effective solution.
Google’s main idea is to use clusters of thousands of commodity class PCs and tolerating hardware failures in software. Such an architecture delivers superior throughput performance at a fraction of the cost of a system built from fewer, more expensive, high end servers. As a result, Google can afford to use more computational resources per search query.
Google also aggressively exploits the very large amounts of inherent parallelism in web search. For example, Google transforms the lookup of matching documents in a large index into many lookups for matching documents in a set of smaller indices, followed by a relatively inexpensive merging step. By parallelizing the search over many machines, Google reduces the average latency necessary to answer a query, dividing the total computation across more CPUs and disks. As a result, Google purchases the CPU that currently gives the best performance per unit price, not the CPUs that give the best absolute performance. This replication also addresses the inherent unreliability of each machine. Any failures will merely degrade the overall performance by the faction of the machines that have failed.
Google’s main assumption in a search is that accesses to the index and other data structures involved in answering a query are read-only. This avoids the consistency issues that arise in using a general-purpose database. While this is valid for web search, many applications will require updates to the data. Separating the computation task from the storage task and using a completely distributed storage system may be a solution to the problem, assuming we are willing to accept eventual consistency for the replicas of our data.
This paper was written five years ago, but the problem addressed by the paper is still valid. Today, the problem is even more prevalent (Google alone is estimated to have 500,000 systems in 30 data centers world-wide) and will continue in the future with the rise of many large data centers from different companies. There is an increased focus by a variety of companies on providing fault tolerance through software because even the most fault tolerant hardware is not infallible and using commodity hardware can provide a more cost effective solution.
Subscribe to:
Posts (Atom)