Wednesday, March 4, 2009

MapReduce: Simplified Data Processing on Large Clusters

Problem: Companies like Google have large data sets that can only be effectively processed on large clusters of PCs. The work must be broken into parts that can be processed in a reasonable time and the result must then be reassembled.

Solution: Google developed an abstraction called MapReduce that allows them to express the simple computations they were trying to perform but hides the details of parallelization, fault-tolerance, data distribution and load balancing.

The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines:
1. The MapReduce library in the user program first splits the input files into M pieces and then starts up many copies of the program on a cluster of machines.
2. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.
3. A worker who is assigned a map task reads the contents of the corresponding input split. It passes the input data to the user-defined Map function. The intermediate result produced by the Map function is buffered in memory.
4. Periodically, the buffered pairs are written to local disk and are passed back to the master, who is responsible for forwarding these locations to the reduce workers.
5. When a reduce worker is notified by the master about these locations, it reads the buffered data from the local disks of the map workers. The reduce worker reads all intermediate data and sorts it. The sorting is needed because typically many different keys map to the same reduce task.
6. The reduce worker iterates over the sorted intermediate data and passes the data to the user’s Reduce function. The output of the Reduce function is appended to a final output file.
7. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.

Locality of computation and data is a common theme repeated by many industry researchers. Network bandwidth is a relatively scarce resource in our computing environment. The MapReduce master takes the location information of the input files into account and attempts to schedule a map task on a machine that contains a replica of the corresponding input data. Failing that, it attempts to schedule a map task near a replica of that task’s input data. When running large MapReduce operations on a significant fraction of the workers in a cluster, most input data is read locally and consumes no network bandwidth.

The authors also describe many operational insights that might not be initially considered. One of the common causes that lengthens the total time taken for a MapReduce operation is a “straggler”: a machine that takes an unusually long time to complete one of the last few map or reduce tasks in the computation. When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks. The task is marked as completed whenever either the primary or the backup execution completes. This significantly reduces the time to complete large MapReduce operations at a cost of only a few percent to computational resources. For example, a sort program takes 44% longer to complete when the backup task mechanism is disabled.
Skipping Bad Records: sometimes it is acceptable to ignore a few records, for example when doing statistical analysis on a large data set.
To help facilitate debugging, profiling, and small-scale testing, Google has developed an alternative implementation of the MapReduce library that sequentially executes all of the work for a MapReduce operation on the local machine.

MapReduce has been so successful because it makes it possible to write a simple program and run it efficiently on a thousand machines in the course of half an hour, greatly speeding up the development and prototyping cycle. Furthermore, it allows programmers who have no experience with distributed or parallel systems to exploit large amounts of resources easily. For example, the size of one phase of the computation dropped from approximately 3800 lines of C++ code to approximately 700 lines when expressed using MapReduce. Also, most of the problems caused by machine failures, slow machines, and networking are dealt with automatically by the MapReduce library without operator intervention.

Future Influence: MapReduce has been highly influential and followed by many other companies with a need to process large data sets in a parallel way on large clusters of commodity PCs. The simply Map then Reduce function is being generalized for other more complex parallel data processing (Pig Latin from Yahoo, Dryad from Microsoft). This continuing research will help simplify large scale parallel data processing in the years to come by breaking up the processing into chunks that can be effectively parallelized on large clusters of PCs.

No comments:

Post a Comment