Wednesday, March 4, 2009

Pig Latin: A Not-So-Foreign Language for Data Processing

Pig Latin is a programming language developed at Yahoo Research for writing parallel data-processing programs in a functional, variable-based style that is familiar to most programmers. Typical Pig programs have 1/20 the lines of code of the equivalent Hadoop/MapReduce code and take 1/16 the development time, with only a 1.5x performance hit. Such a language would be useful for rolling out new applications and features quickly: if a feature becomes popular, it can then be optimized. Pig is still young, and optimizations will likely follow.
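
To give a feel for that step-by-step, variable-per-step style, here is a minimal sketch in plain Python (not Pig Latin itself, and not code from the talk). The records, field names, and the 0.2 threshold are all made up for illustration; each step produces a named intermediate result, which is the part that resembles a Pig program.

```python
from collections import defaultdict

# Hypothetical input: (url, category, pagerank) records.
urls = [
    ("a.com", "news", 0.9),
    ("b.com", "news", 0.5),
    ("c.com", "sports", 0.1),
    ("d.com", "sports", 0.7),
    ("e.com", "blogs", 0.1),
]

# Step 1: filter, keeping only records above an (assumed) pagerank threshold.
good_urls = [record for record in urls if record[2] > 0.2]

# Step 2: group the surviving pagerank values by category.
groups = defaultdict(list)
for url, category, pagerank in good_urls:
    groups[category].append(pagerank)

# Step 3: aggregate each group into an average pagerank per category.
avg_pagerank = {cat: sum(vals) / len(vals) for cat, vals in groups.items()}
```

Written directly against MapReduce, the same filter, group, and aggregate steps would have to be packed into map and reduce functions by hand, which is presumably where much of the reported difference in code size comes from.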
We had a good talk from Christopher Olston of Yahoo Research on this subject. The most valuable takeaways from this talk, and from the other industry speakers, have been the lessons learned from working on large-scale implementations. One important lesson is that such projects need to be broken up into layers that each perform a simple function well. Another comes from the types of data sets involved: while much work centers on processing large data sets, a great deal of processing actually operates on a combination of large and small data sets. This has significant implications for the design of the distributed storage layer (replication and computation locality).
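
One way to see why the large-plus-small case matters for replication and locality is a join where the small data set is copied to every worker, so the large data set never has to move. The sketch below is only an illustration under that assumption, not something presented in the talk; `replicated_join` and the sample data are hypothetical, and the "partitions" are plain lists standing in for blocks of a distributed file.

```python
def replicated_join(large_partitions, small_table):
    """Join each partition of a large data set against a small table.

    The small table is turned into an in-memory lookup that every worker
    would hold a full copy of, so each partition can be processed where
    it is stored.
    """
    lookup = dict(small_table)  # the replicated copy of the small data set
    joined = []
    for partition in large_partitions:  # each partition = one worker's local data
        for key, value in partition:
            if key in lookup:
                joined.append((key, value, lookup[key]))
    return joined


# Hypothetical data: page views (large) joined with country names (small).
views = [
    [("us", "a.com"), ("de", "b.com")],   # partition 1
    [("us", "c.com"), ("fr", "d.com")],   # partition 2
]
countries = [("us", "United States"), ("de", "Germany")]

result = replicated_join(views, countries)
```

Records whose key is missing from the small table (here `"fr"`) are simply dropped, as in an inner join.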
