Monday, March 30, 2009

DTrace

Problem: Dynamic instrumentation of production systems to understand performance problems.

Solution: DTrace provides on the order of 30,000 instrumentation points in the kernel alone. DTrace provides an API to instrument both kernel and user-level behavior, letting the same tracing program access both. The C-like control language (D) for tracing scripts is a good contribution. DTrace also provides easy access to kernel variables while keeping tracing code "safe".

DTrace’s primary difference from previous work is that it combines existing techniques into a single, unified framework.

Future Influence: I would hope that other system builders develop their own tracing frameworks that incorporate DTrace's key contributions. This would be especially useful in large-scale Linux deployments.

X-Trace

Problem: Current network diagnostic tools focus on only one particular protocol layer, and the insights they provide about the application cannot be shared among the user, the service operators, and the network operators.

Solution: The authors propose X-Trace, a cross-layer, cross-application tracing framework designed to reconstruct the user’s task tree.
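
To make the idea concrete, here is a minimal Python sketch (my own illustration, not the actual X-Trace implementation) of how per-message metadata could be propagated: every layer and hop carries the same task ID, and new operation IDs are derived with operations loosely modeled on the paper's pushDown/pushNext primitives, so that reports collected along the way can later be joined back into a task tree. All names here are illustrative.

```python
# Minimal sketch (not the real X-Trace library) of metadata propagation.
import uuid

class XTraceMetadata:
    """Task ID shared by the whole request; operation IDs form the tree edges."""
    def __init__(self, task_id=None, parent_op=None):
        self.task_id = task_id or uuid.uuid4().hex[:8]   # same for every layer/hop
        self.parent_op = parent_op                        # edge back to the caller
        self.op_id = uuid.uuid4().hex[:8]                 # this instrumentation point

    def push_next(self):
        """Propagate to the next node in the same layer (e.g. the next HTTP hop)."""
        return XTraceMetadata(self.task_id, self.op_id)

    def push_down(self):
        """Propagate to the next lower layer (e.g. HTTP -> TCP)."""
        return XTraceMetadata(self.task_id, self.op_id)

def report(layer, md):
    # Each device/layer emits a report keyed by task_id; an offline collector
    # joins reports on (parent_op -> op_id) to rebuild the task tree.
    print(f"task={md.task_id} op={md.op_id} parent={md.parent_op} layer={layer}")

# A toy request crossing two layers and two hops.
http_md = XTraceMetadata()
report("HTTP", http_md)
report("TCP", http_md.push_down())
report("HTTP", http_md.push_next())
```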

While these existing tools are undoubtedly useful, they are typically unable to diagnose subtle interactions between protocols or to provide a comprehensive view of the system’s behavior.

X-Trace has significant limitations: implementing it requires modifications to clients, servers, and network devices, and when X-Trace is only partially deployed the ability to trace the unmodified parts of the network is impaired, sometimes entirely.

The authors analyzed X-Trace in three deployed scenarios: DNS resolution, a three-tiered photo-hosting website, and a service accessed through an overlay network.

Future Impact: Looking beyond a single protocol layer is important given the direction systems are moving in: multilayered, component-based designs. However, the need for modifications in many places, and for full deployment, means that you must either build your own system or be locked into a single provider.

Wednesday, March 18, 2009

Artemis: Hunting for Problems with Artemis

Problem: There is a need for a data collection system for monitoring and analyzing large distributed systems.

Solution: Artemis has the same goal as Chukwa. While Chukwa is built on Hadoop, Artemis (from Microsoft) is focused on analyzing Dryad-based distributed applications. Artemis seems to be fairly far along in terms of log collection, storage, visualization, and analysis, and it collects logs from many different sources within Dryad.

Conclusion: The growth of distributed systems has created a need for new tools to analyze large clusters of networked PCs running large distributed jobs. Systems such as Chukwa and Artemis are the first iterations of such tools. Artemis shows that these tools need access to data collected not only from each machine but also from within the distributed processing jobs themselves. This would be a good evolution for the Chukwa developers to follow: providing data collection from within MapReduce jobs.

Tuesday, March 17, 2009

Chukwa: A large-scale monitoring system

Problem: There is a need for a data collection system for monitoring and analyzing large distributed systems.

Solution: The authors developed Chukwa on top of Hadoop as an implementation of such a data collection system. The goal was to be able to monitor Hadoop clusters of 2000 nodes, outputting 5 to 6 MB of data per second, and to have collected data available for processing within ten minutes.

At the heart of any data collection system is a pipeline to pump data from where it is generated to where it is stored. Unfortunately, HDFS is not designed for the sort of workloads associated with monitoring. HDFS aims to handle large files and high write rates from comparatively small numbers of writers. It is not designed for thousands of concurrent low-rate writers, and millions of small files.

Much of the Chukwa design was driven by the need to reconcile many sporadic data sources with HDFS’s performance characteristics and semantics. Adaptors are the data sources. Rather than have each adaptor write directly to HDFS, data is sent across the network to a collector process that does the HDFS writes. Each collector receives data from several hundred hosts and writes all of this data to a single sink file. Collectors thus drastically reduce the number of HDFS files generated by Chukwa, from one per machine or adaptor per unit time to a handful per cluster. The decision to put collectors between data sources and the data store has other benefits: collectors hide the details of the HDFS file system in use, such as its Hadoop version, from the adaptors.
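
As a rough illustration of this funnel (not actual Chukwa code), the sketch below has many adaptors streaming records over the network to one collector, which appends everything to a single sink file; an ordinary local file and a made-up port number stand in for the HDFS sink.

```python
# Illustrative collector sketch: hundreds of hosts share one sink file
# instead of each host (or adaptor) creating its own HDFS file.
import socketserver

class Collector(socketserver.StreamRequestHandler):
    def handle(self):
        # Each connection is one adaptor on one monitored host.
        for chunk in self.rfile:                        # newline-delimited records
            with open("sink-0001.done", "ab") as sink:  # stand-in for an HDFS sink file
                sink.write(chunk)

if __name__ == "__main__":
    # A few hundred hosts would point their adaptors at this collector.
    with socketserver.TCPServer(("0.0.0.0", 9999), Collector) as server:
        server.serve_forever()
```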

Scribe

Problem: Many large websites have thousands to hundreds of thousands of machines across many geographically dispersed datacenters. These machines need to send log data to a central repository for analysis.

Solution: Scribe is Facebook's scalable logging system. Scribe is a server for aggregating streaming log data, designed to scale to a very large number of nodes and to be robust to network and node failures. There is a Scribe server running on every node in the system, configured to aggregate messages and send them to a central Scribe server (or servers) in larger groups. If the central Scribe server isn't available, the local Scribe server writes the messages to a file on local disk and sends them when the central server recovers. The central Scribe server(s) can write the messages to the files that are their final destination.
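
Here is a rough sketch of that store-and-forward behavior, assuming a hypothetical send_to_central() call in place of the real Thrift RPC; the spool file path and function names are illustrative, not Scribe's actual API.

```python
# Sketch of a local Scribe-like forwarder: try the central server, fall back
# to a local spool file, and drain the spool once the central server recovers.
import os

BUFFER_FILE = "/var/tmp/scribe_buffer.log"   # hypothetical local spool file

def send_to_central(category, messages):
    """Stand-in for the RPC to the central server(s); return False when unreachable."""
    return False  # placeholder for illustration

def log(category, message):
    # Messages carry a category, so the central servers can presort them
    # into separate destination files.
    if not send_to_central(category, [f"{category}\t{message}"]):
        with open(BUFFER_FILE, "a") as f:     # central server down: spool locally
            f.write(f"{category}\t{message}\n")

def flush_buffer():
    """Called periodically; resends spooled messages after the central server recovers."""
    if not os.path.exists(BUFFER_FILE):
        return
    with open(BUFFER_FILE) as f:
        pending = f.readlines()
    if pending and send_to_central("buffered", pending):
        os.remove(BUFFER_FILE)
```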

The next step is to do something useful with the data from the logging system. Identifying log messages with a category effectively presorts the data, allowing it to be logged to separate files based on that category.

Hive is Facebook's data warehouse and the consumer of Scribe's output. Hive makes it easy for business analysts to ask ad hoc questions of terabytes' worth of log file data by abstracting MapReduce into a SQL-like dialect.

Sunday, March 15, 2009

AWS

The Cloud Architectures whitepaper covers the benefits of using a cloud provider and the design principles for building a cloud application.

Business benefits to building applications using Cloud Architectures:
-Almost zero upfront infrastructure investment
-Just-in-time Infrastructure
-Efficient resource utilization
-Usage-based costing: Utility-style pricing allows billing the customer only for the infrastructure that has been used. The customer is not liable for the entire infrastructure that may be in place.
-Potential for shrinking the processing time: Parallelization is one of the best ways to speed up processing.

Applications that could utilize the power of Cloud Architectures: processing pipelines, batch processing systems, websites.

This paper goes over some design principles for building a scalable application for a cloud: use scalable components that are loosely coupled, take advantage of parallelization, requisition and relinquish resources on-demand, and use resilient designs because hardware will fail. The paper uses Amazon's implementation of GrepTheWeb to highlight design lessons and principles.
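
As a toy illustration of the loose-coupling principle (my own sketch, not code from the whitepaper), the two phases below communicate only through queues, which stand in for something like Amazon SQS, so each phase can be scaled out or restarted independently.

```python
# Loosely coupled pipeline phases that only talk to queues, never to each other.
from queue import Queue
from threading import Thread

launch_q, monitor_q = Queue(), Queue()

def launch_phase():
    # Requisition resources on demand for each incoming job, then hand off.
    while True:
        job = launch_q.get()
        if job is None:                 # shutdown signal
            monitor_q.put(None)
            break
        monitor_q.put(f"started:{job}")

def monitor_phase():
    # Watches running jobs; a failure here does not take down the launch phase.
    while True:
        status = monitor_q.get()
        if status is None:
            break
        print("monitoring", status)

Thread(target=launch_phase).start()
Thread(target=monitor_phase).start()
for job in ["grep-request-1", "grep-request-2"]:
    launch_q.put(job)
launch_q.put(None)                      # shut the pipeline down
```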

Google Apps Security

Safekeeping of data for tens of millions of users can be a serious challenge for large, complex systems. The argument for using a large cloud provider is that it has more resources to devote to security and a broader range of security experience, especially compared to small or medium organizations.

This paper looks at Google’s security strategy and gives an overview of what Google considers the main focus areas and design principles. The paper tries to convince prospective customers that its cloud infrastructure is safe.

Google focuses on several aspects of security that are critical:
• Organizational and Operational Security – Policies and procedures to ensure security at every phase of design, deployment and ongoing operations.
• Data Security – Ensuring customer data is stored in secure facilities, on secure servers, and within secure applications.
• Threat Evasion – Protecting users and their information from malicious attacks and would-be hackers.
• Safe Access – Ensuring that only authorized users can access data, and the access channel is secure.
• Data Privacy – Ensuring that confidential information is kept private and confidential.

Wednesday, March 4, 2009

Dryad and DryadLINQ

Problem: Companies like Microsoft, Google, and Yahoo have large data sets that can only be effectively processed on large clusters of PCs. The queries being applied to these data sets are growing more complex, and a generalized version of MapReduce is needed.

Solution: Microsoft created Dryad, a general-purpose distributed execution engine, and DryadLINQ, a language layer on top of it, as a more generalized version of MapReduce along the lines of Yahoo's Pig Latin.

Future Influence: It is unclear if Dryad and anything built upon it will gain acceptance outside of Microsoft. However, the need for a simplified method of running complex queries on large data sets on clusters of PCs will remain and there will be many efforts to develop solutions.

MapReduce: Simplified Data Processing on Large Clusters

Problem: Companies like Google have large data sets that can only be effectively processed on large clusters of PCs. The work must be broken into parts that can be processed in a reasonable time and the result must then be reassembled.

Solution: Google developed an abstraction called MapReduce that allows them to express the simple computations they were trying to perform but hides the details of parallelization, fault-tolerance, data distribution and load balancing.

The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines:
1. The MapReduce library in the user program first splits the input files into M pieces and then starts up many copies of the program on a cluster of machines.
2. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.
3. A worker who is assigned a map task reads the contents of the corresponding input split. It passes the input data to the user-defined Map function. The intermediate result produced by the Map function is buffered in memory.
4. Periodically, the buffered pairs are written to local disk, and their locations on disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.
5. When a reduce worker is notified by the master about these locations, it reads the buffered data from the local disks of the map workers. The reduce worker reads all intermediate data and sorts it. The sorting is needed because typically many different keys map to the same reduce task.
6. The reduce worker iterates over the sorted intermediate data and, for each unique key, passes the key and the corresponding set of values to the user’s Reduce function. The output of the Reduce function is appended to a final output file.
7. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.
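
The control flow above can be compressed into a toy, single-process Python sketch (word count, with hash partitioning into R reduce partitions); it mirrors the steps but omits all of the distribution, fault tolerance, and disk handling.

```python
# Toy MapReduce: split input into M map tasks, partition intermediate pairs
# by hash(key) % R, sort each partition so equal keys are adjacent, then reduce.
from collections import defaultdict
from itertools import groupby

def map_fn(_, text):                      # user-defined Map: word count
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):              # user-defined Reduce
    yield word, sum(counts)

def mapreduce(inputs, map_fn, reduce_fn, R=3):
    partitions = defaultdict(list)
    for split_id, split in enumerate(inputs):             # M input splits
        for key, value in map_fn(split_id, split):
            partitions[hash(key) % R].append((key, value))
    output = []
    for r in range(R):                                     # R reduce tasks
        for key, group in groupby(sorted(partitions[r]), key=lambda kv: kv[0]):
            output.extend(reduce_fn(key, (v for _, v in group)))
    return output

print(mapreduce(["the quick brown fox", "the lazy dog"], map_fn, reduce_fn))
```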

Locality of computation and data is a common theme repeated by many industry researchers. Network bandwidth is a relatively scarce resource in this computing environment. The MapReduce master takes the location information of the input files into account and attempts to schedule a map task on a machine that contains a replica of the corresponding input data. Failing that, it attempts to schedule a map task near a replica of that task’s input data. When running large MapReduce operations on a significant fraction of the workers in a cluster, most input data is read locally and consumes no network bandwidth.

The authors also describe many operational insights that might not be initially considered. One of the common causes that lengthens the total time taken for a MapReduce operation is a “straggler”: a machine that takes an unusually long time to complete one of the last few map or reduce tasks in the computation. When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks. The task is marked as completed whenever either the primary or the backup execution completes. This significantly reduces the time to complete large MapReduce operations at a cost of only a few percent of additional computational resources. For example, a sort program takes 44% longer to complete when the backup task mechanism is disabled.
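
A small sketch of the backup-task idea follows; the data structures and the 95% completion threshold are illustrative guesses on my part, not Google's implementation.

```python
# Speculative backup execution: near the end of a job, duplicate each
# still-in-progress task on an idle worker and accept whichever copy finishes first.
def maybe_schedule_backups(tasks, idle_workers, completion_threshold=0.95):
    done = sum(1 for t in tasks if t["state"] == "completed")
    if done / len(tasks) < completion_threshold:
        return                                     # not close enough to completion yet
    stragglers = [t for t in tasks if t["state"] == "in_progress" and not t["backup"]]
    for task, worker in zip(stragglers, idle_workers):
        task["backup"] = worker                    # duplicate execution on an idle worker

def mark_completed(task):
    # Whichever of the primary or backup execution finishes first wins;
    # the later result is simply discarded.
    if task["state"] != "completed":
        task["state"] = "completed"

tasks = [{"state": "completed", "backup": None} for _ in range(19)]
tasks.append({"state": "in_progress", "backup": None})
maybe_schedule_backups(tasks, idle_workers=["worker-42"])
print(tasks[-1]["backup"])                         # -> worker-42
```
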
Skipping Bad Records: sometimes it is acceptable to ignore a few records, for example when doing statistical analysis on a large data set.

To help facilitate debugging, profiling, and small-scale testing, Google has developed an alternative implementation of the MapReduce library that sequentially executes all of the work for a MapReduce operation on the local machine.

MapReduce has been so successful because it makes it possible to write a simple program and run it efficiently on a thousand machines in the course of half an hour, greatly speeding up the development and prototyping cycle. Furthermore, it allows programmers who have no experience with distributed or parallel systems to exploit large amounts of resources easily. For example, the size of one phase of the computation dropped from approximately 3800 lines of C++ code to approximately 700 lines when expressed using MapReduce. Also, most of the problems caused by machine failures, slow machines, and networking are dealt with automatically by the MapReduce library without operator intervention.

Future Influence: MapReduce has been highly influential and has been followed by many other companies that need to process large data sets in parallel on large clusters of commodity PCs. The simple Map-then-Reduce model is being generalized for other, more complex parallel data processing (Pig Latin from Yahoo, Dryad from Microsoft). This continuing research will help simplify large-scale parallel data processing in the years to come by breaking the processing into chunks that can be effectively parallelized on large clusters of PCs.

Pig Latin: A Not-So-Foreign Language for Data Processing

Pig Latin is a programming language developed at Yahoo Research which allows parallel-processing functions to be written in a functional, object/variable programming style that is familiar to most programmers. Typical Pig programs have 1/20 the lines of code of comparable Hadoop/MapReduce code and 1/16 the development time, with only about a 1.5x performance hit. Such a programming language is useful for rolling out new applications and features quickly; if a feature becomes popular, it can be optimized. Pig is still young and optimizations will likely occur.

We had a good talk by Christopher Olston of Yahoo Research on this subject. The most valuable lessons from this talk and the other industry speakers have been the lessons learned from working on large-scale implementations. One important lesson is that such projects need to be broken up into layers that each perform a simple function well. Another important lesson comes from the types of data sets involved: while much work centers on processing large data sets, a great deal of processing occurs on a combination of large and small data sets. This has significant implications for design decisions in the distributed storage layer (replication and computation locality).