Wednesday, February 25, 2009

Dynamo: Amazon’s Highly Available Key-value Store

Problem: It is difficult to add redundancy and parallelism to a relational database, so it becomes a single point of failure. Also, as a relational database grows, it becomes a bottleneck for the entire system. The very survival of Amazon’s business depends on common, bulletproof, flexible, and scalable software systems.

Solution: Unlike a relational database, Dynamo is a distributed key-value store. Any node in the system can accept a put or get request for any key. Dynamo is eventually consistent: when one node updates object A, the change still has to propagate to the other replicas, so a reader may briefly see an older version.

Physical nodes are treated as identical and organized into a ring using consistent hashing (in the spirit of Chord, though Dynamo is zero-hop: every node knows the full ring). The partitioning mechanism automatically adapts as nodes enter and leave the system. Every object is asynchronously replicated to N nodes. Updates occur asynchronously and may leave multiple copies of an object in slightly different states; these discrepancies are reconciled over time, ensuring eventual consistency.
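Here is a minimal sketch of the ring idea, assuming a simple hash onto the ring and replication to the next N distinct nodes (the node names and hash function are my own choices, and Dynamo's virtual nodes are omitted for brevity):

```python
import hashlib
from bisect import bisect_right

# Hypothetical node names; Dynamo additionally maps each physical node to many
# "virtual nodes" on the ring, which is omitted here for brevity.
NODES = ["node-a", "node-b", "node-c", "node-d", "node-e"]
N = 3  # replicas per key

def ring_position(value: str) -> int:
    """Hash a node name or key onto the ring (MD5 is just a convenient choice)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

RING = sorted((ring_position(name), name) for name in NODES)
POSITIONS = [pos for pos, _ in RING]

def preference_list(key: str, n: int = N) -> list[str]:
    """The n distinct nodes responsible for `key`: the first node clockwise
    from the key's position, followed by its successors on the ring."""
    start = bisect_right(POSITIONS, ring_position(key))
    nodes = []
    i = start
    while len(nodes) < n:
        node = RING[i % len(RING)][1]
        if node not in nodes:
            nodes.append(node)
        i += 1
    return nodes

print(preference_list("cart:12345"))  # e.g. ['node-d', 'node-e', 'node-a']
```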

Dynamo can be tuned with just a handful of parameters to achieve different technical goals that in turn support different business requirements. Dynamo is a storage service in a box, driven by an SLA. Different applications at Amazon use different configurations of Dynamo depending on their tolerance for delays or data discrepancies. Each data object is replicated across multiple nodes, with conflicting versions reconciled either semantically by the application (using vector clocks to detect the conflict) or by simple timestamp-based last-write-wins.
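The handful of knobs is essentially (N, R, W): the replica count and the read and write quorum sizes. A rough sketch of how per-application profiles might be expressed (the class and the specific profiles are my own illustration, not code from the paper):

```python
from dataclasses import dataclass

@dataclass
class DynamoConfig:
    n: int  # replicas per key
    r: int  # replicas that must answer before a read returns
    w: int  # replicas that must acknowledge before a write returns

    def quorums_overlap(self) -> bool:
        # R + W > N means every read quorum intersects every write quorum,
        # so a read sees at least one up-to-date replica; choosing smaller
        # R or W trades that away for lower latency / higher availability.
        return self.r + self.w > self.n

# Hypothetical per-application profiles with different trade-offs.
shopping_cart   = DynamoConfig(n=3, r=2, w=2)  # the common (3, 2, 2) setting
fast_read_heavy = DynamoConfig(n=3, r=1, w=3)  # reads never wait on a quorum

print(shopping_cart.quorums_overlap())  # True
```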

Future Influence: Dynamo is a large scale implementation of a distributed storage system. The experiences from all of the companies that have built such systems out of necessity (Google, Amazon, Yahoo, etc) will prove valuable for the development of future systems.
I liked the related work section.

Tuesday, February 10, 2009

Bigtable

Chubby

GFS: The Google File System

Problem: Google required a scalable distributed file system for large distributed data-intensive applications. This file system had to provide fault tolerance while running on inexpensive commodity hardware, and deliver high aggregate performance to a large number of clients.

Solution/Analysis: GFS presents clients with an interface of familiar file operations: open, create, read, write, and close. The team also included two specialized operations: record append and snapshot. Record append allows many clients to add data to an existing file concurrently without overwriting each other's writes. Snapshot creates a quick, low-cost copy of a file or directory tree.
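A rough sketch of what that client-facing surface might look like (the names and signatures here are illustrative; the paper does not give the exact API):

```python
class GFSClient:
    """Illustrative outline of the client operations described in the paper."""

    def create(self, path: str) -> None: ...
    def open(self, path: str) -> "FileHandle": ...
    def read(self, handle: "FileHandle", offset: int, length: int) -> bytes: ...
    def write(self, handle: "FileHandle", offset: int, data: bytes) -> None: ...
    def close(self, handle: "FileHandle") -> None: ...

    def record_append(self, handle: "FileHandle", data: bytes) -> int:
        """Append atomically at an offset GFS chooses; returns that offset,
        so many clients can append concurrently without coordination."""
        ...

    def snapshot(self, source: str, target: str) -> None:
        """Cheap (copy-on-write) copy of a file or directory tree."""
        ...
```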

Google organized the GFS into clusters of computers, built from many inexpensive commodity components that often fail. Within GFS clusters there are three kinds of entities: clients, master servers and chunkservers. A "client" refers to any entity that makes a file request. Requests can range from retrieving and manipulating existing files to creating new files on the system. Clients can be other computers or computer applications.

Master server: The master server acts as the coordinator for the cluster. The master's duties include maintaining an operation log, which keeps track of the activities of the master's cluster. The operation log helps keep service interruptions to a minimum -- if the master server crashes, a replacement server that has monitored the operation log can take its place. The master server also keeps track of metadata, which is the information that describes chunks. The metadata tells the master server to which files the chunks belong and where they fit within the overall file. Upon startup, the master polls all the chunkservers in its cluster. The chunkservers respond by telling the master server the contents of their inventories. From that moment on, the master server keeps track of the location of chunks within the cluster.
There is only one active master server per cluster at any one time (though each cluster keeps multiple copies of the master's state in case of a hardware failure). GFS avoids making the master a bottleneck by minimizing the messages the master sends and receives; the master never handles file data at all, which is left entirely to the chunkservers.
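A toy model of the master's bookkeeping, as I understand it from the paper (the data structures and function names are my own illustration):

```python
# Toy model of the master's in-memory metadata (illustrative only).
# file path -> ordered list of chunk handles
file_to_chunks: dict[str, list[int]] = {}
# chunk handle -> chunkservers currently holding a replica
chunk_locations: dict[int, set[str]] = {}

def handle_chunkserver_report(server: str, handles: list[int]) -> None:
    """At startup (and later via heartbeats) each chunkserver reports its
    inventory; the master rebuilds chunk locations from these reports rather
    than persisting them."""
    for handle in handles:
        chunk_locations.setdefault(handle, set()).add(server)

def locate(path: str, chunk_index: int) -> tuple[int, set[str]]:
    """Answer a client's question: which handle is chunk #i of this file, and
    which chunkservers hold it? (No file data flows through the master.)"""
    handle = file_to_chunks[path][chunk_index]
    return handle, chunk_locations.get(handle, set())
```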

Chunkserver: GFS files tend to be very large, usually in the multi-gigabyte (GB) range. GFS breaks files up into manageable chunks of 64 megabytes (MB) each. Every chunk receives a unique 64-bit identification number called a chunk handle. Chunkservers are responsible for storing the 64-MB file chunks. Chunkservers send requested chunks directly to the client. GFS copies every chunk multiple times and stores it on different chunkservers. Each copy is called a replica. By default, the GFS makes three replicas per chunk, but users can change the setting and make more or fewer replicas if desired. I wonder if GFS has been updated with some policy for dynamic replication based on need.

File Requests: GFS separates a chunk's replicas into two categories: a primary replica, which holds a lease from the master and decides the order in which mutations are applied, and secondary replicas. Beyond picking that ordering point, there does not seem to be a fundamental need for this differentiation.

Read request: The client translates the file offset into a chunk index and asks the master server where that chunk can be found. The master responds with the chunk handle and the locations of its replicas, and the client reads the data directly from one of the chunkservers, typically the nearest. This is reminiscent of how the old Napster worked: ask a directory (the master server) where the data lives, then fetch it from a peer.
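A sketch of that read path, assuming the 64 MB chunk size from the paper and reusing the toy locate() from the master sketch above (the chunkserver RPC is a stub):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # GFS's fixed 64 MB chunk size

def fetch_from_chunkserver(server: str, handle: int, offset: int, length: int) -> bytes:
    """Stand-in for the data RPC to a chunkserver."""
    raise NotImplementedError("illustrative stub")

def gfs_read(path: str, offset: int, length: int) -> bytes:
    """Client-side read: translate the byte offset into a chunk index, ask the
    master (locate() above) for the handle and replica locations, then fetch
    the data directly from one chunkserver. Assumes the read stays within one
    chunk; a real client would also prefer the nearest replica and cache the
    master's answer."""
    chunk_index = offset // CHUNK_SIZE
    handle, locations = locate(path, chunk_index)   # metadata-only RPC to the master
    server = next(iter(locations))                  # pick any replica
    return fetch_from_chunkserver(server, handle, offset % CHUNK_SIZE, length)
```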

Write requests: The client sends a request to the master server, which replies with the locations of the primary and secondary replicas for the chunk; the client pushes the data to all of them, and the primary dictates the order in which the mutation is applied. If a write request affects multiple chunks of a particularly large file, GFS breaks the overall request into an individual request for each chunk.
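The chunk-splitting step is just offset arithmetic; a small illustrative helper:

```python
CHUNK_SIZE = 64 * 1024 * 1024

def split_write(offset: int, length: int) -> list[tuple[int, int, int]]:
    """Break a large write into per-chunk pieces:
    (chunk_index, offset_within_chunk, piece_length) for each affected chunk."""
    pieces = []
    end = offset + length
    while offset < end:
        chunk_index, within = divmod(offset, CHUNK_SIZE)
        piece = min(CHUNK_SIZE - within, end - offset)
        pieces.append((chunk_index, within, piece))
        offset += piece
    return pieces

# A 100 MB write starting at byte 0 becomes two per-chunk requests:
print(split_write(0, 100 * 1024 * 1024))
# [(0, 0, 67108864), (1, 0, 37748736)]
```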

Stale Chunks: Each chunk carries a version number, which the master increments whenever it grants a new lease on the chunk (i.e., before a new round of mutations). This is how the master server can tell valid, up-to-date replicas from invalid copies, known as stale replicas.
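A toy version of that bookkeeping (the server names and values are made up):

```python
def find_stale_replicas(master_version: int,
                        replica_versions: dict[str, int]) -> list[str]:
    """Replicas whose chunk version number lags the master's are stale: they
    missed mutations (e.g., while their chunkserver was down) and are garbage
    collected rather than served to clients."""
    return [server for server, version in replica_versions.items()
            if version < master_version]

print(find_stale_replicas(7, {"cs-1": 7, "cs-2": 6, "cs-3": 7}))  # ['cs-2']
```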

Reliable system built on unreliable components: GFS components report their status through periodic messages called heartbeats and handshakes, which let the master stay current with each chunkserver's state. The GFS developers built functions into the system to compensate for the inherent unreliability of individual components: master and chunk replication, a streamlined recovery process, rebalancing, stale replica detection, garbage collection, and checksumming.

Bandwidth focus: Google developers are greatly concerned with bandwidth because Google applications manipulate very large files and many users are making requests. By lagging behind the leading edge of hardware technology, Google can purchase equipment and components at bargain prices. The structure of the GFS is such that it's easy to add more machines at any time and rebalance the workload.

Future viability: GFS seems to provide Google the necessary file system for an application such as search. However, this paper was written five years ago, and Google has since expanded into many different applications. Google’s assumptions about file use (huge files that are mostly appended to and read sequentially) will not hold for newer applications that require frequent updates to data that must remain available for reading. This paper does establish a basic principle that will hold: data can only be made reliably available to many clients through replication across many servers, each of which can independently serve the data to clients.

Wednesday, February 4, 2009

Ebay: Scale!

Problem: eBay’s problem is the same problem shared by any large site on the internet today: how to handle many users accessing large amounts of services and data, with the data being updated millions (or billions) of times per day.

Solution: eBay’s solution relies upon five principles that eBay has largely embraced: partitioning, asynchrony, automation of configuration, acceptance of failure, and (almost) embracing inconsistency. My main criticism of the eBay work rests upon the requirement of immediate consistency for bids and purchases. I wonder how seriously they have considered near-immediate consistency, with conflicts resolved by the application using timestamps, for the cases where “immediate consistency” is required at present.

I am heartened by eBay’s use of third-party cloud services, and I am curious about the present implementation details. I noticed the mention of event-based transactions in “eBay’s Scaling Odyssey”, but there are scant details. It seems that eBay has separated the application from the storage in delivering their service, but I wonder how fundamental that separation is in the implementation.

Finally, I am curious as to how open eBay would be to a completely distributed storage layer.

Future influence: I think that eBay’s work will be highly influential in the coming decade. eBay seems willing to embrace a separation between the application and data in order to provide a scalable dynamic service to many millions of users per day. eBay’s present work is a stepping stone to a truly scalable model of services and sub-services that will be adopted by any similar large deployment out of necessity.

I would like some comment from the class on my thoughts here. We should embrace the blog structure on the paper comments for sharing ideas.

A Policy-Aware Switching Layer for Data Centers

Problem: Managing an enterprise or data center network is difficult because middleboxes must sit on the physical path of the traffic they are meant to process. This is not a problem for companies with little need for firewalls and the like, but it becomes one in networks with many middleboxes; financial institutions, for example, have multiple layers containing many of them. Middleboxes end up placed at chokepoints in the network, since traffic must pass through a particular point for firewalls and load balancers to see it.

Solution: The solution proposed by the authors takes middleboxes off the physical network path and instead routes traffic through them explicitly. Today, traversal is implicit: a packet passes through a firewall only because the firewall happens to lie on its path. With policy-aware switches, explicit routing through the required middleboxes guarantees the data actually traverses them, regardless of where they are plugged in.
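A rough sketch of policy-driven explicit redirection (the policy format and traffic classes here are made up for illustration; the paper's pswitches classify actual frames):

```python
# Hypothetical policy: which middlebox sequence each traffic class must traverse.
POLICY = {
    ("internet", "web_tier"): ["firewall", "load_balancer"],
    ("web_tier", "db_tier"):  ["firewall"],
    ("web_tier", "web_tier"): [],   # intra-tier traffic skips middleboxes
}

def next_hop(src_tier: str, dst_tier: str, traversed: list[str]) -> str:
    """Explicitly forward the frame to the next required middlebox; only after
    the whole sequence has been traversed does it go to the real destination."""
    required = POLICY.get((src_tier, dst_tier), [])
    for middlebox in required:
        if middlebox not in traversed:
            return middlebox
    return "destination"

print(next_hop("internet", "web_tier", []))             # 'firewall'
print(next_hop("internet", "web_tier", ["firewall"]))   # 'load_balancer'
```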

Tradeoffs: The solution uses a little extra bandwidth, since traffic may take an indirect path, but explicit routing ensures correctness. Other solutions to the middlebox bottleneck amount to throwing resources at the problem: more middleboxes at the chokepoints.

Future Influence: The most important contribution of this paper is that it makes the case for higher-level policy control in a data center, specifically the use of indirection within the data center. Many data center managers are happy with what they have, but that inertia is a smaller hurdle than introducing a new architecture into the internet. Such a proposal is feasible in a data center because new deployments, and even upgrades, offer clean slates.

Tuesday, February 3, 2009

DCell: A Scalable and Fault-Tolerant Network Structure for Data Centers

This paper attempts to solve the same problem as A Scalable, Commodity Data Center Network Architecture: the high cost of high-end non-commodity switches and the lack of full bandwidth across the entire data center. DCell pushes the network architecture into the servers themselves. DCell requires that each server have at least two GigE connections, which can be met with a readily available quad-port Ethernet adapter; in data centers with tens to hundreds of thousands of servers, it would be desirable to cut costs with a dual-port adapter instead. The modifications to servers are not likely to be a drawback, since such a setup would be used in a new deployment (the interconnection costs have already been sunk in existing deployments).

The authors focus on three design goals for data center networks (DCN). First, the network infrastructure must be scalable to a large number of servers and allow for incremental expansion. Second, DCN must be fault tolerant against various types of server failures, link outages, or server-rack failures. Third, DCN must be able to provide high network capacity.

The main benefits of the DCell approach are its high degree of fault tolerance and its ability to scale to millions of servers with only a level-3 DCell and truly commodity 8-port switches.
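That scale claim follows from DCell's recursive construction: a level-k DCell is built from t_{k-1} + 1 copies of the level-(k-1) DCell, so t_k = t_{k-1} * (t_{k-1} + 1). A quick calculation for 8-port switches (n = 8 servers per DCell_0):

```python
def dcell_servers(n: int, k: int) -> int:
    """Servers in a level-k DCell built from n-port mini-switches:
    t_0 = n, and t_k = t_{k-1} * (t_{k-1} + 1)."""
    t = n
    for _ in range(k):
        t = t * (t + 1)
    return t

for level in range(4):
    print(level, dcell_servers(8, level))
# 0 8
# 1 72
# 2 5256
# 3 27630792
```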

The main drawback of this approach is the complexity of the wiring for even a single low-level DCell, especially when the structure is repeated many times and built up to level 2 or higher. This is an opportunity for modular data centers pre-wired into DCells by the manufacturer, with clear (possibly color-coded) instructions for interconnecting the modules to one another and to the outside world.

A Scalable, Commodity Data Center Network Architecture

Internet services increasingly employ service-oriented architectures, where the retrieval of a single web page can require coordination and communication among many individual sub-services running on remote nodes. At present, data center networks are expensive due to the use of non-commodity switches at the highest levels of the hierarchy. Even with the high cost of these switches, bandwidth can become a bottleneck when many servers in one section of the network try to communicate with servers in another section.

In this paper, the authors show how to leverage largely commodity Ethernet switches to support the full aggregate bandwidth of clusters consisting of tens of thousands of elements. The paper sets out to design a data center communication architecture that meets the following goals: scalable interconnection bandwidth, economies of scale through commodity Ethernet switches, and backward compatibility with hosts running Ethernet and IP.

Today, the price differential between commodity and non-commodity switches provides a strong incentive to build large-scale communication networks from many small commodity switches rather than fewer larger and more expensive ones. (The table showing the cost decrease of GigE and 10 GigE is very illustrative of the strength of the authors' argument.) Also, the commodity switches use less power and give off less heat. Use of commodity switches is even more compelling due to the increasing data center energy/heat density. The authors make an important point that an approach that uses commodity switches will be the only way to deliver full bandwidth for large clusters once 10 GigE switches become commodity at the edge.
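For a sense of scale, a fat-tree built from k-port switches supports k^3/4 hosts at full bisection bandwidth, which is why commodity parts suffice:

```python
def fat_tree_hosts(k: int) -> int:
    """Hosts supported by a fat-tree of k-port switches: each of the k pods
    has k/2 edge switches with k/2 host-facing ports each, giving k^3/4 hosts
    at full bisection bandwidth."""
    return k ** 3 // 4

for ports in (8, 24, 48):
    print(ports, fat_tree_hosts(ports))
# 8 128
# 24 3456
# 48 27648
```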

Another strong motivation to use such an interconnected network topology is the fault-tolerance provided by multiple paths. Such a network would be able to continue for some time without immediate repair, or indefinitely if sealed in a modular data center.

The authors chose to ignore wiring costs in the evaluation of their strategy. However, the complexity involved in properly wiring up such a system is not trivial; cabling mistakes are a frequent source of errors in data centers. This actually provides an opportunity for switch manufacturers to sell prepackaged solutions with the switches preloaded with the modified routing algorithms and pre-wired interconnections.

In the future, bandwidth will increasingly become the scalability bottleneck in large-scale clusters. Existing solutions for addressing this bottleneck center around hierarchies of switches, with expensive, non-commodity switches at the top of the hierarchy. Larger numbers of commodity switches have the potential to displace high-end switches in data centers in the same way that clusters of commodity PCs have displaced high-end servers. While this specific approach may not be taken, this paper and the DCell paper will hopefully inspire more research on the subject.

Designing a highly available directory service

This article provides strategies for dealing with failures in a directory service.

The chapter broadly defines failure as anything that prevents the Directory Server from providing the minimum level of service required, and divides failures into two classes: system unavailable and system unreliable (a classification I did not find particularly enlightening).

The chapter presents two backup strategies: binary (a full binary copy of the database) and LDIF (a logical backup based on differences from the previous version). Binary backup is a full copy and is faster to restore than LDIF. LDIF gives greater control over granularity, but in situations where rapid restoration is required, it may take too long to be viable.

The paper then describes replication topologies ranging from a single data center up to five data centers. The high-level idea is to have one or two master servers per data center, linked in an interconnected topology with redundant links and possible backup links. The architecture seems to cap the number of master servers at four, even in the five-data-center topology (but the chapter does not explain why).

This article gave a good perspective on current industry thinking about replication and backup. It would be interesting to see how well such a model scales. The consistency scheme is probably configured manually based upon the interconnections and where the master servers are placed; as a result, it is unlikely to scale to the multi-thousand-server clusters that are becoming prevalent.

Sunday, February 1, 2009

Failure Trends in a Large Disk Drive Population

This paper attempts to understand factors that affect the reliability of high-capacity magnetic disk drives. Such understanding is useful for guiding the design of storage systems and data centers. This study used a much larger population size than previous studies and presents a comprehensive analysis of the correlation between failures and several parameters that are believed to affect disk lifetime.

The study looked at various factors such as utilization, temperature, and SMART signals (self-monitoring signals).

The authors chose to measure utilization in terms of weekly averages of read/write bandwidth per drive. Apart from infant mortality, the data indicates a weaker correlation between utilization levels and failures than previous studies had found. It would be interesting to see whether there is any correlation between failure rates and the number of read/write operations, since a drive performing long sequential reads may be under less stress.

The authors reported noticeable temperature effects only at the high end of the temperature range (greater than 45 degrees C), and mainly for older drives. This result could indicate that data center or server designers have more freedom than previously thought when setting operating temperatures for equipment that contains disk drives: data centers can run warmer to lower cooling costs with minimal effect on drive failures, and in a system designed to tolerate such failures, this tradeoff may be cost effective. Also, since hard drives show a higher infant mortality rate at low temperatures, pre-shipment burn-in testing should be done at low temperatures.

The authors were able to see that most age-related results are strongly affected by drive manufacturer and model. However, they do not show a breakdown of drives per manufacturer, model, or vintage “due to the proprietary nature of these data.” The only reason this information is “proprietary” is that Google wishes to maintain a competitive advantage over other companies purchasing large numbers of hard drives, and withholding it is the most unfortunate decision made by the authors of the paper. Making this information public would allow manufacturers to understand exactly which conditions correlate with failures, in light of the design and manufacturing decisions made for each particular drive model.

The long-term value of this study will come from publishing periodic results, allowing companies to use the knowledge to reduce costs and hard drive manufacturers to attempt design, manufacturing, and testing changes that decrease failures under real-world conditions.

Failure Stories

The problem examined by these articles is how to deal with and prevent data center failures.

The articles analyze why various data centers failed and how similar failures are to be prevented in the future. However, the limit of such an analysis is summed up by the Shumate article: “Every data center is unique - there is no other like it. Every design is a custom solution based on the experience of the engineer and the facility executive.” The same article continues with an overview of design principles for a reliable data center. The fundamental lesson is that one cannot guarantee that a data center will not fail; while the manager should provide enough redundancy to minimize the chances of a failure, there should also be an approach to “fail small,” as advised in the Miller article.

As we discussed with the previous articles, even the most fault-tolerant hardware is not infallible. There is now an increased focus on providing fault tolerance through software, and this model should be extended across data centers so that even if an entire data center fails, the service is degraded only by that data center’s fraction of the total available resources.

The case study of 365 Main’s outage illustrates what could have been gained from an approach that did not depend on the reliability of the power grid or on whether backup systems respond effectively. The power outage affected “certain customers” (Craigslist, Technorati, LiveJournal, TypePad, AdBrite, the 1Up gaming network, Second Life and Yelp, among others) in 3 of the 8 colocation rooms in 365 Main’s San Francisco data center. If the data center had operated on a distributed model that allowed any machine to handle any request, it would still have been able to provide 5/8 of full capacity to all of its customers. One may argue that this would have been worse for the customers who were not affected, but those customers should realize that they were merely lucky and cannot rely upon 365 Main’s assurances of uninterrupted service. In fact, with replication and distributed storage and service across data centers, 365 Main’s outage would likely not have affected customers’ service at all, since 365 Main operates seven data centers.

Many companies have already realized that an architecture built from clusters of thousands of commodity-class PCs, with software providing the fault tolerance, can deliver throughput at a fraction of the cost of a system built from fewer, more reliable, more expensive high-end servers. Replication across these clusters also addresses the inherent unreliability of each machine: any failure merely degrades overall performance by the fraction of the machines that have failed. In the coming years, data center managers will realize that the resources devoted to “guaranteeing” that a data center will not fail would be better spent providing a greater number of computers at many different locations.