Monday, October 19, 2009

Xen and the Art of Virtualization

Modern computers are sufficiently powerful to use virtualization to present the illusion of many smaller virtual machines (VMs), each running a separate operating system instance. Xen uses paravirtualization in its virtual machine monitor (VMM). A VMM is the layer between software and hardware that allows more than one guest OS to run concurrently. Xen achieves good performance and isolation for each VM while running unmodified applications and supporting full operating systems.

When Xen began its development, it could not support a guest OS "as is"; patches had to be applied to the guest kernel. Today, architecture extensions allow unmodified kernels to run over Xen. The IA-32 architecture uses a four-ring privilege model: the most privileged code (usually the OS, but now the Xen VMM) runs in ring 0, rings 1 (now running the guest OS) and 2 are mostly unused, and ring 3 is home to the applications.

The Xen architecture is as follows:
-The hypercall interface allows domains to perform an asynchronous software trap into the hypervisor to carry out a privileged operation, analogous to the use of system calls in conventional operating systems.
-Xen allows guests to manage their own page tables (PTs), but limits their direct access to read-only. Modifying a PT is permitted only after Xen verifies the change.
-The hardware interrupt system is replaced by a lightweight event system.
-Xen contains no device drivers. Each driver request forwarded by a guest OS is redirected by Xen to the host OS device driver, allowing the guest OS to interact with the hardware while remaining hardware-independent.
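The page-table rule can be illustrated with a toy model (all class and method names here are invented for illustration; real Xen uses batched mmu_update hypercalls on machine frame numbers): the guest reads its page table directly, but every update traps into the hypervisor, which validates it.

```python
# Toy model of hypervisor-validated page-table updates (illustrative only).

ALLOWED_FRAMES = {0x1000, 0x2000, 0x3000}  # frames this guest may map

class Hypervisor:
    def __init__(self):
        self.page_table = {}  # guest-visible, read-only to the guest

    def read_entry(self, vpage):
        # Reads need no hypercall: the guest's PT is mapped read-only.
        return self.page_table.get(vpage)

    def mmu_update(self, vpage, frame):
        # All writes trap into the hypervisor, which validates them.
        if frame not in ALLOWED_FRAMES:
            raise PermissionError("guest tried to map a frame it does not own")
        self.page_table[vpage] = frame

hv = Hypervisor()
hv.mmu_update(0, 0x1000)       # accepted: the guest owns this frame
print(hv.read_entry(0))        # direct read, no trap
try:
    hv.mmu_update(1, 0x9999)   # rejected: not the guest's memory
except PermissionError as e:
    print("rejected:", e)
```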

The scalability provided by Xen's lightweight hypervisor allows for many VMs on each physical machine, which in turn allows for efficient, fine-grained, and dynamic management of VMs. Xen's network performance is important because the VMs serving an application will span many physical machines. For example, an application's need for 1,000 machines might be serviced by 4,000 VMs, each running on a quarter of a physical machine.

Please see a previous post for additional comments:
Virtualization vs. OS Separation

Wednesday, October 7, 2009

Parallel Database Systems

Parallel Database Systems: The Future of High Performance Database Systems
by David DeWitt and Jim Gray

I remember reading this paper years ago. Rereading it alongside the MapReduce paper puts the parallel DB mindset into a modern context. Anyone who has read the MapReduce paper should read this article.
MapReduce summary

These papers give perspective on how the problem can expand and be attacked more than a decade into the future. The Parallel DB paper goes over some barriers to speedup: startup, interference, skew. The scale of the problem and needed parallelism has increased dramatically since the parallel DB article was written (multi-PB vs multi-TB). There was also great foresight in this paper concerning the use of local computation on commodity processors and memory (shared-nothing architecture).

MapReduce has attempted to formalize and attack these problems and remove them from the input given by the requester. MapReduce can be thought of as a type of parallel SQL. MapReduce attacks each of the barriers to speedup (see MapReduce summary):
Startup: MapReduce amortizes startup costs by reusing a streamlined, general-purpose framework across jobs.
Interference: The MapReduce workers are distributed evenly and placed as close as possible to the data they must operate on.
Skew: MapReduce deals with the straggler problem by speculatively re-executing slow tasks.

Analysis: The Parallel DB paper tried to look ahead at the evolution of high-performance DB systems. The authors did not, and could not, have foreseen the exponential expansion of the Internet. The scale of demand on DB systems requires a change in mindset. Traditionally, the speed of the RDBMS was increased behind the scenes while maintaining all constraints. Today, we accept that there are fundamental tradeoffs (see the CAP theorem). By loosening some of the constraints (such as strict consistency), we can achieve the performance provided by modern distributed storage systems.

Monday, October 5, 2009

Lottery Scheduling: Flexible Proportional-Share Resource Management

Lottery Scheduling: Flexible Proportional-Share Resource Management
Carl A. Waldspurger and William E. Weihl
I implemented lottery scheduling in my CS162 project. Lottery scheduling is a really simple and elegant idea. The ability to support a large number of threads created a need for a better proportional-share/priority scheduling algorithm. Lottery scheduling works as follows: give each job some number of lottery tickets; on each time slice, randomly pick a winning ticket; CPU time is then proportional (on average) to the number of tickets given to each job. Lottery scheduling is probabilistically fair and behaves gracefully as load changes. Increasing the number of tickets also reduces the mean waiting time for each thread and gets closer to ideal fairness. There is even a form of priority donation, accomplished by transferring a number of tickets to a client that is blocking. The lottery scheduling mechanism achieves this fairness with relatively little overhead. A tree-based lottery only needs to generate a random number and perform O(log n) additions and comparisons to select a winner among n clients. Lotteries can also be used to manage many diverse resources: processor time, I/O bandwidth, and lock access. Lottery scheduling can even be used for scheduling communication resources.
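The draw itself fits in a few lines. This sketch (job names and ticket counts are made up) builds cumulative ticket totals so the winner can be found with a binary search, in the spirit of the paper's O(log n) tree lottery:

```python
import bisect
import random

def hold_lottery(tickets, rng):
    """Pick a winning job; P(win) is proportional to its ticket count."""
    jobs = list(tickets)
    # Cumulative ticket totals let bisect find the winner in O(log n).
    totals, running = [], 0
    for j in jobs:
        running += tickets[j]
        totals.append(running)
    winning_ticket = rng.randrange(running)          # 0 .. total-1
    return jobs[bisect.bisect_right(totals, winning_ticket)]

rng = random.Random(0)
tickets = {"A": 75, "B": 25}                         # A should win ~3x as often
wins = {"A": 0, "B": 0}
for _ in range(10000):
    wins[hold_lottery(tickets, rng)] += 1
print(wins)  # roughly 7500 for A, 2500 for B
```

Over 10,000 slices the allocation converges on the 3:1 ticket ratio, which is exactly the "probabilistically fair" behavior described above.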

Analysis: Lottery scheduling offers probabilistic guarantees for throughput and response time. This results in increased expected error (variability) as the number of allocations increases. The lottery scheduling authors responded with stride scheduling to limit the variability and make the error independent of the number of allocations. Stride scheduling still uses tickets to determine proportional throughput among clients.
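The deterministic alternative can be sketched as follows (a minimal version of the stride mechanism; the STRIDE1 constant and client names are illustrative): each client advances a "pass" value by a stride inversely proportional to its tickets, and the scheduler always runs the client with the minimum pass.

```python
def stride_schedule(tickets, quanta):
    """Deterministic proportional share: always run the client with the
    minimum pass value, then advance its pass by stride = STRIDE1/tickets."""
    STRIDE1 = 1 << 20
    stride = {c: STRIDE1 // t for c, t in tickets.items()}
    passes = {c: stride[c] for c in tickets}  # start each client at one stride
    schedule = []
    for _ in range(quanta):
        c = min(passes, key=lambda k: (passes[k], k))  # tie-break by name
        schedule.append(c)
        passes[c] += stride[c]
    return schedule

sched = stride_schedule({"A": 3, "B": 1}, 8)
print(sched.count("A"), sched.count("B"))  # 6 2 — an exact 3:1 split
```

Unlike the lottery, the 3:1 ratio here is exact over any window, which is why the error no longer grows with the number of allocations.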

Monday, September 28, 2009

On Optimistic Methods for Concurrency Control

In this paper, the authors present two methods of non-locking concurrency control for DBMSs: serial and parallel validation. These methods are “optimistic” because they rely on the hope that conflicts will not occur. The authors first make it clear that locking approaches to concurrency control have numerous disadvantages (locking overhead, deadlock). The argument for “optimistic” concurrency control is as follows: reads are completely unrestricted, since they can never cause a loss of integrity, while writes are restricted. A transaction consists of two or three phases: a read phase, a validation phase, and a possible write phase. These methods may be superior to locking methods for systems where transaction conflict is highly unlikely, such as query-dominant systems and very large tree indexes. Such an optimistic system would be inefficient where transaction conflicts are not rare. The paper suggests that a system should vary the mix of locking versus optimistic approaches as the likelihood of transaction conflict in the system varies. However, the authors do not delve into such an analysis.
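The core validation check can be sketched simply (a simplified view of serial validation; it ignores transaction numbers and write-phase ordering, and the item names are illustrative): a transaction passes only if no transaction that committed during its read phase wrote an item it read.

```python
def validate(txn_read_set, committed_write_sets):
    """Serial-validation sketch: fail if any concurrently committed
    transaction wrote an item this transaction read."""
    for write_set in committed_write_sets:
        if txn_read_set & write_set:
            return False  # conflict: abort and restart the transaction
    return True

# T read x and y; meanwhile T1 committed a write to z, T2 a write to y.
print(validate({"x", "y"}, [{"z"}]))         # True: no overlap, T commits
print(validate({"x", "y"}, [{"z"}, {"y"}]))  # False: y was overwritten, restart
```

Note how the conflict check happens only at commit time: reads proceed with no locking at all, which is exactly the "optimistic" bet the paper describes.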

Granularity of Locks and Degrees of Consistency in a Shared Data Base

This paper discusses the granularity of locks in a database management system: small granularity means high concurrency but high overhead; large granularity means low overhead but low concurrency. Fine lockable units are preferable for a simple transaction that accesses few records, and coarse lockable units are preferable for a complex transaction that accesses many records. Therefore, it is favorable to have lockable units of different granularities coexisting in the same system. This can be accomplished with hierarchical locks, which allow one to lock a node and all of its descendants. Various access modes are used to ensure that two lock requests by two different transactions are compatible (can be granted concurrently). These ideas can be generalized to directed acyclic graphs (DAGs), which are more general than trees. We also have to remember that indices and files are created and destroyed continually, so the lock graphs are dynamic. While there are issues that must be considered to ensure the correctness of this added complication to a DBMS, it is worthwhile due to the resulting performance gain.

Tuesday, September 22, 2009

Experience With Processes and Monitors in Mesa

This paper discusses how to deal with threads (processes running concurrently within a program). This was the period, 30 years ago, when concurrency in programs was increasing. The paper tackles the issues involved in using monitors to synchronize these threads. The authors chose to synchronize processes with preemptive scheduling of light-weight processes and monitors, driven by several requirements: the need to do I/O, support for multiprocessors, information hiding between modules, and more structured locking.

This paper made an attempt to work out the subtle issues involved in implementing a system running many light-weight processes concurrently. Today, the layer of the system that provides such concurrency works effectively and allows us to build many complex, highly concurrent systems.
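Mesa's monitor semantics survive directly in today's condition variables: notify is a hint, not a guarantee, so a woken waiter must re-check its condition in a loop. A minimal sketch (the buffer and thread names are illustrative):

```python
import threading

# Mesa-style "notify as a hint": a woken waiter re-checks its condition,
# since another thread may have run (and consumed the item) first.
buf = []
cond = threading.Condition()

def consume():
    with cond:
        while not buf:          # "while", never "if", under Mesa semantics
            cond.wait()
        return buf.pop()

def produce(item):
    with cond:
        buf.append(item)
        cond.notify()           # hint that the condition may now hold

t = threading.Thread(target=produce, args=(42,))
t.start()
item = consume()
t.join()
print(item)  # 42
```

Hoare-style monitors hand the lock directly to the signaled waiter, so an "if" would suffice; Mesa's cheaper hint-based design is what made the "while" loop the enduring idiom.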

Wednesday, September 16, 2009

Lightweight Recoverable Virtual Memory

Goal: How simple can a transactional facility be, while remaining a potent tool for fault-tolerance? The authors omitted what they could without crippling RVM.
RVM is intended for Unix applications with persistent data and the paper is presented in three parts: rationale, architecture, and implementation.

Rationale: Existing solutions, such as Camelot, were too heavyweight. The solution was a light version that provides only recoverable virtual memory. The central principle the authors adopted in designing RVM was to value simplicity over generality, following the systems principle of building simple components that do one thing well.

Architecture: LRVM is managed in segments as opposed to pages. Regions of the segments are mapped to Unix virtual memory. There is no aliasing: mappings cannot overlap in virtual memory and no region of a segment may be mapped more than once by the same process. Regions can be unmapped at any time, as long as they have no uncommitted transactions outstanding. The designers focused on using meta-data to minimize the disk space that recoverable memory would occupy. Many other small features were also omitted.

Log management: No undo/redo log is needed; the value-logging strategy never reflects uncommitted changes to an external data segment, and crash recovery is idempotent. Log truncation is also handled.
Optimizations: Intra-transaction optimizations arise when set-range calls specifying identical, overlapping, or adjacent memory addresses are issued within a single transaction. Inter-transaction optimizations occur only in the context of no-flush transactions.
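The redo-only value-logging idea can be sketched as a toy (the class and method names echo LRVM's begin/set-range/commit interface but are my own; the real system writes new values to a disk log at commit):

```python
# Toy redo-only (value-logging) transactions over a byte array.
class RecoverableMemory:
    def __init__(self, size):
        self.mem = bytearray(size)   # the "recoverable" region
        self.log = []                # committed new-value records only
        self.pending = []

    def begin(self):
        self.pending = []

    def set_range(self, offset, data):
        # Record the new value; uncommitted data never reaches the log.
        self.pending.append((offset, bytes(data)))
        self.mem[offset:offset + len(data)] = data

    def commit(self):
        self.log.extend(self.pending)  # "flush" new values to the log
        self.pending = []

    def recover(self, size):
        # Redo-only: replay committed new values; no undo pass is needed.
        mem = bytearray(size)
        for offset, data in self.log:
            mem[offset:offset + len(data)] = data
        return mem

rvm = RecoverableMemory(8)
rvm.begin(); rvm.set_range(0, b"ok"); rvm.commit()
rvm.begin(); rvm.set_range(4, b"no")   # crash before commit
print(rvm.recover(8))  # only the committed "ok" survives; "no" is lost
```

Because nothing uncommitted ever reaches the log, replaying it any number of times yields the same state, which is why recovery is idempotent.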

Result: RVM handles more transactions/sec than Camelot in all cases. However, performance drops significantly when the recoverable memory size approaches the physical memory size.

Analysis: A lightweight RVM can provide the necessary functionality while improving performance and modularity. This lesson can be applied generally when building a complex system: additional features can be added later as components built on top of the system. While this paper did not demonstrate the idea, complex distributed systems today, such as Google's, are built from layered component blocks, each providing a given functionality.

Tuesday, September 15, 2009

Segment-Based Recovery: Write-ahead logging revisited

Large-scale distributed systems have challenges that are not adequately addressed by previous recovery systems, or previous architectures based on single systems in general. This paper proposes segment-oriented recovery to enable distributed recovery architectures that are not hindered by the tight coupling of components required by page-oriented recovery. The distributed variations are quite flexible and enable recovery to be a large-scale distributed service.
Page-oriented recovery leads to a tight coupling between the application, the buffer manager, and the log manager. The tight coupling might be fine on a traditional single-core machine, but it leads to performance issues when distributing the components to different machines and different cores. Segment-oriented recovery enables simpler and looser coupling among components. Write-back caching reduces communication between the buffer manager and the application, since communication occurs only on cache eviction. Since there is no shared state, calls to the buffer manager and log manager can be asynchronous. The use of natural layouts for large objects allows DMA and zero-copy I/O in the local case. In the distributed case, this allows application data to be written without copying, and without forcing the data and the LSNs onto the same machine. This allows for very flexible large-scale write-ahead logging as a service for cloud computing.
For small transactions, the networked version is roughly ten times slower than the local versions, but approximately 20 times faster than a distributed, page-oriented approach. As transaction sizes increase, segment-based recovery is better able to amortize network round trips due to log and buffer manager requests, and network throughput improves to more than 400 times that of the page-based approach. As above, the local versions of these benchmarks are competitive with local page-oriented approaches, especially for long transactions.

Analysis: The asynchronous approach is necessary for large-scale distributed systems. Also, recovery based on the granularity of application requests is more in line with the transactions in present large-scale Internet systems. However, the focus on large transaction size may not be valid for the cloud computing that this approach targets. The retrieval of a single web page can require communication with hundreds of small sub-services running on remote nodes. Perhaps some sort of lightweight version (we should consider letting go of some constraints if the application does not require them) would be preferable for the short transactions.

This up-to-date paper examining modern challenges was well placed in the course. I hope we have more such current papers throughout the course.

Tuesday, September 8, 2009

ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-ahead Logging

ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-ahead Logging
ARIES attempts to better address the problem of guaranteeing atomicity and durability of transactions despite concurrent execution of multiple transactions and various failures. ARIES uses a log to record the progress of a transaction and its actions that change recoverable data. The logging method allows partial or total rollbacks of transactions during both normal and restart processing. Many issues are discussed, such as latches and locks, fine-granularity locking, and buffer management. The paper goes over the data structures used by ARIES, how ARIES processes transactions, restart processing, and recovery.
There are three phases in the ARIES recovery algorithm:
-Analysis: Scan the log forward from the most recent checkpoint to identify all Xacts that were active, and all dirty pages in the buffer pool at the time of the crash.
-Redo: Redoes all updates to dirty pages in the buffer pool to ensure that all logged updates are carried out and written to disk.
-Undo: Working backwards in the log, the writes of all Xacts that were active at the crash are undone by restoring the before value of the update.
A crash occurring during the recovery process is also handled.
This paper discussed a detailed system for providing atomicity while still allowing us to have multiple transactions at once and recover from inevitable failures.
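The three passes can be sketched over a toy value log (real ARIES tracks LSNs, a dirty-page table, and compensation log records, all omitted here; the transaction and page names are invented):

```python
# Simplified sketch of the three ARIES recovery passes.
log = [
    ("update", "T1", "pageA", {"before": 0, "after": 1}),
    ("update", "T2", "pageB", {"before": 0, "after": 2}),
    ("commit", "T1"),
]  # crash: T2 never committed

def recover(log):
    # Analysis: find transactions still active (no commit) at the crash.
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    pages = {}
    # Redo: repeat history — reapply every logged update, winners and losers.
    for rec in log:
        if rec[0] == "update":
            _, _, page, vals = rec
            pages[page] = vals["after"]
    # Undo: scan backwards, restoring before-images for loser transactions.
    for rec in reversed(log):
        if rec[0] == "update" and rec[1] not in committed:
            _, _, page, vals = rec
            pages[page] = vals["before"]
    return pages

print(recover(log))  # {'pageA': 1, 'pageB': 0}: T1 durable, T2 rolled back
```

The "repeat history" redo pass is the distinctive ARIES choice: even the loser T2's update is first redone and only then undone, which is what makes recovery after a crash during recovery safe.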

The HP AutoRAID Hierarchical Storage System

Management of redundant systems can be complicated. The HP AutoRAID attempts to solve this problem with a two-level storage hierarchy abstracted behind a single disk-array controller. Write-active data are mirrored for better performance, while write-inactive data are stored in RAID 5 for better cost. The system begins by allocating mirrored space until no more can be stored this way, then automatically reallocates some of the storage space to RAID 5. The HP AutoRAID adapts to workload changes: newly active data are promoted to mirrored storage, while data that have become less active are demoted to RAID 5. The HP AutoRAID also accommodates disks of different capacities, allowing easy expansion and upgrade. The system can even have a second controller that it automatically fails over to if the primary controller fails. The system is simple to set up and administer. The HP AutoRAID system seems to do a good job of providing performance close to that of nonredundant systems for many workloads. At the same time, it provides full data redundancy and can tolerate the failure of any single array component.
This system is a classic example of a system that achieves its function while abstracting away how that function is carried out. Has this system become obsolete with the trend toward clusters of networked computers? The cost of storage has gotten so cheap that RAID is not needed in such systems. Redundancy can be achieved by replicating data across multiple computers.

A History of Improving File System Performance

A Fast File System for UNIX
This paper begins by mentioning the types of applications, VLSI design and image processing, that do a small amount of processing on a large quantity of data and need high throughput from the file system. In the traditional UNIX file system, the inodes are not consecutively ordered, which results in many disk accesses. To improve file system throughput, a larger block size was used. But a larger block size results in a larger percentage of wasted space. The design team countered this waste by putting small files in fragments within blocks. When the new file system was tested, there were order-of-magnitude increases in read and write bandwidth. The price paid is wasted disk space and increased CPU utilization. With CPU speed and disk size increasing exponentially and disk bandwidth lagging far behind, this seems like a desirable tradeoff.
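The block/fragment tradeoff reduces to simple arithmetic. This sketch uses an illustrative 1100-byte file (not a figure from the paper) to show how fragments recover the space a large block would waste:

```python
import math

def allocated(file_size, block_size, frag_size=None):
    """Bytes allocated for a file: whole blocks, with the tail in fragments."""
    if frag_size is None:
        return math.ceil(file_size / block_size) * block_size
    full_blocks = (file_size // block_size) * block_size
    tail = file_size - full_blocks
    return full_blocks + math.ceil(tail / frag_size) * frag_size

size = 1100  # a small file, in bytes
print(allocated(size, 512))          # 1536: old 512-byte blocks, little waste
print(allocated(size, 4096))         # 4096: 4 KB blocks waste ~73% of the block
print(allocated(size, 4096, 1024))   # 2048: 1 KB fragments recover most waste
```

Large blocks buy bandwidth on big files; fragments keep the many small files from paying the full price for it.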

The Design and Implementation of a Log-Structured File System
This paper reiterates the fact that processor speed is increasing exponentially, while memory and disk speeds are being left behind. While there are ways to increase disk throughput, there are no foreseeable improvements for access time. Information is spread around the disk, which causes too many disk accesses. So it seems that we should attempt to minimize disk accesses. A log-structured file system (LFS), which writes all modifications to disk sequentially, attempts to accomplish this goal.
A traditional file system achieves logical locality by assuming certain access patterns and pays extra on writes to organize information optimally on disk for the assumed read patterns. In contrast, LFS achieves temporal locality: information that is created or modified at the same time is grouped closely on disk. If temporal locality matches logical locality, as it does for a file that is written sequentially and then read sequentially, then an LFS should have about the same performance on large files as a traditional file system. If temporal locality differs from logical locality, the systems will perform differently: LFS will perform poorly on reads. The log structure also speeds up crash recovery, because the system only needs to scan the most recent portion of the log instead of the entire disk.
The main problem with LFS is degraded performance for logical reads, which is what journaling file systems try to address.

Analysis and Evolution of Journaling File Systems
This paper gives an overview of various journaling file systems in use today. Journaling tries to provide the best of both worlds: data is written to the log sequentially, then moved to its logical location. This provides fast crash recovery, while all normal reads occur from the standard blocks. Journaling is how modern commercial file systems work, such as Linux ext3 and Windows NTFS.

System R and Architecture of a DB Systems

System R
This paper talks about the initial origins of System R, goes through the development of prototypes, and finally evaluates and draws conclusions. The paper begins by talking about the relational data model proposed by Codd in 1970. Codd proposed that data should be represented by data values and never by any connections visible to the user, and users should not have to specify algorithms when they make a request. This is the original SQL DBMS. The question was whether the second item could be achieved well. The paper continues by specifying the key goals established for System R. The phase zero prototype was just intended to determine the feasibility of design methods and functions. Work then began on the phase one prototype. The paper continues by talking about the details of the phase one prototype. The evaluation of the phase one prototype was generally good and was considered to have been successful in its goals: simplicity, power, and independence of data. System R demonstrated that a production database system was feasible. There is not much to criticize. This is the foundation and inspiration for modern database systems. While not as evolved as today’s production systems, the evolution has been a progressive march over decades.

Architecture of a DB System
This is actually a good follow-up to the System R paper. System R gives the original thinking when the first relational DBMSs were being developed. This paper goes over what has been learned and formalized over the decades: process models, parallel architectures, ACID. Essentially, this paper is an overview of various design considerations that have come up in the history of DBMSs.

I would be curious if we are going to discuss the implications of today’s large-scale systems. Not just your CAP theorem, but the responses to these challenges (distributed file systems, virtualization). Hopefully in a bit more detail than your syllabus from last year. For example, with virtualization, could we look at and discuss various approaches being considered to manage and spawn VMs based upon need? What implications would this have for distributed storage systems? It is not as simple as spawning VMs of a stateless application layer.

Computer Systems Discussions

I am currently taking Eric Brewer's graduate computer systems class at Berkeley.
CS262a: Advanced Topics in Computer Systems

Wednesday, May 6, 2009

Virtualization vs. OS separation

There was a discussion point about virtualization software in class today:
There was a suggestion that we should advance OS research as opposed to using virtualization software due to the overhead involved. However, a return to basics may be in order. A review of the original IBM virtual machine paper may allow us to step back and get a better perspective.
Virtual storage and virtual machine concepts
A virtual machine is simply a machine sharing system. Over time, this has evolved into the OS we know today.

Virtualization software is the new datacenter OS:
-A strict separation between each cloud client is necessary to provide the illusion and security of running the application in isolation from other applications.
-A virtual machine (VM) isolating each application, which can scale to millions of users over many physical machines, is akin to the OS isolating each application for one user.
-Virtualization software and support for it (by the CPU) should get to the point where there is no more overhead than using an OS to run multiple applications.
-Virtualization software must simply be CPU/machine-time/resource-sharing software, but with a true wall between each VM.
-Communication between VMs on the same machine should be the same as communication between machines.
-Lightweight OS within a VM should be used to manage the processes involved with an application.

Paper comments:

While I agreed with some of their high-level thinking, their comments on communication between VMs stood out:
"hence synchronization and protected control transfer are only necessary when two virtual machines wish to explicitly communicate."
"where VMs do communicate, they may not only be written in separate programming languages, but may also be running completely different operating systems."

The problem with these comments is that communication among VMs running the same application should be the common case. This is necessary to achieve scalability. I wonder if their suggestion to separate control and data paths is sufficient, considering that there may be a great deal of small-scale communication between VMs.

Monday, May 4, 2009

Intel CERN whitepaper

While this paper is marketing material from Intel, it makes some good points about the value of moving to multi-core chips and using virtualization to consolidate servers with low utilization.

Problem: Most businesses already spend about half as much for the electricity to power and cool their infrastructure as they do for the hardware itself, and this percentage is expected to increase. This challenge is compounded by the design constraints of existing data centers, many of which are already running at or near thermal capacity. Unless energy efficiency is dramatically improved, organizations will be unable to expand their computing infrastructure without the expense and disruption of upgrading their data center, building a new one, or migrating to a co-location facility.
The goal was to maximize total performance per Watt for the computing infrastructure. This can allow datacenters to grow their computing, reduce their costs and extend the life of existing facilities.

Solution: CERN has found that its return on investment (ROI) is generally highest by optimizing datacenter performance/Watt. Multi-core processors based on the Intel Core microarchitecture deliver about five times more compute power per Watt than single-core processors based on the earlier Intel NetBurst microarchitecture. According to CERN, this move alone has already increased the useful life of its data center by about two years, enabling the organization to avoid the cost and disruption of adding a new facility.

This energy efficiency can be achieved with a basic understanding of circuit design. A small reduction in frequency causes a small reduction in the amount of work performed, but a relatively large drop in the amount of energy consumed. As a result, more cores running at lower frequencies can deliver substantial gains in total performance per Watt.
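The circuit argument can be made concrete with a rough model (the constants are illustrative and real voltage/frequency scaling is more complicated): dynamic power scales roughly with V²·f, and a lower frequency permits a lower voltage, so power falls much faster than throughput.

```python
# Rough model of why more, slower cores win on performance per Watt.
def perf_per_watt(freq, cores=1, v_scale=1.0):
    perf = cores * freq                     # work scales linearly with f
    power = cores * (v_scale ** 2) * freq   # P ~ V^2 * f (unit capacitance)
    return perf / power

one_fast = perf_per_watt(freq=3.0, cores=1, v_scale=1.0)
two_slow = perf_per_watt(freq=1.8, cores=2, v_scale=0.8)  # lower f, lower V

# Two cores at 1.8 GHz and reduced voltage deliver MORE total throughput
# (3.6 vs 3.0 units) at ~56% better performance per Watt.
print(round(two_slow / one_fast, 2))
```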

Virtualization can be used to consolidate smaller and infrequently used applications, which reduces the number of servers required for these secondary workloads. Better utilization provides energy efficiency gains, reduces data center footprints, and provides a more flexible and manageable infrastructure.

Conclusion: This paper brings up a couple of good points. While new datacenters may not have to optimize power consumption to the extreme that CERN does, they should be aware of the long-term consequences of their decisions, which may create limitations in the future. Also, while virtualization adds overhead and is not preferred for situations where there is high server utilization for a particular application, it can provide clear benefits when used to consolidate underutilized servers.

Microsoft PUE: Parts 1&2

Problem: Given the complexity of datacenter design and operation, energy efficiency changes must be closely monitored for overall effect. You can upgrade items to more energy-efficient equivalents, but unless you look at the big picture and understand how the pieces fit together, you could end up being disappointed with the outcome.

Solution: PUE is suggested as the indicator of whether efficiency actually got better or worse. PUE = (Total Facility Power)/(IT Equipment Power)
PUE is a simple metric that gives a big-picture view of datacenter design efficiency and cost effectiveness. Without a metric like PUE, the authors suggest, an engineer could not measure datacenter efficiency to see whether it had improved.
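The metric itself is one line; the 3 MW facility below is a hypothetical example, not a figure from the paper:

```python
def pue(total_facility_kw, it_equipment_kw):
    """PUE = total facility power / IT equipment power; 1.0 is the ideal."""
    return total_facility_kw / it_equipment_kw

# A facility drawing 3 MW to power 1.5 MW of servers has a PUE of 2.0:
# every watt of computing costs another watt of cooling and distribution.
print(pue(3000, 1500))  # 2.0
```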

Analysis: Total datacenter costs must be considered: servers, infrastructure, running costs (energy, management, etc). In 2001, the sum of infrastructure and energy costs was equal to the cost of a 1U server. In 2004, the infrastructure cost alone was equal to the cost of the server. In 2008, just the energy cost was equal to the cost of a server.
PUE is a simple metric that analyzes datacenter efficiency in terms of how much overhead there is for a given set of servers in the datacenter. However, PUE neglects an analysis of the actual servers in the datacenter. Work/Watt is an important metric that cannot be neglected. It is likely that servers can be more easily upgraded than datacenter infrastructure.

Conclusion: PUE is a useful metric to analyze the overhead beyond the servers but is not the only necessary metric. Work/Watt is important. The power used by computer equipment to accomplish the work/computation is the central datacenter energy efficiency issue.

Wednesday, April 29, 2009

Ubuntu 9.04 beta out, now with fresh Eucalyptus

Cloud Economics

Problem: Economic allocation of cloud resources within a data center.

Kevin Lai from HP Labs came to discuss his work in this field. Kevin brought up some interesting points; I will summarize two: optimization across multiple layers, and using a bidding system to optimize provisioning of resources.

Optimization across multiple layers
Kevin argues that optimization cannot merely be done within each component: application, distributed storage/computing, virtualization/network/OS, physical/power. An overarching optimization must be done across these components, with each layer coordinating with a general optimizer. However, abstraction has allowed for reduced complexity and fostered innovation. Google is an excellent example: many applications and services for applications (BigTable, Chubby) are built upon GFS. This modular architecture was built up over time as needed. There is also a great deal of innovation occurring in the open source space (Hadoop, etc.). While overarching optimization may occur in the future, at this time it may stifle innovation by preventing changes to a given component.

Bidding System
There is an argument to be made that some sort of bidding system may help mitigate supply and demand issues between cloud providers and customers. However, some customers may want cost guarantees provided by a flat rate per use.
The most interesting aspect of Kevin's proposals is using a bidding system to allocate resources within the data center. Such a bidding system can be used to create a predictability model that trades off a bid for resources, QoS, and a guarantee (probability of completion). On average, this model can allow jobs to complete more work within a given time.
This bidding system can also be used to provision resources. Price inflation for a given resource is an indication that there is under provisioning.
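A toy proportional-share auction illustrates the provisioning signal (my own illustration, not Lai's actual mechanism): each job receives resources in proportion to its bid, and a rising effective price for a resource indicates under-provisioning.

```python
# Toy proportional-share auction over a single resource.
def allocate(bids, capacity):
    """Each job gets capacity in proportion to its bid."""
    total = sum(bids.values())
    return {job: capacity * bid / total for job, bid in bids.items()}

shares = allocate({"batch": 10, "web": 30}, capacity=100)
print(shares)  # {'batch': 25.0, 'web': 75.0}

# The implied price per unit (total bids / capacity) rises as demand grows
# against fixed capacity — the under-provisioning signal described above.
price = (10 + 30) / 100
print(price)  # 0.4 currency units per resource unit
```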

Monday, April 27, 2009


Problem: Large scientific computations require many computers in order to effectively process the data.

Solution: BOINC is a software system that makes it easy for scientists to create and operate public-resource computing projects. This allows researchers to distribute the computation to millions of regular computer users through a client-side platform that runs when the computer is idle. The paper describes features that went into BOINC to make it easy for scientists and attractive for users. Incentives for users are a nice interface and a count of how much they are contributing to the project (Top Alien Hunters).

Influence: I remember running SETI@home 24/7 in the background on my P3 600MHz overclocked to 800MHz.


Problem: Distributing a large file to many users can overload the resources of the hosting machine.

Solution: BitTorrent is a peer-to-peer protocol that lets peers download pieces of the file from various peers and upload other pieces to them. This redistributes the cost of upload to the downloaders. Some of the challenges include figuring out which peers have what parts of the file, and high churn rates. The rarest-first technique and the choking/unchoking algorithms allow BitTorrent to mitigate these challenges efficiently.
Rarest First: Replicating the rarest pieces as quickly as possible reduces the risk of them getting completely lost as current peers stop uploading. It also makes sure that pieces which are more common are left for later, so the likelihood that a peer which currently is offering upload will later not have anything of interest is reduced.
Choking/Unchoking: A good choking algorithm should utilize all available resources, provide reasonably consistent download rates for everyone, and be somewhat resistant to peers only downloading and not uploading. To avoid situations in which resources are wasted by rapidly choking and unchoking peers, BitTorrent peers recalculate who they want to choke once every ten seconds, and then leave the situation as is until the next ten second period is up. Ten seconds is a long enough period of time for TCP to ramp up new transfers to their full capacity.
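The rarest-first rule itself is simple to sketch. A minimal version (the peer names and bitfields are invented for illustration):

```python
from collections import Counter

def rarest_first(needed, peer_bitfields):
    """Pick the needed piece held by the fewest peers (rarest first).

    needed: set of piece indices we still need
    peer_bitfields: dict of peer -> set of piece indices that peer has
    Returns the chosen piece index, or None if no peer has anything we need.
    """
    availability = Counter()
    for pieces in peer_bitfields.values():
        for p in pieces & needed:
            availability[p] += 1
    if not availability:
        return None
    # min over (count, index) gives deterministic tie-breaking
    return min(availability, key=lambda p: (availability[p], p))

peers = {"A": {0, 1, 2}, "B": {1, 2}, "C": {2}}
print(rarest_first({0, 1, 2}, peers))  # piece 0 is rarest (only peer A has it)
```

Requesting piece 0 first means that if peer A leaves, the piece is not lost to the swarm.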

Analysis: If there are few leechers, BitTorrent works very well. However, if a substantial number of peers tried to download without uploading, there would be problems. Few users are very abusive in practice, which is a real-world demonstration of cooperation emerging in the iterated prisoner's dilemma.

Measuring and Evaluating Large-Scale CDNs

Problem: CDNs play a critical and central part of today’s Internet infrastructure. There is a need to measure the performance of large-scale commercial CDNs. Most major websites that use CDNs would be interested in the results. While CDNs have internal performance metrics they present to potential customers, this study is done from the outside, which may give websites more information on which to base their decisions.


Monday, April 20, 2009


Problem: There is a growing set of Internet-based services that are too big, or too important, to run at a single site. Examples include Web services for e-mail, video and image hosting, and social networking. This is the same reasoning behind Facebook’s need to scale out (see below).

Solution: WheelFS is a wide-area distributed storage system intended to help multi-site applications share data and gain fault tolerance. Different distributed applications might need different properties in a storage system: they might need to see the latest copy of some data, and be willing to pay a price in high delay, or they may want data to be stored durably, or have specific preferences for which site stores a document. Thus, in order to be a usable component in many different systems, a distributed storage system needs to expose a level of control to the surrounding application.
WheelFS allows applications to adjust the default semantics and policies with semantic cues, while maintaining the standard POSIX interface. WheelFS offers a small set of cues in four categories: placement, durability, consistency, and large reads.
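Cues are embedded directly in pathnames. The cue names below follow the style of the WheelFS paper, but the helper function itself is a hypothetical sketch:

```python
def with_cues(path, *cues):
    """Insert WheelFS-style semantic cues into a pathname.

    Cue names such as 'EventualConsistency' or 'HotSpot' follow the style
    of the WheelFS paper; this helper itself is a hypothetical sketch.
    """
    head, _, tail = path.rpartition("/")
    cue_str = "/".join("." + c for c in cues)
    return f"{head}/{cue_str}/{tail}"

p = with_cues("/wfs/cache/img.jpg", "EventualConsistency", "HotSpot")
print(p)  # /wfs/cache/.EventualConsistency/.HotSpot/img.jpg
```

Because the result is still an ordinary pathname, unmodified POSIX applications can open it; the cues only change the semantics WheelFS applies.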

Tradeoffs: Every file or directory object has a single “primary” WheelFS storage server that is responsible for maintaining the latest contents of that object. Because WheelFS maintains the POSIX interface, accessing a single file can result in communication with several servers, since each subdirectory in the path could be served by a different primary.
In addition, a disadvantage of cues is that they may break software that parses pathnames and assumes that a cue is a directory. Another is that links to pathnames that contain cues may trigger unintuitive behavior. However, the authors have not encountered examples of these problems.

Future Influence: The main insight from WheelFS is that different distributed applications need different properties in a storage system. For a distributed storage system to be a usable component upon which many different applications are built, it must allow the application to control various policies.
It is unclear whether a truly wide-area distributed storage system is the best approach. A middle ground may be found with distributed storage within each datacenter with updates pushed out to the other datacenters. This will create greater isolation between datacenters to prevent cascading failures.

Saturday, April 18, 2009

Scaling Out

Problem: As a website grows, it scales up its datacenter. However, it will eventually need to scale out to other datacenters for a number of reasons, including space, power, and disaster recovery. With only one datacenter, in the event of a disaster, Facebook could be unusable for extended periods of time. Also, building datacenters in different geographic regions can reduce latency. For websites such as Facebook, this is a great concern because a page is built from a variety of sub-services.

Solution: The first step was to build out the servers and the physical space in Virginia. FB then brought up the intra-datacenter network and a low latency inter-datacenter fiber channel link. With the network and hardware in place FB set up their standard 3 tier architecture: web server, memcache server, and MySQL database. The MySQL databases in Virginia run as slaves of the west coast databases, so FB spent a couple of weeks copying streams. After the hardware, network, and basic infrastructure was set up it was time to face the two main application level challenges: cache consistency and traffic routing.

Cache Consistency
FB’s caching model: when a user modifies a data object the infrastructure will write the new value in to a database and delete the old value from memcache (if it was present). The next time a user requests that data object they pull the result from the database and write it to memcache. Subsequent requests will pull the data from memcache until it expires out of the cache or is updated again.
Problem: Now let's say we delete the value from Virginia memcache tier at the time we update the master database in California. A subsequent read from the slave database in Virginia might see the old value instead of the new one because of replication lag. Then Virginia memcache would be updated with the old (incorrect) value and it would be “trapped” there until another delete. In the worst case the Virginia memcache tier would always be one “version” behind of the correct data.
Solution: FB made a small change to MySQL that allows them to tack on extra information in the replication stream that is updating the slave database. They used this feature to append all the data objects that are changing for a given query and then the slave database “sees” these objects and is responsible for deleting the value from cache after it performs the update to the database. The changes to MySQL are only 150 lines of code, so merging with a new version of MySQL is not difficult.
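A toy sketch of this invalidation scheme (the class, keys, and values are invented; the real system is MySQL replication plus memcache):

```python
# Hypothetical sketch of FB's fix: the replication stream carries the
# keys touched by each write, and the slave side invalidates memcache
# only AFTER applying the update locally.

class SlaveReplica:
    def __init__(self):
        self.db = {}        # stands in for the slave MySQL database
        self.memcache = {}  # stands in for the local memcache tier

    def apply_replication_event(self, key, value, invalidate_keys):
        # 1. Apply the write to the local (slave) database first.
        self.db[key] = value
        # 2. Only then delete the affected cache entries, so a racing
        #    read can no longer repopulate the cache with stale data.
        for k in invalidate_keys:
            self.memcache.pop(k, None)

    def read(self, key):
        if key not in self.memcache:           # cache miss:
            self.memcache[key] = self.db[key]  # fill from the slave DB
        return self.memcache[key]

slave = SlaveReplica()
slave.db["hometown:42"] = "Palo Alto"
slave.memcache["hometown:42"] = "Palo Alto"
slave.apply_replication_event("hometown:42", "Menlo Park", ["hometown:42"])
print(slave.read("hometown:42"))  # Menlo Park, not the stale cached value
```

Ordering is the whole point: deleting from the cache before the slave database is updated would reintroduce the "trapped stale value" race described above.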

Page Routing
Problem: Only FB’s master databases in California could accept write operations. This fact means FB needed to avoid serving pages that did database writes from Virginia because each one would have to cross the country to the master databases in California. Fortunately, FB’s most frequently accessed pages (home page, profiles, photo pages) don't do any writes under normal operation. When a user makes a request for a page, FB must decide if it is “safe” to send to Virginia or if it must be routed to California.
Solution: One of the first servers a user request to Facebook hits is a load balancer. This load balancer has the capability to make routing decisions based on the user request. This feature meant it was easy to tell the load balancer about “safe” pages and it could decide whether to send the request to Virginia or California based on the page name and the user's location.
Problem with Solution: Going to the edit profile.php page to change your hometown isn't marked as safe, so it gets routed to California and you make the change. Then you go to view your profile and, since it is a safe page, you get sent to Virginia. Because of the replication lag, you might not see the change you just made, which can lead to confusion and double posting. FB resolves this concern by setting a cookie in your browser with the current time whenever you write something to the databases. The load balancer looks for that cookie and, if it notices that you wrote something within 20 seconds, will unconditionally send you to California.
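A minimal sketch of this routing logic, with invented page names and the 20-second window from the post:

```python
import time

SAFE_PAGES = {"home.php", "profile.php", "photos.php"}  # illustrative names
WRITE_WINDOW = 20  # seconds, per the post

def route(page, last_write_ts, now=None):
    """Decide which datacenter serves a request.

    last_write_ts: timestamp from the user's write cookie (None if absent).
    Unsafe pages, and any request within 20s of a write, go to the master
    coast; other pages may be served by the replica.
    """
    now = time.time() if now is None else now
    if page not in SAFE_PAGES:
        return "california"  # may write; must reach the master databases
    if last_write_ts is not None and now - last_write_ts < WRITE_WINDOW:
        return "california"  # recent write: avoid replication lag
    return "virginia"

print(route("editprofile.php", None))        # california
print(route("profile.php", 1000, now=1010))  # california (10s after a write)
print(route("profile.php", 1000, now=1030))  # virginia
```

The 20-second window is a pragmatic bound on replication lag rather than a correctness guarantee; the load balancer simply buys enough time for the slave to catch up.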

Trade offs/Remaining Challenges: The main scaling challenge with this architecture is that all write operations must happen in one location. FB feels this approach does scale out to other datacenters, with the obvious limitation that there is a single master to accept writes. FB can just keep buying servers in that datacenter to keep up with the write traffic. The downside is the extra latency paid by users who are “far” from the datacenter on their write operations. In addition, this restriction requiring all writes to happen in one location results in a number of hacks to mitigate the problems, even in normal operation. Writes should be able to occur at any location with the updates being pushed out to other datacenters as they currently are from California.
If the west coast datacenter goes down FB is in very bad shape. FB is working on a disaster recovery procedure where they “upgrade” Virginia to the master but it's a tricky process because of all the one-off services FB has running in California that need to be properly replicated.

Conclusion: FB’s scaling out seems to be effective for the purpose of meeting the requests from their rapidly growing user base. The extra latency paid by users who are “far” from the datacenter on their write operations is an acceptable tradeoff for now, considering the rapid growth of their user base. However, the potential failure of the California datacenter is a frightening scenario. FB is a young company and has never been tested under such a scenario.

Future Influence: FB has given a good roadmap for scaling out quickly with a solution that works in serving user requests. Keeping all writes at a single datacenter is tolerable for now because writes are a smaller fraction of traffic and can still be scaled up within that datacenter. However, in the future, FB will need to abandon this anchor in order to protect itself against the failure of the California datacenter and to bring writes closer to where they are likely to be used, minimizing latency and the need for the hacks that the current restriction creates.

Wednesday, April 15, 2009

Open Clouds

Open Cloud Manifesto, Cloud: commodity or proprietary, Portable Cloud Computing, AppDrop all deal with the same problem.

Problem: The problem is that cloud providers are offering different clouds that differ in basic interfaces and level of access to underlying components. For example, Amazon's services are bottom up (here's a CPU and storage, now install your own software and use the components you need) and Google's is top down (write some code to these APIs and we'll scale it for you). One’s application code becomes dependent on a particular vendor.

Once an application becomes sufficiently complex, moving it from one cloud to another becomes difficult, placing folks at the mercy of their cloud provider. Cloud providers have every incentive to develop proprietary APIs in order to lock folks into their services. Also, moving lots of data between providers may make this difficult or expensive to do in practice (data inertia).

This is in contrast to most web applications today. The LAMP stack allows one to build vendor-neutral applications from free parts and select from a competitive, commodity hosting market.

Solution: The Open Cloud Manifesto goes over several challenges to cloud adoption, such as portability and interoperability, that must be addressed through open collaboration. It suggests that this is the right time for industry participants to work to ensure that innovation is guided by the principles of openness, so that customers will feel confident in building their applications to run on cloud providers.

AppDrop is a container that allows applications developed with the Google App Engine SDK to run entirely on Amazon's EC2 infrastructure without modification. However, this simple portability comes at the cost of scalability. The App Engine SDK doesn't use BigTable for its datastore, instead relying on a simple flat file on a single server. This means there is no scalability, but for apps with limited resource needs, something as simple as AppDrop would work (of course, scalability is a primary reason for using a cloud).

Hadoop provides a lot of the building blocks for building cloud services. Its current focus is on batch computing, but several of its components are also key to cloud hosting. HDFS provides a scalable, distributed filesystem. HBase or CouchDB provide a database comparable to Amazon’s SimpleDB and Google’s Datastore API.

Future Influence: The open-source community will likely play a significant role in interoperability and migration between clouds. Also, in order to take market share from competitors, cloud providers may build APIs that allow application builders to more easily move. AppDrop is a basic start. Amazon may consider building on this to more effectively tie in the Google App Engine API with Amazon's underlying services in order to provide the same scalability.

Wednesday, April 1, 2009


Problem: There is a need for languages that allow you to write reliable, scalable, production distributed systems.

Solution: What you need is true concurrency. You need lightweight processes, no shared memory, asynchronous message passing and mechanisms to change code on the fly so that programs can evolve and change as they run in non-stop systems.

Erlang’s main idea is a more restrictive programming model that provides good primitives for writing distributed applications while limiting features that can cause problems.

Concurrency - Erlang has extremely lightweight processes whose memory requirements can vary dynamically. Processes have no shared memory and communicate by asynchronous message passing. Erlang supports applications with very large numbers of concurrent processes. No requirements for concurrency are placed on the host operating system.

Distribution - Erlang is designed to be run in a distributed environment. An Erlang virtual machine is called an Erlang node. A distributed Erlang system is a network of Erlang nodes. An Erlang node can create parallel processes running on other nodes, which perhaps use other operating systems. Processes residing on different nodes communicate in exactly the same way as processes residing on the same node.

Robustness - Processes in a distributed system can be configured to fail-over to other nodes in case of failures.

Hot code upgrade - Many systems cannot be stopped for software maintenance. Erlang allows program code to be changed in a running system. So it is possible to install bug fixes and upgrades in a running system without disturbing its operation.

External interfaces - Erlang processes communicate with the outside world using the same message passing mechanism as used between Erlang processes.
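Erlang's core model, described above, is isolated processes with private state exchanging asynchronous messages. It can be loosely approximated in Python with threads and queues; this is only a sketch of the actor style, not of Erlang's lightweight processes:

```python
import queue
import threading

def counter_actor(inbox, replies):
    """An actor: private state, reacting only to messages from its inbox."""
    count = 0
    while True:
        msg = inbox.get()
        if msg == "incr":
            count += 1
        elif msg == "get":
            replies.put(count)
        elif msg == "stop":
            break

inbox, replies = queue.Queue(), queue.Queue()
t = threading.Thread(target=counter_actor, args=(inbox, replies))
t.start()
for _ in range(3):
    inbox.put("incr")   # asynchronous send; no shared state is touched
inbox.put("get")
n = replies.get()
print(n)                # 3
inbox.put("stop")
t.join()
```

The state (`count`) is visible to no other thread; all interaction goes through messages, which is what makes the model safe to distribute across nodes.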

Erlang’s evolution from a telecom environment means that it was developed for a high degree of reliability. Telecoms also require upgrades with no downtime and handling/isolation of failures.

Criticism: “I think the problem with Erlang is that it's more than 20 years old and for most of this time haven't been exposed enough to developer community at large. It's like raising a child in a cellar for all its childhood and don't let it interact and learn from his/her peers.” This quote from another commenter sums up my central criticism of Erlang.

Tradeoffs: The main trade-off is that one gains the inherent concurrency and parallelization of processes while having to deal with an unwieldy 20 year old language. One should be able to use another programming language with a model of actors communicating by asynchronous messages and prohibiting shared state between actors.

Future Influence: Amazon, Facebook and other companies are now using Erlang for web systems. Erlang is being used because companies select the technology that is best suited to survive while evolving. I expect that other languages will be developed over time that will give us the best of modern languages and merely prohibit features that are not well handled in distributed systems, such as shared state.

Tuesday, March 31, 2009

Monday, March 30, 2009


Problem: Dynamic instrumentation of production systems to understand performance problems.

Solution: DTrace provides on the order of 30,000 instrumentation points in the kernel alone. DTrace provides an API to instrument both kernel and user level behavior, letting the same tracing program access both. The C-like control language for tracing scripts is a good contribution. DTrace also provides easy access into kernel variables while keeping tracing code "safe".

DTrace’s primary difference from previous work is that it combines existing techniques into a single framework.

Future Influence: I would hope that other system builders develop their own tracing frameworks that contain key contributions from DTrace. This would be especially useful in large scale Linux deployments.


Problem: Current network diagnostic tools only focus on one particular protocol layer, and the insights they provide on the application cannot be shared between the user, service, and network operators.

Solution: The authors propose X-Trace, a cross-layer, cross-application tracing framework designed to reconstruct the user’s task tree.

While these tools are undoubtedly useful, they are also typically unable to diagnose subtle interactions between protocols or provide a comprehensive view of the system’s behavior.

X-Trace has significant limitations: implementing it requires modifications to clients, servers, and network devices, and when it is only partially deployed, its ability to trace those parts of the network is impaired, sometimes entirely.

The authors analyzed X-Trace in three deployed scenarios: DNS resolution, a three-tiered photo-hosting website, and a service accessed through an overlay network.

Future Impact: Looking beyond just one protocol layer is important for the direction systems are moving in: the multilayered, component approach. The need for modifications in many places, and for full deployment, means that one must either build one's own system or be locked into a single provider.

Wednesday, March 18, 2009

Hunting for Problems with Artemis

Problem: There is a need for a data collection system for monitoring and analyzing large distributed systems.

Solution: Artemis has the same goal as Chukwa. While Chukwa is built on Hadoop, Artemis (from Microsoft) is focused on analyzing Dryad (Microsoft) based distributed applications. Artemis seems to be fairly well along in terms of the log collection, storage, visualization, and analysis. Artemis has log collection from many different sources within Dryad.

Conclusion: The growth of distributed systems has resulted in a need for new tools to analyze large clusters of PCs, networked together, and running large distributed jobs. Systems such as Chukwa and Artemis are the first iterations of such tools. Artemis shows that these tools need to have access to data collection not only from each machine but from within the distributed processing jobs. This would be a good evolution for the Chukwa developers to follow; providing data collection from within MapReduce jobs.

Tuesday, March 17, 2009

Chukwa: A large-scale monitoring system

Problem: There is a need for a data collection system for monitoring and analyzing large distributed systems.

Solution: The authors developed Chukwa on top of Hadoop as an implementation of such a data collection system. The goal was to be able to monitor Hadoop clusters of 2000 nodes, outputting 5 to 6 MB of data per second, and to have collected data available for processing within ten minutes.

At the heart of any data collection system is a pipeline to pump data from where it is generated to where it is stored. Unfortunately, HDFS is not designed for the sort of workloads associated with monitoring. HDFS aims to handle large files and high write rates from comparatively small numbers of writers. It is not designed for thousands of concurrent low-rate writers, and millions of small files.

Much of the Chukwa design was driven by the need to reconcile its many sporadic data sources with HDFS’s performance characteristics and semantics. Adaptors are the data sources. Rather than have each adaptor write directly to HDFS, data is sent across the network to a collector process that does the HDFS writes. Each collector receives data from several hundred hosts, and writes all this data to a single sink file. Collectors thus drastically reduce the number of HDFS files generated by Chukwa, from one per machine or adaptor per unit time, to a handful per cluster. The decision to put collectors between data sources and the data store has other benefits. Collectors hide the details of the HDFS file system in use, such as its Hadoop version, from the adaptors.
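A toy sketch of the collector idea (the batch size and record format are invented; the real sink is an HDFS file):

```python
# Hypothetical sketch: many adaptors send small records to one collector,
# which batches them into a single sink file, turning thousands of tiny
# writers into a handful of large, append-friendly files.

class Collector:
    def __init__(self, sink):
        self.sink = sink           # stands in for one HDFS sink file
        self.buffer = []

    def receive(self, host, record):
        self.buffer.append(f"{host}\t{record}")
        if len(self.buffer) >= 4:  # flush in large batches, not per record
            self.flush()

    def flush(self):
        self.sink.extend(self.buffer)
        self.buffer.clear()

sink = []
c = Collector(sink)
for host in ("node1", "node2", "node3", "node4", "node5"):
    c.receive(host, "metric=cpu load=0.7")
c.flush()
print(len(sink))  # 5 records, written in just two batched flushes
```

Batching is what reconciles thousands of low-rate writers with HDFS's preference for a small number of large files.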


Problem: Many large websites have thousands to hundreds of thousands of machines across many geographical dispersed datacenters. These machines need to send log data to a central repository for analysis.

Solution: Scribe is Facebook's scalable logging system. Scribe is a server for aggregating streaming log data. It is designed to scale to a very large number of nodes and be robust to network and node failures. There is a scribe server running on every node in the system, configured to aggregate messages and send them to a central scribe server (or servers) in larger groups. If the central scribe server isn't available the local scribe server writes the messages to a file on local disk and sends them when the central server recovers. The central scribe server(s) can write the messages to the files that are their final destination.
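A toy sketch of this store-and-forward behavior (the classes are invented; the real spill target is a file on local disk):

```python
# Hypothetical sketch of Scribe's failure handling: a local server
# forwards to the central server, spilling to local disk (a list here)
# when the central server is down, and draining the spill on recovery.

class CentralScribe:
    def __init__(self):
        self.up = True
        self.store = []

class LocalScribe:
    def __init__(self, central):
        self.central = central
        self.spill = []  # stands in for the local disk file

    def log(self, category, message):
        entry = (category, message)
        if self.central.up:
            self.drain()                       # flush any backlog first
            self.central.store.append(entry)
        else:
            self.spill.append(entry)           # central down: buffer locally

    def drain(self):
        while self.spill:
            self.central.store.append(self.spill.pop(0))

central = CentralScribe()
local = LocalScribe(central)
local.log("pageviews", "u1 /home")
central.up = False
local.log("pageviews", "u2 /profile")   # buffered on local disk
central.up = True
local.log("pageviews", "u3 /photos")    # drains the buffer, then sends
print([m for _, m in central.store])    # all three messages, in order
```

Draining before appending preserves message order across the outage, which matters for downstream log analysis.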

The next step is to do something useful with the data from the logging system. Identifying log messages with a category effectively presorts the data, allowing it to be logged to separate files based on that category.

Hive is Facebook's data warehouse, fed by the output of Scribe. Hive makes it easy for business analysts to ask ad hoc questions of terabytes' worth of logfile data by abstracting MapReduce into a SQL-like dialect.

Sunday, March 15, 2009


The Cloud Architectures whitepaper covers the benefits of using a cloud provider and design principles for building a cloud application.

Business benefits to building applications using Cloud Architectures:
-Almost zero upfront infrastructure investment
-Just-in-time Infrastructure
-Efficient resource utilization
-Usage-based costing: Utility-style pricing allows billing the customer only for the infrastructure that has been used. The customer is not liable for the entire infrastructure that may be in place.
-Potential for shrinking the processing time: Parallelization is one of the great ways to speed up processing.

Applications that could utilize the power of Cloud Architectures: processing pipelines, batch processing systems, websites.

This paper goes over some design principles for building a scalable application for a cloud: use scalable components that are loosely coupled, take advantage of parallelization, requisition and relinquish resources on-demand, and use resilient designs because hardware will fail. The paper uses Amazon's implementation of GrepTheWeb to highlight design lessons and principles.

Google Apps Security

Safekeeping of data for tens of millions of users can be a serious challenge for large, complex systems. The argument for using a large cloud provider is that it has more resources to ensuring security and has a broad range of experience in terms of security, especially compared to small or medium organizations.

This paper looks at Google’s security strategy and is an overview of what Google considers the main focus areas and design principles. The paper tries to convince prospective customers that its cloud infrastructure is safe.

Google focuses on several aspects of security that are critical:
• Organizational and Operational Security – Policies and procedures to ensure security at every phase of design, deployment and ongoing operations.
• Data Security – Ensuring customer data is stored in secure facilities, on secure servers, and within secure applications.
• Threat Evasion – Protecting users and their information from malicious attacks and would-be hackers.
• Safe Access – Ensuring that only authorized users can access data, and the access channel is secure.
• Data Privacy – Ensuring that confidential information is kept private and confidential.

Wednesday, March 4, 2009

Dryad and DryadLINQ

Problem: Companies like Microsoft, Google, Yahoo, etc have large data sets that can only be effectively processed on large clusters of PCs. The queries being applied to these data sets are growing more complex and a generalized version of MapReduce is needed.

Solution: Microsoft created its own language for a more generalized version of MapReduce, along the lines of Yahoo's Pig Latin.

Future Influence: It is unclear if Dryad and anything built upon it will gain acceptance outside of Microsoft. However, the need for a simplified method of running complex queries on large data sets on clusters of PCs will remain and there will be many efforts to develop solutions.

MapReduce: Simplified Data Processing on Large Clusters

Problem: Companies like Google have large data sets that can only be effectively processed on large clusters of PCs. The work must be broken into parts that can be processed in a reasonable time and the result must then be reassembled.

Solution: Google developed an abstraction called MapReduce that allows them to express the simple computations they were trying to perform but hides the details of parallelization, fault-tolerance, data distribution and load balancing.

The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines:
1. The MapReduce library in the user program first splits the input files into M pieces and then starts up many copies of the program on a cluster of machines.
2. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.
3. A worker who is assigned a map task reads the contents of the corresponding input split. It passes the input data to the user-defined Map function. The intermediate result produced by the Map function is buffered in memory.
4. Periodically, the buffered pairs are written to local disk and are passed back to the master, who is responsible for forwarding these locations to the reduce workers.
5. When a reduce worker is notified by the master about these locations, it reads the buffered data from the local disks of the map workers. The reduce worker reads all intermediate data and sorts it. The sorting is needed because typically many different keys map to the same reduce task.
6. The reduce worker iterates over the sorted intermediate data and passes the data to the user’s Reduce function. The output of the Reduce function is appended to a final output file.
7. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.
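The steps above can be condensed into a single-process sketch (split, map, partition, sort, reduce), with the distribution across machines left out:

```python
from collections import defaultdict

# Minimal single-process sketch of the MapReduce flow described above.
# Real MapReduce runs the map and reduce tasks on many machines.

def mapreduce(inputs, map_fn, reduce_fn, R=2):
    # Map phase: each "split" produces intermediate (key, value) pairs,
    # partitioned into R buckets (one per reduce task) by hashing the key.
    partitions = [defaultdict(list) for _ in range(R)]
    for split in inputs:
        for key, value in map_fn(split):
            partitions[hash(key) % R][key].append(value)
    # Reduce phase: each partition is processed key by key, in sorted order,
    # mirroring the sort/group step the reduce workers perform.
    output = {}
    for part in partitions:
        for key in sorted(part):
            output[key] = reduce_fn(key, part[key])
    return output

# The classic word-count example:
def wc_map(line):
    for word in line.split():
        yield word, 1

def wc_reduce(word, counts):
    return sum(counts)

result = mapreduce(["the cat", "the dog the cat"], wc_map, wc_reduce)
print(sorted(result.items()))  # [('cat', 2), ('dog', 1), ('the', 3)]
```

The sort before reducing is the same step the paper motivates: many different keys land in the same reduce task, so grouping by sorted key lets a single pass feed each key's values to the Reduce function.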

Locality of computation and data is a common theme repeated by many industry researchers. Network bandwidth is a relatively scarce resource in our computing environment. The MapReduce master takes the location information of the input files into account and attempts to schedule a map task on a machine that contains a replica of the corresponding input data. Failing that, it attempts to schedule a map task near a replica of that task’s input data. When running large MapReduce operations on a significant fraction of the workers in a cluster, most input data is read locally and consumes no network bandwidth.

The authors also describe many operational insights that might not be initially considered. One of the common causes that lengthens the total time taken for a MapReduce operation is a “straggler”: a machine that takes an unusually long time to complete one of the last few map or reduce tasks in the computation. When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks. The task is marked as completed whenever either the primary or the backup execution completes. This significantly reduces the time to complete large MapReduce operations at a cost of only a few percent to computational resources. For example, a sort program takes 44% longer to complete when the backup task mechanism is disabled.
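The backup-task idea can be sketched as two racing copies of the same task, taking whichever finishes first (the delays here are invented to simulate a straggler):

```python
import concurrent.futures
import time

# Sketch of speculative (backup) execution: near the end of a job, run
# the remaining task twice and take whichever copy completes first.

def task(delay, label):
    time.sleep(delay)
    return label

with concurrent.futures.ThreadPoolExecutor() as pool:
    primary = pool.submit(task, 0.5, "primary")   # a simulated straggler
    backup = pool.submit(task, 0.05, "backup")
    done, _ = concurrent.futures.wait(
        [primary, backup], return_when=concurrent.futures.FIRST_COMPLETED
    )
    winner = done.pop().result()
print(winner)  # the backup finishes first, so the job is not held up
```

The cost is a small amount of duplicated work; the benefit, as the sort example above shows, can be a large reduction in total completion time.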
Skipping Bad Records: sometimes it is acceptable to ignore a few records, for example when doing statistical analysis on a large data set.
To help facilitate debugging, profiling, and small-scale testing, Google has developed an alternative implementation of the MapReduce library that sequentially executes all of the work for a MapReduce operation on the local machine.

MapReduce has been so successful because it makes it possible to write a simple program and run it efficiently on a thousand machines in the course of half an hour, greatly speeding up the development and prototyping cycle. Furthermore, it allows programmers who have no experience with distributed or parallel systems to exploit large amounts of resources easily. For example, the size of one phase of the computation dropped from approximately 3800 lines of C++ code to approximately 700 lines when expressed using MapReduce. Also, most of the problems caused by machine failures, slow machines, and networking are dealt with automatically by the MapReduce library without operator intervention.

Future Influence: MapReduce has been highly influential and followed by many other companies with a need to process large data sets in a parallel way on large clusters of commodity PCs. The simple Map-then-Reduce model is being generalized for other, more complex parallel data processing (Pig Latin from Yahoo, Dryad from Microsoft). This continuing research will help simplify large scale parallel data processing in the years to come by breaking up the processing into chunks that can be effectively parallelized on large clusters of PCs.

Pig Latin: A Not-So-Foreign Language for Data Processing

Pig Latin is a programming language developed at Yahoo Research that allows parallel-processing functions to be written in a functional, object/variable programming style familiar to most programmers. Typical Pig programs have 1/20 the lines of code of common Hadoop/MapReduce code and 1/16 the development time, with only a 1.5x performance hit. Such a programming language is useful for rolling out new applications and features quickly. If a feature becomes popular, it can be optimized. Pig is still young and optimizations will likely occur.
We had a good talk from Christopher Olston at Yahoo Research on this subject. The most valuable lessons from this talk and the other industry speakers have been the lessons learned from working on large scale implementations. One important lesson is that such projects need to be broken up into layers that perform a simple function well. Another important lesson comes from the types of data sets involved. While much work centers around data processing on large data sets, a great deal of processing occurs on a combination of large and small data sets. This has significant implications for design decisions of the distributed storage layer (replication and computation locality).

Wednesday, February 25, 2009

Dynamo: Amazon’s Highly Available Key-value Store

Problem: It is difficult to create redundancy and parallelism with relational databases, so they become a single point of failure. Also, as a relational database grows, it becomes a bottleneck for the entire system. The very survival of Amazon's business depends on common, bulletproof, flexible, and scalable software systems.

Solution: Unlike a relational database, Dynamo is a distributed key-value storage system. Any node in the system can be issued a put or get request for any key. Dynamo is an eventually consistent storage system: when one node updates an object, the change propagates to the other replicas asynchronously.

Physical nodes are treated as identical and organized into a ring (partitioned by consistent hashing, in the style of Chord). The partitioning mechanism automatically scales as nodes enter and leave the system. Every object is asynchronously replicated to N nodes. Because updates propagate asynchronously, multiple copies of an object may temporarily exist in the system with slightly different states. These discrepancies are reconciled after a period of time, ensuring eventual consistency.
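A minimal sketch of the ring-based partitioning: hash both nodes and keys onto a ring, then walk clockwise from the key to collect the N replica nodes (its "preference list"). The hash function, node names, and lack of virtual nodes here are simplifying assumptions, not Dynamo's actual parameters:

```python
import hashlib
from bisect import bisect_right

def ring_position(name):
    # place keys and nodes on a 2^32 ring via a stable hash
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % (2 ** 32)

class Ring:
    def __init__(self, nodes, n_replicas=3):
        self.n = n_replicas
        self.ring = sorted((ring_position(node), node) for node in nodes)

    def preference_list(self, key):
        # walk clockwise from the key's position, collecting N distinct nodes
        positions = [p for p, _ in self.ring]
        start = bisect_right(positions, ring_position(key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1]
                for i in range(min(self.n, len(self.ring)))]

ring = Ring(["node-a", "node-b", "node-c", "node-d"])
print(ring.preference_list("shopping-cart:42"))  # three replica nodes for this key
```

Because only the neighbors of a joining or leaving node are affected, this is what lets the partitioning "automatically scale" as membership changes.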

Dynamo can be tuned using just a handful of parameters to achieve different technical goals that in turn support different business requirements. Dynamo is a storage service in a box, driven by an SLA. Different applications at Amazon use different configurations of Dynamo depending on their tolerance for delays or data discrepancies. Each data object is replicated across multiple nodes, with reconciliation based on version information (timestamps or vector clocks).
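The handful of tuning parameters are essentially (N, R, W): N replicas per object, R replicas that must answer a read, and W that must acknowledge a write. The sketch below shows the quorum condition that governs the tradeoff (the example configurations are illustrative, not Amazon's production settings):

```python
def quorum_overlap(n, r, w):
    # with R + W > N, every read set intersects every successful write set,
    # so a read is guaranteed to see at least one up-to-date replica
    return r + w > n

print(quorum_overlap(3, 2, 2))  # True: read and write quorums must overlap
print(quorum_overlap(3, 1, 1))  # False: fast, but reads may return stale data
```

Lowering R or W trades consistency for latency, which is exactly how different applications get different SLA behavior from the same system.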

Future Influence: Dynamo is a large scale implementation of a distributed storage system. The experiences from all of the companies that have built such systems out of necessity (Google, Amazon, Yahoo, etc) will prove valuable for the development of future systems.
I liked the related work section.

Tuesday, February 10, 2009



GFS: The Google File System

Problem: Google required a scalable distributed file system for large distributed data-intensive applications. This file system had to provide fault tolerance while running on inexpensive commodity hardware, and deliver high aggregate performance to a large number of clients.

Solution/Analysis: GFS presents an interface that gives users access to basic file commands, including open, create, read, write, and close. The team also included two specialized commands: append and snapshot. Append allows clients to add information to an existing file without overwriting previously written data. Snapshot creates a quick, low-cost copy of a file or directory tree.

Google organized the GFS into clusters of computers, built from many inexpensive commodity components that often fail. Within GFS clusters there are three kinds of entities: clients, master servers and chunkservers. A "client" refers to any entity that makes a file request. Requests can range from retrieving and manipulating existing files to creating new files on the system. Clients can be other computers or computer applications.

Master server: The master server acts as the coordinator for the cluster. The master's duties include maintaining an operation log, which keeps track of the activities of the master's cluster. The operation log helps keep service interruptions to a minimum -- if the master server crashes, a replacement server that has monitored the operation log can take its place. The master server also keeps track of metadata, which is the information that describes chunks. The metadata tells the master server to which files the chunks belong and where they fit within the overall file. Upon startup, the master polls all the chunkservers in its cluster. The chunkservers respond by telling the master server the contents of their inventories. From that moment on, the master server keeps track of the location of chunks within the cluster.
There's only one active master server per cluster at any one time (though each cluster has multiple copies of the master server in case of a hardware failure). GFS avoids a bottleneck by minimizing the messages the master server sends and receives. The master server doesn't actually handle file data at all, which is up to the chunkservers.

Chunkserver: GFS files tend to be very large, usually in the multi-gigabyte (GB) range. GFS breaks files up into manageable chunks of 64 megabytes (MB) each. Every chunk receives a unique 64-bit identification number called a chunk handle. Chunkservers are responsible for storing the 64-MB file chunks. Chunkservers send requested chunks directly to the client. GFS copies every chunk multiple times and stores it on different chunkservers. Each copy is called a replica. By default, the GFS makes three replicas per chunk, but users can change the setting and make more or fewer replicas if desired. I wonder if GFS has been updated with some policy for dynamic replication based on need.
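The fixed 64 MB chunk size makes it cheap for a client to translate a byte offset into a chunk index before ever contacting the master. A sketch of that arithmetic (the chunk size is from the paper; the helper names are mine):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, per the paper

def chunk_index(byte_offset):
    # which chunk of the file holds this byte offset
    return byte_offset // CHUNK_SIZE

def chunk_span(byte_offset, length):
    # the range of chunk indices a read/write touches; requests that
    # cross a boundary are split into one request per chunk
    first = chunk_index(byte_offset)
    last = chunk_index(byte_offset + length - 1)
    return list(range(first, last + 1))

# a 10 MB read starting at 60 MB crosses the first chunk boundary
print(chunk_span(60 * 1024 * 1024, 10 * 1024 * 1024))  # [0, 1]
```

Because the mapping is a pure function of the offset, clients can cache chunk locations and bypass the master for repeated accesses.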

File Requests: GFS separates replicas into two categories: primary replicas and secondary replicas. (The primary holds a lease from the master and imposes a serial order on mutations to its chunk.) Beyond that ordering role, there does not seem to be a fundamental need for this differentiation.

Read request: The client sends a read request to the master server to find out where the client can find a particular file on the system. The server responds with the location for the primary replica of the respective chunk. This is reminiscent of how the old Napster worked: go to a directory (master server) which tells you where to get the data. Modern distributed file systems can give you the “nearest” copy as well.

Write requests: The client sends a request to the master server, which replies with the location of the primary and secondary replicas. If a client creates a write request that affects multiple chunks of a particularly large file, the GFS breaks the overall write request up into an individual request for each chunk.

Stale Chunks: Whenever the master grants a new lease on a chunk, it increments the chunk's version number. This is how the master server can tell the difference between up-to-date chunks and out-of-date copies, known as stale replicas.

Reliable system built on unreliable components: The GFS components give system updates through electronic messages called heartbeats and handshakes. These handshakes allow the master server to stay current with each chunkserver's status. The GFS developers built functions into the system to compensate for the inherent unreliability of individual components. Those functions include master and chunk replication, a streamlined recovery process, rebalancing, stale replica detection, garbage removal and checksumming.

Bandwidth focus: Google developers are greatly concerned with bandwidth because Google applications manipulate very large files and many users are making requests. By lagging behind the leading edge of hardware technology, Google can purchase equipment and components at bargain prices. The structure of the GFS is such that it's easy to add more machines at any time and rebalance the workload.

Future viability: GFS seems to provide Google the necessary file system for an application such as search. However, this paper was written five years ago and already Google has expanded into providing many different applications. Google’s assumptions about file use will not hold under many new applications which will require frequent updates to data which will need to be made available for access. This paper does provide a basic principle that will hold: data can only be made reliably available to many clients through replication across many servers, each of which can independently serve the data to the client.

Wednesday, February 4, 2009

Ebay: Scale!

Problem: eBay's problem is the same one shared by any large site on the internet today: how to handle many users accessing large amounts of services and data, with the data being updated millions (or billions) of times per day.

Solution: eBay's solution relies upon five principles that eBay has (almost) fully embraced: partitioning, asynchrony, automation of configuration, acceptance of failure, and (almost) embracing inconsistency. My main criticism of the eBay work rests upon the requirement of immediate consistency for bids and purchases. I wonder how seriously they have considered near-immediate consistency, resolved by the application with a timestamp, where "immediate consistency" is required at present.

I am heartened by eBay's use of 3rd-party cloud services. I am curious about the present implementation details. I noticed the mention of event-based transactions in "eBay's Scaling Odyssey", but there are scant details. It seems that eBay has separated the application from the storage in the delivery of their service, but I wonder how fundamental their separation is in implementation.

Finally, I am curious as to how open eBay would be to a completely distributed storage layer.

Future influence: I think that Ebay’s work will be highly influential in the coming decade. Ebay seems willing to embrace a separation between the application and data in order to provide a scalable dynamic service to many millions of users per day. Ebay’s present work is a stepping stone to a truly scalable model of services and sub-services that will be adopted by any similar large deployment out of necessity.

I would like some comment from the class on my thoughts here. We should embrace the blog structure on the paper comments for sharing ideas.

A Policy-Aware Switching Layer for Data Centers

Problem: Managing an enterprise or data center network is difficult because middleboxes must be placed on the path of the traffic they must filter. This is not a problem for companies that have little need for firewalls and the like, but in a system with many firewalls, middleboxes are a problem. Financial institutions have multiple layers with many middleboxes. Middleboxes are placed at a chokepoint in the network: traffic must pass a particular point to reach the firewalls and load balancers.

Solution: The solution proposed by the authors takes the middleboxes off the physical network path and instead routes traffic through them explicitly. With implicit routing, a packet traverses a firewall only because the firewall happens to sit on the packet's path; explicit routing through the middlebox is necessary to guarantee that the data actually goes through it.

Tradeoffs: The solution uses a little extra bandwidth, but explicit routing ensures correctness. Other solutions to the bottleneck of middleboxes involve throwing resources at the problem: more middleboxes at the bottlenecks.

Future Influence: The most important contribution of this paper is that it makes the case for higher-level policy control in a data center, specifically the use of indirection within a datacenter. Many data center managers are happy with what they have, but convincing them is less of a hurdle than introducing a new architecture in the internet. Such a new proposal is possible in a datacenter because there are many fresh clean slates (including upgrades, which act as a clean slate).

Tuesday, February 3, 2009

DCell: A Scalable and Fault-Tolerant Network Structure for Data Centers

This paper attempts to solve the same problem as A Scalable, Commodity Data Center Network Architecture: the high cost of high-end non-commodity switches and the lack of full bandwidth across the entire datacenter. DCell takes the network architecture into the servers. DCell also requires that each server have multiple GigE connections, but this is achieved with a readily available quad-port Ethernet adapter. In data centers with tens to hundreds of thousands of servers, it would be desirable to cost-reduce with a dual-port Ethernet adapter. The modifications to servers would not likely be a drawback, since such a setup would be used in a new deployment (the interconnection costs have already been sunk in existing deployments).

The authors focus on three design goals for data center networks (DCN). First, the network infrastructure must be scalable to a large number of servers and allow for incremental expansion. Second, DCN must be fault tolerant against various types of server failures, link outages, or server-rack failures. Third, DCN must be able to provide high network capacity.

The main benefits of the DCell approach are its great degree of fault tolerance and its ability to scale to millions of servers with only a level-3 DCell and truly commodity 8-port switches.
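The scaling claim follows from DCell's recursive construction: a DCell_0 is n servers on one n-port switch, and a DCell_k is built from t_{k-1} + 1 copies of DCell_{k-1}, giving t_k = t_{k-1} × (t_{k-1} + 1). A quick check of the growth (the recurrence is from the paper; the arithmetic below is mine):

```python
def dcell_servers(n_ports, level):
    # t_0 = n; t_k = t_{k-1} * (t_{k-1} + 1)
    t = n_ports
    for _ in range(level):
        t = t * (t + 1)
    return t

print(dcell_servers(6, 3))  # 3,263,442 servers from 6-port switches
print(dcell_servers(8, 3))  # over 27 million from 8-port switches
```

The doubly-exponential growth is why even tiny commodity switches suffice at level 3.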

The main drawback of this approach is the complexity of the wiring interconnections for even a single DCell structure, especially when repeated many times and extended to a level-2 or higher DCell. This provides an opportunity for modular data centers pre-configured in DCells by the manufacturer. These modular DCells would have clear instructions (possibly color-coded) for interconnecting the modular DCells to one another and to the outside world.

A Scalable, Commodity Data Center Network Architecture

Internet services increasingly employ service-oriented architectures, where the retrieval of a single web page can require coordination and communication with many individual sub-services running on remote nodes. At present, data center networks are expensive due to the use of non-commodity switches at the highest levels. Even with the high cost of the switches, bandwidth can become a bottleneck if many servers in one section of the network try to communicate with servers in another section.

In this paper, the authors show how to leverage largely commodity Ethernet switches to support the full aggregate bandwidth of clusters consisting of tens of thousands of elements. The paper attempts to design a data center communication architecture that meets the following goals: scalable interconnection bandwidth, economies of scale through commodity Ethernet switches, and backward compatibility with hosts running Ethernet and IP.
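The architecture is a fat tree built entirely from identical k-port switches, which supports k^3/4 hosts at full bisection bandwidth. A quick check of the numbers (the formulas follow from the fat-tree construction; the arithmetic is mine):

```python
def fat_tree_hosts(k):
    # k pods, each with (k/2)^2 hosts -> k^3/4 hosts total
    return k ** 3 // 4

def fat_tree_switches(k):
    # (k/2)^2 core switches, plus k switches per pod
    # (k/2 aggregation + k/2 edge in each of the k pods)
    return (k // 2) ** 2 + k * k

print(fat_tree_hosts(48))     # 27,648 hosts from commodity 48-port switches
print(fat_tree_switches(48))  # using 2,880 identical switches
```

This is the sense in which "tens of thousands of elements" can be wired at full bandwidth without any big expensive switch at the top.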

Today, the price differential between commodity and non-commodity switches provides a strong incentive to build large-scale communication networks from many small commodity switches rather than fewer larger and more expensive ones. (The table showing the cost decrease of GigE and 10 GigE is very illustrative of the strength of the authors' argument.) Also, the commodity switches use less power and give off less heat. Use of commodity switches is even more compelling due to the increasing data center energy/heat density. The authors make an important point that an approach that uses commodity switches will be the only way to deliver full bandwidth for large clusters once 10 GigE switches become commodity at the edge.

Another strong motivation to use such an interconnected network topology is the fault-tolerance provided by multiple paths. Such a network would be able to continue for some time without immediate repair, or indefinitely if sealed in a modular data center.

The authors chose to ignore wiring costs in their evaluation of their strategy. However, the complexity involved in properly wiring up such a system is not trivial, since many setup errors frequently occur in data centers. This actually provides an opportunity for manufacturers of switches to sell prepackaged solutions with the switches preloaded with the modified routing algorithms and pre-wired interconnections.

In the future, bandwidth will increasingly become the scalability bottleneck in large-scale clusters. Existing solutions for addressing this bottleneck center around hierarchies of switches, with expensive, non-commodity switches at the top of the hierarchy. Larger numbers of commodity switches have the potential to displace high-end switches in data centers in the same way that clusters of commodity PCs have displaced high-end servers. While this specific approach may not be taken, this paper and the DCell paper will hopefully inspire more research on the subject.

Designing a highly available directory service

This article provides strategies for dealing with failures in a directory service.

The chapter broadly defines failure as anything that prevents Directory Server from providing the minimum level of service required. Failure is divided into two main areas: system unavailable and system unreliable.

The paper presents two backup strategies: binary (full binary replication) and LDIF (a logical backup based on differences from the previous version). Binary replication is a full backup and is faster to restore than LDIF. LDIF offers greater control over granularity, but in situations where rapid restoration is required, LDIF may take too long to be viable. (I did not find the unavailable/unreliable classification particularly enlightening.)

The paper then describes replication topologies starting from a single data center up to 5 data centers. The high level idea is to have one or two master servers per data center, linked by an interconnected topology with redundant links and potential backup links. The architecture seems to allow for 4 master servers, even in the Five Data Center Topology (but does not explain why).

This article gave a good perspective about the current industry thinking for replication and backup. It would be interesting to see how well such a model scales. Their consistency algorithm is probably manually configured based upon the interconnections and where the master servers are placed. As a result, it is unlikely to scale well in the multi-thousand server clusters that are becoming prevalent.

Sunday, February 1, 2009

Failure Trends in a Large Disk Drive Population

This paper attempts to understand factors that affect the reliability of high-capacity magnetic disk drives. Such understanding is useful for guiding the design of storage systems and data centers. This study used a much larger population size than previous studies and presents a comprehensive analysis of the correlation between failures and several parameters that are believed to affect disk lifetime.

The study looked at various factors such as utilization, temperature, and SMART signals (self-monitoring signals).

The authors chose to measure utilization in terms of weekly averages of read/write bandwidth per drive. Other than infant mortality, the data indicates a weaker correlation between utilization levels and failures than previous studies found. It would be interesting to see whether there is any correlation between failure rates and the number of read/write operations, since less stress may be placed on a drive that performs long continuous reads.

The authors reported temperature effects only at the high end of the temperature range (greater than 45 degrees C), and especially for older drives. This result could indicate that datacenter and server designers have more freedom than previously thought when setting operating temperatures for equipment that contains disk drives. Datacenters can allow higher temperatures to lower cooling costs with minimal effect on drive failures; in a system designed to tolerate such failures, this tradeoff may be cost effective. Also, since hard drives show a higher infant mortality rate at low temperatures, pre-shipment testing should be done at low temperatures.

The authors were able to see that most age-related results are impacted by drive manufacturer and model. However, the authors do not show a breakdown of drives per manufacturer, model, or vintage "due to the proprietary nature of these data." The only reason this information is "proprietary" is that Google wishes to maintain a competitive advantage over other companies purchasing large numbers of hard drives. This is the most unfortunate decision made by the authors of the paper. Making this information public would allow manufacturers to draw insights into exactly which conditions correlate with failures, based upon the design or manufacturing decisions made for the particular drive model.

The long-term value of this study will be in the publishing of periodic results, allowing companies to use this knowledge to reduce costs and hard drive manufacturers to attempt design, manufacturing, and testing changes that decrease failures under real-world conditions.

Failure Stories

The problem examined by these articles is how to deal with and prevent data center failures.

The articles give an analysis of why various data centers failed and how a similar failure will be prevented in the future. However, the problem with such an analysis is summed up by the Shumate article “Every data center is unique - there is no other like it. Every design is a custom solution based on the experience of the engineer and the facility executive.” The same article continues with an overview of design principles for a reliable data center. The fundamental lesson is that one cannot assure that a data center will not fail, and while the manager should try to have enough redundancy to minimize the chances of a failure, there should also be an approach to “fail small” as advised in the Miller article.

As we discussed in the previous articles, even the most fault tolerant hardware is not infallible. There is now an increased focus on providing fault tolerance through software and this model should be expanded across data centers so that even if an entire data center fails, the service should only be degraded by the fraction of the total available resources across data centers.

The case study of 365 Main's outage is illustrative of what could have been avoided using an approach that did not depend on the reliability of the power grid or on whether backup systems would respond effectively. The power outage affected "certain customers" (Craigslist, Technorati, LiveJournal, TypePad, AdBrite, the 1Up gaming network, Second Life, and Yelp, among others) in 3 of 8 colocation rooms in 365 Main's San Francisco data center. If 365 Main's data center had operated on a distributed model that allowed any machine to handle any request, the data center would still have been able to provide 5/8 of its full capacity to all of its customers. While one may argue that this would have been worse for the customers who were not affected, those customers should realize that they were merely lucky and cannot rely upon 365 Main's assurances of uninterrupted service. In fact, with an approach of replication with distributed storage and service across data centers, 365 Main's outage would likely not have affected customers' service at all, since 365 Main has 7 data centers.

Many companies have already realized that an architecture of clusters of thousands of commodity-class PCs, with software providing fault tolerance, can deliver throughput performance at a fraction of the cost of a system built from fewer, more reliable, more expensive, high-end servers. Replication across these clusters also addresses the inherent unreliability of each machine: any failures merely degrade the overall performance by the fraction of the machines that have failed. In the coming years, data center managers will realize that the resources devoted to "guaranteeing" that a data center will not fail would be better spent providing a greater number of computers at many different locations.

Wednesday, January 28, 2009

Facebook Presentation

Facebook must handle large amounts of data and analyze both that data and how it is used. Data analysis is hard to run against the live site's data because pulling data for analysis competes with users' requests and would slow down the site.

This is a general talk about data analysis involving very large amounts of data.

Facebook has built numerous tools to analyze data involving statistical tools, visual graphs, varying timescales, etc. Building these tools was necessary to automate the analysis, so a small number of people could do the actual analysis.

Large scale log file analysis is easier than doing analysis on a database that is updated frequently. You do not need to keep historical info in the database because it is never retrieved by the site.

Facebook is looking at distributed databases. Commodity hardware is used for the data center.

Data analysis must provide answers to questions from finance, marketing, etc. Don't collect data without a purpose; the amount of data collected can become overwhelming. Focus on what you can learn from your data.

This presentation gave a lot of general principles for data analysis. This will become more important in the coming years and more companies need to deal with such large amounts of data and attempt to analyze the data and learn something from it.

An Architecture for Modular Data Centers

Not everyone is as big as Google with trained teams running around managing hardware failures in their systems. Expanding capacity can be difficult, especially if one does not have sufficient trained personnel. Even with an up and running data center, on-site hardware service is expensive in that skilled service personnel must be available at each data center. Staffing a data center with full time service technicians is seldom cost-effective unless the facility is very large. Most services contract this work out and can spend 25% of the system’s price over a 3-year service life. Also, human administrative error can increase system outages.

The proposed solution is to no longer build and ship single systems or even racks of systems. Instead, the supplier ships macro-modules consisting of a thousand or more systems. Each module is built in a 20-foot standard shipping container, configured, burned in, and delivered as a fully operational module with full power and networking in a ready to run no-service-required package. All that needs to be done upon delivery is provide power, networking, and chilled water.

The tradeoff with a “no-service-required” package is that parts are not replaced as they fail. The modules are self-contained with enough redundancy that, as parts fail, surviving systems continue to support the load. The components are never serviced and the entire module just slowly degrades over time as more and more systems suffer non-recoverable hardware errors. Such a module requires that software applications implement enough redundancy so that individual node failures don’t negatively impact overall service availability (my initial thoughts on this subject are in the Google Cluster summary). Since the model of using commodity class PCs and tolerating hardware failures in software is already growing, these modules do not pose additional significant challenges in that regard.
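To get intuition for how a sealed module "slowly degrades", one can model independent node failures at a constant annual failure rate. The numbers below (1,000 nodes, 5% annual failure rate) are illustrative assumptions of mine, not figures from the paper:

```python
def surviving_fraction(annual_failure_rate, years):
    # expected fraction of nodes still alive after `years`,
    # assuming independent failures at a constant annual rate
    return (1 - annual_failure_rate) ** years

nodes = 1000
for year in range(4):
    alive = nodes * surviving_fraction(0.05, year)
    print(f"year {year}: ~{alive:.0f} of {nodes} nodes alive")
```

Under these assumptions the module still fields roughly 86% of its capacity after three years, which is the kind of slow, predictable capacity loss the provisioning plan has to budget for.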

Such work by Microsoft, Rackable Systems, Sun Microsystems, and others will likely become even more important in 10 years to companies delivering services over the web that want to have control of their datacenter clusters with the ability to scale up quickly, while still minimizing maintenance costs.

Tuesday, January 27, 2009

Web Search for a Planet: The Google Cluster Architecture

Google is attempting to solve the problem of how to deliver its bread-and-butter service: web search. Google must deliver search results quickly while still maintaining a low cost.

Google's main idea is to use clusters of thousands of commodity-class PCs and to tolerate hardware failures in software. Such an architecture delivers superior throughput performance at a fraction of the cost of a system built from fewer, more expensive, high-end servers. As a result, Google can afford to use more computational resources per search query.

Google also aggressively exploits the very large amounts of inherent parallelism in web search. For example, Google transforms the lookup of matching documents in a large index into many lookups for matching documents in a set of smaller indices, followed by a relatively inexpensive merging step. By parallelizing the search over many machines, Google reduces the average latency necessary to answer a query, dividing the total computation across more CPUs and disks. As a result, Google purchases the CPUs that currently give the best performance per unit price, not the CPUs that give the best absolute performance. This replication also addresses the inherent unreliability of each machine: any failures merely degrade the overall performance by the fraction of the machines that have failed.
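The index-sharding idea can be sketched as follows: each shard holds an inverted index over a subset of the documents, a lookup fans out to every shard, and a cheap merge combines the per-shard hit lists. The shard contents below are made up, and the real cluster places each shard on a different machine and queries them in parallel:

```python
# each shard maps a term to the doc ids it holds for that term
shards = [
    {"cloud": [1, 4], "data": [2]},
    {"cloud": [7], "search": [5, 7]},
    {"cloud": [9], "data": [8, 9]},
]

def search(term):
    # fan the lookup out to every shard, then merge the partial results
    hits = []
    for shard in shards:
        hits.extend(shard.get(term, []))
    return sorted(hits)

print(search("cloud"))  # [1, 4, 7, 9]
```

Replicating each shard across several machines is what turns a machine failure into a throughput loss rather than a correctness loss.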

Google’s main assumption in a search is that accesses to the index and other data structures involved in answering a query are read-only. This avoids the consistency issues that arise in using a general-purpose database. While this is valid for web search, many applications will require updates to the data. Separating the computation task from the storage task and using a completely distributed storage system may be a solution to the problem, assuming we are willing to accept eventual consistency for the replicas of our data.

This paper was written five years ago, but the problem addressed by the paper is still valid. Today, the problem is even more prevalent (Google alone is estimated to have 500,000 systems in 30 data centers world-wide) and will continue in the future with the rise of many large data centers from different companies. There is an increased focus by a variety of companies on providing fault tolerance through software because even the most fault tolerant hardware is not infallible and using commodity hardware can provide a more cost effective solution.
