Monday, September 28, 2009

On Optimistic Methods for Concurrency Control

In this paper, the authors present two non-locking concurrency control methods for DBMSs: serial and parallel validation. These methods are “optimistic” because they rely on the hope that conflicts will not occur. The authors first make it clear that locking approaches to concurrency control have numerous disadvantages (locking overhead, deadlock). The argument for “optimistic” concurrency control is as follows: reads are completely unrestricted, since they can never cause a loss of integrity, while writes are restricted. A transaction consists of two or three phases: a read phase, a validation phase, and a possible write phase. These methods may be superior to locking methods for systems where transaction conflict is highly unlikely, such as query-dominant systems and very large tree indexes. Such an optimistic system would be inefficient where transaction conflict is not rare. The paper suggests that a system should vary the mix of locking versus optimistic approaches as the likelihood of transaction conflict in the system varies; however, the authors do not delve into such an analysis.
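
As a concrete illustration of serial (backward) validation, here is a minimal sketch; the names and structure are mine, not the paper's notation. A transaction is validated by checking that no transaction that committed after it began its read phase wrote anything it read:

def validate(tj_read_set, tj_start_tn, committed):
    """committed: list of (transaction_number, write_set) pairs in commit
    order. Tj is valid only if no transaction that committed after Tj
    started its read phase wrote an object that Tj read."""
    for tn, write_set in committed:
        if tn > tj_start_tn and (write_set & tj_read_set):
            return False            # conflict detected: restart Tj
    return True                     # safe to enter the write phase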

Granularity of Locks and Degrees of Consistency in a Shared Data Base

This paper discusses the granularity of locks in a database management system: small granularity means high concurrency but high overhead, while large granularity means low overhead but low concurrency. A fine-grained lockable unit is preferable for a simple transaction that accesses few records, and a coarse lockable unit is preferable for a complex transaction that accesses many records. Therefore, it is favorable to have lockable units of different granularities coexisting in the same system. This can be accomplished with hierarchical locks, which allow one to lock a node and, implicitly, all of its descendants. Various access modes (including intention modes) are used to determine whether two lock requests by different transactions are compatible (can be granted concurrently), as sketched below. These ideas can be generalized from trees to directed acyclic graphs (DAGs). We also have to remember that indices and files are created and destroyed continually, so the lock graphs are dynamic. While this added complexity requires care to ensure correctness, the concurrency gains make it worthwhile.
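
A minimal sketch of the compatibility check for the standard IS/IX/S/SIX/X modes (the matrix below is my rendering of the usual one, not copied from the paper's figures):

# Illustrative compatibility matrix for hierarchical (intention) locks.
# True means the two modes can be held concurrently on the same node.
COMPAT = {
    ('IS', 'IS'): True,  ('IS', 'IX'): True,   ('IS', 'S'): True,
    ('IS', 'SIX'): True, ('IS', 'X'): False,
    ('IX', 'IX'): True,  ('IX', 'S'): False,   ('IX', 'SIX'): False, ('IX', 'X'): False,
    ('S', 'S'): True,    ('S', 'SIX'): False,  ('S', 'X'): False,
    ('SIX', 'SIX'): False, ('SIX', 'X'): False,
    ('X', 'X'): False,
}

def compatible(m1, m2):
    # The matrix is symmetric, so look up the pair in either order.
    return COMPAT.get((m1, m2), COMPAT.get((m2, m1), False))

# Example: a reader takes IS on the database and file, then S on a record;
# a writer takes IX on the ancestors and X on its record. IS and IX are
# compatible, so readers and writers of different records proceed concurrently.
print(compatible('IS', 'IX'))   # True
print(compatible('S', 'IX'))    # False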

Tuesday, September 22, 2009

Experience With Processes and Monitors in Mesa

This paper discusses how to deal with threads (lightweight processes running concurrently within a single program). It was written roughly 30 years ago, when concurrency within programs was increasing, and it tackles the issues involved in using monitors to synchronize these threads. The authors chose preemptive scheduling of lightweight processes synchronized by monitors, driven by several requirements: the need to do I/O, support for multiprocessors, information hiding between modules, and more structured locking.
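
One of the subtleties the paper works out is that in Mesa, NOTIFY is only a hint: a waiting process must re-check its condition in a loop when it wakes up. A small sketch of that pattern in Python (illustrative; Mesa built monitors into the language itself):

import threading

class BoundedBuffer:
    """Monitor-style bounded buffer with Mesa wait/notify semantics."""
    def __init__(self, capacity):
        self.lock = threading.Lock()
        self.not_full = threading.Condition(self.lock)
        self.not_empty = threading.Condition(self.lock)
        self.items = []
        self.capacity = capacity

    def put(self, x):
        with self.lock:
            while len(self.items) >= self.capacity:   # re-check: NOTIFY is only a hint
                self.not_full.wait()
            self.items.append(x)
            self.not_empty.notify()

    def get(self):
        with self.lock:
            while not self.items:                     # Mesa loop, not a Hoare-style 'if'
                self.not_empty.wait()
            x = self.items.pop(0)
            self.not_full.notify()
            return x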

This paper made an attempt to work out the subtle issues involved in implementing a system running many light-weight processes concurrently. Today, the layer of the system that allows for such concurrency functions effectively and allows us to develop many complex systems with very large concurrency.

Wednesday, September 16, 2009

Lightweight Recoverable Virtual Memory

Goal: How simple can a transactional facility be, while remaining a potent tool for fault-tolerance? The authors omitted what they could without crippling RVM.
RVM is intended for Unix applications with persistent data and the paper is presented in three parts: rationale, architecture, and implementation.

Rationale: Existing solutions, such as Camelot, were too heavyweight. The solution was a lightweight facility that provides only recoverable virtual memory. The central principle the authors adopted in designing RVM was to value simplicity over generality. The authors followed the systems principle of building simple components that do their job well.

Architecture: LRVM is managed in segments as opposed to pages. Regions of the segments are mapped to Unix virtual memory. There is no aliasing: mappings cannot overlap in virtual memory and no region of a segment may be mapped more than once by the same process. Regions can be unmapped at any time, as long as they have no uncommitted transactions outstanding. The designers focused on using meta-data to minimize the disk space that recoverable memory would occupy. Many other small features were also omitted.

Implementation:
Log Management: RVM uses a no-undo/redo value logging strategy, because it never reflects uncommitted changes to an external data segment. Crash recovery is idempotent, so it can simply be repeated if interrupted. Log truncation is also handled.
Optimizations: Intra-transaction optimizations arise when set-range calls specifying identical, overlapping, or adjacent memory addresses are issued within a single transaction. Inter-transaction optimizations occur only in the context of no-flush transactions.
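
A rough sketch of the no-undo/redo value-logging idea (hypothetical names and structure, not the actual RVM interface): old values are saved in memory only so that an abort can be undone in place, and only the new values of committed set-ranges ever reach the on-disk log, so uncommitted data never touches the external data segment.

class Transaction:
    """Toy no-undo/redo value logging over a bytearray 'memory'."""
    def __init__(self, memory, disk_log):
        self.memory, self.disk_log = memory, disk_log
        self.saved = {}                        # in-memory old values, for abort only

    def set_range(self, addr, length):
        # Save the old value before the application modifies the range.
        self.saved.setdefault((addr, length), bytes(self.memory[addr:addr + length]))

    def commit(self):
        # Only new values of committed ranges reach the on-disk log, so the
        # log (and the external data segment) never sees uncommitted data.
        for (addr, length), _old in self.saved.items():
            self.disk_log.append((addr, bytes(self.memory[addr:addr + length])))

    def abort(self):
        # Undo happens purely in memory, from the saved old values.
        for (addr, length), old in self.saved.items():
            self.memory[addr:addr + length] = old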

Result: RVM handles more transactions/sec than Camelot in all cases. However, performance drops significantly when the recoverable memory size approaches the physical memory size.

Analysis: A lightweight RVM can provide the necessary functionality while improving performance and modularity. This lesson can be applied in general when building a complex system: additional features can be added later as components built on top of the system. While this paper did not demonstrate that idea, complex distributed systems today, such as Google's, are built from layered components, each providing a given piece of functionality.

Tuesday, September 15, 2009

Segment-Based Recovery: Write-Ahead Logging Revisited

Large-scale distributed systems have challenges that are not adequately addressed by previous recovery systems, or previous architectures based on single systems in general. This paper proposes segment-oriented recovery to enable distributed recovery architectures that are not hindered by the tight coupling of components required by page-oriented recovery. The distributed variations are quite flexible and enable recovery to be a large-scale distributed service.
Page-oriented recovery leads to a tight coupling between the application, the buffer manager, and the log manager. That tight coupling might be fine on a traditional single-core machine, but it leads to performance issues when the components are distributed to different machines and different cores. Segment-oriented recovery enables simpler and looser coupling among components. Write-back caching reduces communication between the buffer manager and the application, since communication occurs only on cache eviction. Since there is no shared state, calls to the buffer manager and log manager can be asynchronous. The use of natural layouts for large objects allows DMA and zero-copy I/O in the local case; in the distributed case, it allows application data to be written without copying the data and the LSNs to the same machine. This allows for very flexible large-scale write-ahead logging as a service for cloud computing.
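
One way to picture the decoupling (a hypothetical sketch with made-up interfaces, not the paper's code): with page-oriented recovery every page carries an LSN that ties eviction to a synchronous log flush, while segment-oriented recovery logs application-level ranges with no LSN embedded in the data, so eviction and logging can proceed asynchronously.

def evict_page_oriented(page_id, page_bytes, page_lsn, log_mgr, storage):
    # Page-oriented: the write-ahead rule is enforced through the LSN kept
    # in the page header, so eviction blocks on a synchronous log flush.
    log_mgr.flush_up_to(page_lsn)
    storage.write(page_id, page_bytes)        # the page carries its LSN header

def evict_segment_oriented(seg_id, seg_bytes, covering_lsn, log_mgr, storage):
    # Segment-oriented: the data carries no embedded LSN, so eviction only
    # needs the covering log records to become durable eventually; the flush
    # and the write can be pipelined asynchronously.
    log_mgr.flush_async(covering_lsn,
                        on_durable=lambda: storage.write(seg_id, seg_bytes))
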
For small transactions, the networked version is roughly ten times slower than the local versions, but approximately 20 times faster than a distributed, page-oriented approach. As transaction sizes increase, segment-based recovery is better able to amortize network round trips due to log and buffer manager requests, and network throughput improves to more than 400 times that of the page-based approach. As above, the local versions of these benchmarks are competitive with local page-oriented approaches, especially for long transactions.

Analysis: The asynchronous approach is necessary for large-scale distributed systems. Also, recovery based on the granularity of application requests is more in line with the transactions in present large-scale Internet systems. However, the focus on large transaction size may not be valid for the cloud computing that this approach targets. The retrieval of a single web page can require communication with hundreds of small sub-services running on remote nodes. Perhaps some sort of lightweight version (we should consider letting go of some constraints if the application does not require them) would be preferable for the short transactions.

This up-to-date paper looking at modern challenges was well placed in the course. I hope we have more such recent papers throughout the course.

Tuesday, September 8, 2009

ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-ahead Logging

ARIES attempts to better address the problem of guaranteeing atomicity and durability of transactions despite concurrent execution of multiple transactions and various failures. ARIES uses a log to record the progress of a transaction and its actions that change recoverable data. The logging method allows partial or total rollbacks of transactions during both normal and restart processing. Many issues are discussed, such as latches and locks, fine-granularity locking, and buffer management. The paper goes over the data structures used by ARIES, how ARIES processes transactions, restart processing, and recovery.
There are three phases in the ARIES recovery algorithm:
-Analysis: Scan the log forward from the most recent checkpoint to identify all Xacts that were active, and all dirty pages in the buffer pool at the time of the crash.
-Redo: Redoes all updates to dirty pages in the buffer pool to ensure that all logged updates are carried out and written to disk.
-Undo: Working backwards in the log, the writes of all Xacts that were active at the crash are undone by restoring the before value of the update.
A crash occurring during the recovery process is also handled.
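
A toy sketch of the three passes (my structure, heavily simplified; real ARIES also conditions redo on each pageLSN, uses recLSNs from the dirty page table, and writes compensation log records during undo):

def aries_recover(log, checkpoint_lsn):
    """Toy recovery pass. The log is a list indexed by LSN; each record is
    a dict with keys: lsn, xid, kind ('update'|'commit'|'end'), page,
    and redo/undo callables."""
    # 1. Analysis: scan forward from the checkpoint to find transactions
    #    that were active at the crash ("losers") and possibly dirty pages.
    losers, dirty = set(), {}
    for rec in log[checkpoint_lsn:]:
        if rec['kind'] == 'update':
            losers.add(rec['xid'])
            dirty.setdefault(rec['page'], rec['lsn'])    # first-dirtied LSN (recLSN)
        elif rec['kind'] in ('commit', 'end'):
            losers.discard(rec['xid'])

    # 2. Redo: repeat history from the smallest recLSN, re-applying every
    #    logged update (real ARIES skips updates already on disk via pageLSN).
    start = min(dirty.values(), default=checkpoint_lsn)
    for rec in log[start:]:
        if rec['kind'] == 'update':
            rec['redo']()

    # 3. Undo: walk backwards, rolling back the updates of loser transactions
    #    (real ARIES writes a compensation log record for each undo).
    for rec in reversed(log):
        if rec['kind'] == 'update' and rec['xid'] in losers:
            rec['undo']()
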
This paper discussed a detailed system for providing atomicity while still allowing multiple concurrent transactions and recovery from inevitable failures.

The HP AutoRAID Hierarchical Storage System

Management of redundant storage systems can be complicated. The HP AutoRAID attempts to solve this problem with a two-level storage hierarchy that is abstracted behind a single disk-array controller. Write-active data are mirrored for better performance, while write-inactive data are stored in RAID 5 for better cost. The system begins by allocating mirrored space until no more can be stored this way and then automatically reallocates some of the storage space to RAID 5. The HP AutoRAID adapts to workload changes: newly active data are promoted to mirrored storage, while data that have become less active are demoted to RAID 5. The HP AutoRAID also accommodates disks of different capacities, which allows for easy expansion and upgrades. The system can even have a second controller that it automatically fails over to if the primary controller fails. The system is simple to administer and set up. The HP AutoRAID seems to do a good job of providing performance close to that of non-redundant systems for many workloads. At the same time, it provides full data redundancy and can tolerate the failure of any single array component.
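
A toy sketch of the promotion/demotion policy (illustrative only; the real controller works on fixed-size relocation blocks and has many more constraints): newly written data lands in the mirrored tier, and when that tier fills, the least recently written blocks are demoted to RAID 5.

from collections import OrderedDict

class AutoTier:
    """Toy two-tier placement: mirrored (fast, costly) vs RAID 5 (cheap)."""
    def __init__(self, mirrored_capacity):
        self.mirrored = OrderedDict()          # block -> data, ordered by last write
        self.raid5 = {}
        self.capacity = mirrored_capacity

    def write(self, block, data):
        # Newly written (write-active) data goes to the mirrored tier;
        # rewriting a demoted block promotes it back.
        self.raid5.pop(block, None)
        self.mirrored.pop(block, None)
        self.mirrored[block] = data
        while len(self.mirrored) > self.capacity:
            cold_block, cold_data = self.mirrored.popitem(last=False)
            self.raid5[cold_block] = cold_data  # demote least recently written
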
This system is a classic example of a system that achieves its function while abstracting away how the function is carried out. Has this system become obsolete with the trend toward clusters of networked computers? Storage has become so cheap that RAID is not needed in such systems; redundancy can be achieved by replicating data across multiple computers.

A History of Improving File System Performance

A Fast File System for UNIX
This paper begins by mentioning the types of applications, such as VLSI design and image processing, that do a small amount of processing on a large quantity of data and need high throughput from the file system. In the traditional UNIX file system, the inodes are not consecutively ordered, which results in many disk accesses. To improve file system throughput, a larger block size was used. But a larger block size results in a larger percentage of wasted space, so the design team countered this waste by putting small files in fragments within blocks. When the new file system was tested, there were order-of-magnitude increases in read and write bandwidth. The price paid is wasted disk space and increased CPU utilization. With CPU speed and disk size increasing exponentially and disk bandwidth lagging far behind, this seems like a desirable tradeoff.
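
The space cost is easy to quantify with a back-of-the-envelope calculation (the sizes below are made up for illustration, not taken from the paper's measurements):

# Internal fragmentation for the last partial block of a file.
def wasted_bytes(file_size, block_size, fragment_size=None):
    unit = fragment_size or block_size     # the tail of the file lives in this unit
    tail = file_size % unit
    return 0 if tail == 0 else unit - tail

print(wasted_bytes(5 * 1024, 8 * 1024))               # 3072: 8 KB blocks only
print(wasted_bytes(5 * 1024, 8 * 1024, 2 * 1024))     # 1024: with 2 KB fragments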

The Design and Implementation of a Log-Structured File System
This paper reiterates the fact that processor speed is increasing exponentially, while memory and disk speeds are being left behind. While there are ways to increase disk throughput, there are no foreseeable improvements for access time. Information is spread around the disk, which causes too many disk accesses. So it seems that we should attempt to minimize disk accesses. A log-structured file system (LFS), which writes all modifications to disk sequentially, attempts to accomplish this goal.
A traditional file system achieves logical locality by assuming certain access patterns and pays extra on writes to organize information optimally on disk for the assumed read patterns. In contrast, LFS achieves temporal locality: information that is created or modified at the same time is grouped closely on disk. If temporal locality matches logical locality, as it does for a file that is written sequentially and then read sequentially, then an LFS should have about the same performance on large files as a traditional file system. If temporal locality differs from logical locality, the systems will perform differently: LFS will perform poorly on reads. The log structure also speeds up crash recovery, because the system only needs to scan the most recent portion of the log instead of the entire disk.
The main problem with LFS is degraded performance for logical reads, which is what journaling file systems try to address.

Analysis and Evolution of Journaling File Systems
This paper gives an overview of various journaling file systems in use today. Journaling tries to provide the best of both worlds: data is written to the log sequentially, then moved to its logical location. This provides fast crash recovery, while all normal reads occur from the standard blocks. Journaling is how modern commercial file systems such as Linux ext3 and Windows NTFS work.
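
A minimal sketch of the ordering a journaling file system enforces (a generic write-ahead pattern with made-up interfaces, not any particular file system's code): the update becomes durable in the journal first and is only then checkpointed to its home location, so a crash leaves either the old data intact or a replayable journal entry.

def journaled_update(journal, home_storage, block_no, new_bytes):
    # 1. Make the update durable in the journal first (write-ahead).
    journal.append(('data', block_no, new_bytes))
    journal.append(('commit',))
    journal.flush()

    # 2. Only then checkpoint it to its home (in-place) location.
    home_storage.write(block_no, new_bytes)
    home_storage.flush()

    # 3. The journal entry can now be reclaimed.
    journal.truncate()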

System R and Architecture of a DB System

System R
This paper talks about the initial origins of System R, goes through the development of prototypes, and finally evaluates and draws conclusions. The paper begins by describing the relational data model proposed by Codd in 1970. Codd proposed that data should be represented by data values and never by any connections visible to the user, and that users should not have to specify algorithms when they make a request. This is the origin of the SQL DBMS. The question was whether the second item could be achieved well. The paper continues by specifying the key goals established for System R. The phase zero prototype was intended only to determine the feasibility of design methods and functions; work then began on the phase one prototype, whose details the paper then describes. The evaluation of the phase one prototype was generally good, and it was considered to have been successful in its goals: simplicity, power, and independence of data. System R demonstrated that a production database system was feasible. There is not much to criticize: this is the foundation and inspiration for modern database systems, and while not as evolved as today’s production systems, the evolution has been a progressive march over decades.

Architecture of a DB System
This is actually a good follow-up to the System R paper. System R gives the original thinking from when the first relational DBMSs were being developed, while this paper goes over what has been learned and formalized over the decades: process models, parallel architectures, ACID. Essentially, this paper is an overview of the various design considerations that have come up in the history of DBMSs.

I would be curious whether we are going to discuss the implications of today’s large-scale systems: not just your CAP theorem, but the responses to these challenges (distributed file systems, virtualization), hopefully in a bit more detail than your syllabus from last year. For example, with virtualization, could we look at and discuss the various approaches being considered to manage and spawn VMs based upon need? What implications would this have for distributed storage systems? It is not as simple as spawning VMs of a stateless application layer.

Monday, August 24, 2009

Computer Systems Discussions

I am currently taking Eric Brewer's graduate computer systems class at Berkeley.
CS262a: Advanced Topics in Computer Systems
http://www.cs.berkeley.edu/~brewer/cs262/

Wednesday, May 6, 2009

Virtualization vs. OS separation

There was a discussion point about virtualization software in class today:
It was suggested that we should advance OS research rather than rely on virtualization software, because of the overhead involved. However, a return to basics may be in order. A review of the original IBM virtual machine paper may allow us to step back and get a better perspective.
Virtual storage and virtual machine concepts
A virtual machine is simply a machine sharing system. Over time, this has evolved into the OS we know today.

Virtualization software is the new datacenter OS:
-A strict separation between each cloud client is necessary to provide the illusion and security of running the application in isolation from other applications.
-A virtual machine (VM) isolating each application, which can scale to millions of users over many physical machines, is akin to the OS, which isolates each application for one user.
-Virtualization software and support for it (by the CPU) should get to the point where there is no more overhead than using an OS to run multiple applications.
-Virtualization software must simply be CPU/machine time/resource sharing software, but with a true wall between each VM.
-Communication between VMs on the same machine should be the same as communication between machines.
-Lightweight OS within a VM should be used to manage the processes involved with an application.

Paper comments:

While I agreed with some of their high-level thinking, their comments on communication between VMs stood out:
"hence synchronization and protected control transfer are only necessary when two virtual machines wish to explicitly communicate."
"where VMs do communicate, they may not only be written in separate programming languages, but may also be running completely different operating systems."

The problem with these comments is that communication between VMs running the same application should be the common case. This is necessary to achieve scalability. I wonder if their suggestion to separate control and data paths is sufficient, considering that there may be a great deal of small-scale communication between VMs.

Monday, May 4, 2009

Intel CERN whitepaper

While this paper is marketing material from Intel, it makes some good points about the value of moving to multi-core chips and of using virtualization to consolidate servers with low utilization.

Problem: Most businesses already spend about half as much for the electricity to power and cool their infrastructure as they do for the hardware itself, and this percentage is expected to increase. This challenge is compounded by the design constraints of existing data centers, many of which are already running at or near thermal capacity. Unless energy efficiency is dramatically improved, organizations will be unable to expand their computing infrastructure without the expense and disruption of upgrading their data center, building a new one, or migrating to a co-location facility.
The goal was to maximize total performance per Watt for the computing infrastructure. This can allow datacenters to grow their computing, reduce their costs and extend the life of existing facilities.

Solution: CERN has found that its return on investment (ROI) is generally highest by optimizing datacenter performance/Watt. Multi-core processors based on the Intel Core microarchitecture deliver about five times more compute power per Watt than single-core processors based on the earlier Intel NetBurst microarchitecture. According to CERN, this move alone has already increased the useful life of its data center by about two years, enabling the organization to avoid the cost and disruption of adding a new facility.

This energy efficiency can be achieved with a basic understanding of circuit design. A small reduction in frequency causes a small reduction in the amount of work performed, but a relatively large drop in the amount of energy consumed. As a result, more cores running at lower frequencies can deliver substantial gains in total performance per Watt.
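
To make the intuition concrete, here is an illustrative back-of-the-envelope calculation using the usual dynamic-power approximation P ≈ C·V²·f, where a lower frequency also permits a lower voltage (all numbers below are made up, not CERN's or Intel's measurements):

def dynamic_power(c, v, f):
    return c * v * v * f                  # P ~ C * V^2 * f

one_core  = dynamic_power(c=1.0, v=1.00, f=1.0)       # relative performance ~ 1.0
two_cores = 2 * dynamic_power(c=1.0, v=0.85, f=0.8)   # relative performance ~ 1.6

print(one_core, two_cores)                 # ~1.0 vs ~1.16 (relative Watts)
print(1.0 / one_core, 1.6 / two_cores)     # perf/Watt: 1.0 vs ~1.38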

Virtualization can be used to consolidate smaller and infrequently used applications, which reduces the number of servers required for these secondary workloads. Better utilization provides energy efficiency gains, reduces data center footprints, and provides a more flexible and manageable infrastructure.

Conclusion: This paper brings up a couple of good points. While new datacenters may not have to optimize power consumption to the extreme that CERN does, they should be aware of the long-term consequences of their decisions, which may create limitations in the future. Also, while virtualization adds overhead and is not preferred for situations where a particular application keeps server utilization high, it can provide clear benefits when used to consolidate underutilized servers.

Microsoft PUE: Parts 1&2

Problem: Given the complexity of datacenter design and operation, energy efficiency changes must be closely monitored for overall effect. You can upgrade items to more energy-efficient equivalents, but unless you look at the big picture and understand how the pieces fit together, you could end up being disappointed with the outcome.

Solution: PUE is suggested as the indicator of whether efficiency actually got better or worse. PUE = (Total Facility Power)/(IT Equipment Power)
PUE is a simple metric that gives a big-picture view of datacenter efficiency, design, and cost effectiveness. Without a metric like PUE, the authors suggest, an engineer could not measure datacenter efficiency to see whether it had improved.
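
For example (hypothetical numbers, for illustration only):

total_facility_kw = 1500.0    # servers + cooling + power distribution + lighting
it_equipment_kw = 1000.0      # power delivered to the IT equipment itself
pue = total_facility_kw / it_equipment_kw
print(pue)                    # 1.5, i.e. half a watt of overhead per IT watt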

Analysis: Total datacenter costs must be considered: servers, infrastructure, running costs (energy, management, etc). In 2001, the sum of infrastructure and energy costs was equal to the cost of a 1U server. In 2004, the infrastructure cost alone was equal to the cost of the server. In 2008, just the energy cost was equal to the cost of a server.
PUE is a simple metric that analyzes datacenter efficiency in terms of how much overhead there is for a given set of servers in the datacenter. However, PUE neglects an analysis of the actual servers in the datacenter. Work/Watt is an important metric that cannot be neglected. It is likely that servers can be more easily upgraded than datacenter infrastructure.

Conclusion: PUE is a useful metric to analyze the overhead beyond the servers but is not the only necessary metric. Work/Watt is important. The power used by computer equipment to accomplish the work/computation is the central datacenter energy efficiency issue.

Wednesday, April 29, 2009

Ubuntu 9.04 beta out, now with fresh Eucalyptus

http://mobile.pcauthority.com.au/Article.aspx?CIID=141212&type=News&page=0&showall=true

Cloud Economics

Problem: Economic allocation of cloud resources within a data center.

Kevin Lai from HP Labs came to discuss his work in this field. Kevin brought up some interesting points, and I will summarize two: optimizing across multiple layers, and using a bidding system to optimize the provisioning of resources.
http://tycoon-wiki.hpl.hp.com/

Optimization across multiple layers
Kevin argues that optimization cannot merely be done within each component (application; distributed storage/computing; virtualization/network/OS; physical/power). An overarching optimization must be done across these components, with each layer coordinating with a general optimizer. However, abstraction has allowed for a reduction of complexity while enabling innovation. Google is an excellent example: many applications and services, such as BigTable, are built upon GFS and Chubby. This modular architecture was built up over time as needed. There is also a great deal of innovation occurring in the open source space (Hadoop, etc.). While overarching optimization may occur in the future, at this time it may stifle innovation by preventing changes to a given component.

Bidding System
There is an argument to be made that some sort of bidding system may help mitigate supply and demand issues between cloud providers and customers. However, some customers may want cost guarantees provided by a flat rate per use.
The most interesting aspect of Kevin's proposals is using a bidding system to allocate resources within the data center. Such a bidding system can be used to create a predictability model that trades off a bid for resources, QoS, and a guarantee (probability of completion). On average, this model can allow jobs to complete more work within a given time.
This bidding system can also be used to provision resources: price inflation for a given resource is an indication of under-provisioning.

Monday, April 27, 2009

BOINC

Problem: Large scientific computations require many computers in order to effectively process the data.

Solution: BOINC is a software system that makes it easy for scientists to create and operate public-resource computing projects. This allows researchers to distribute the computation to millions of regular computer users through a client-side platform that runs when the computer is idle. The paper describes features that went into BOINC to make it easy for scientists and attractive for users. Incentives for users are a nice interface and a count of how much they are contributing to the project (Top Alien Hunters).

Influence: I remember running SETI@home 24/7 in the background on my P3 600MHz overclocked to 800MHz.
Main project website: http://boinc.berkeley.edu/
Listen for ET’s call: http://setiathome.berkeley.edu/
Find gravity waves: http://einstein.phys.uwm.edu/

BitTorrent

Problem: Distributing a large file to multiple users can overload the resources of the hosting machine.

Solution: BitTorrent is a peer-to-peer protocol that lets peers download pieces of a file from various peers and upload other pieces to them. This redistributes the cost of upload to the downloaders. Some of the challenges include figuring out which peers have which parts of the file, and high churn rates. The rarest-first technique and the choking/unchoking algorithms allow BitTorrent to mitigate these challenges efficiently.
Rarest First: Replicating the rarest pieces as quickly as possible reduces the risk of their being lost completely as current peers stop uploading. It also leaves the more common pieces for later, reducing the likelihood that a peer currently offering uploads will later have nothing of interest.
Choking/Unchoking: A good choking algorithm should utilize all available resources, provide reasonably consistent download rates for everyone, and be somewhat resistant to peers only downloading and not uploading. To avoid situations in which resources are wasted by rapidly choking and unchoking peers, BitTorrent peers recalculate whom they want to choke once every ten seconds, and then leave the situation as is until the next ten-second period is up. Ten seconds is long enough for TCP to ramp up new transfers to their full capacity.
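
A minimal sketch of rarest-first piece selection (illustrative; the real client also has strict piece priority, random first piece, and an endgame mode):

def pick_piece(my_pieces, remote_pieces, swarm_bitfields):
    """Pick the piece this peer can give us that the fewest peers in the
    swarm currently have. Arguments are sets of piece indices."""
    counts = {}
    for bitfield in swarm_bitfields:               # availability across the swarm
        for piece in bitfield:
            counts[piece] = counts.get(piece, 0) + 1
    candidates = remote_pieces - my_pieces         # pieces we still need from this peer
    return min(candidates, key=lambda p: counts.get(p, 0), default=None)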

Analysis: If there are few leechers, BitTorrent works very well. However, if a substantial number of peers tried to download without uploading, there would be problems. Not many users are very abusive today, which is a real-world demonstration of cooperation emerging in a repeated prisoner’s dilemma.

Measuring and Evaluating Large-Scale CDNs

Problem: CDNs are a critical and central part of today’s Internet infrastructure, and there is a need to measure the performance of large-scale commercial CDNs. Most major websites that use CDNs would be interested in the results. While CDNs have internal performance metrics they present to potential customers, this study is done from the outside, which may give websites more information on which to base their decisions.

Solution: