Monday, April 25, 2016

Once again Fiat Lux

I will be posting again.

To that end, here are some updates:


My LinkedIn:

Wednesday, October 29, 2014

Fiat Lux

I will be posting again. To that end, here are some updates:


My LinkedIn:

My startup:

Monday, August 22, 2011

HP has problems with it renewed focus on enterprise

This article made a good point about the problems HP had this weekend with it's the mad rush to buy Touchpads.
HP's TouchPad fire sale: The fallout
HP claims to sell "adaptive infrastructure." But after seeing HP's infrastructure at work on it's own website, would you buy HP's cloud technology?

Tuesday, December 7, 2010

At Dreamforce in SF

I am at Dreamforce in SF this week.

My blog is active again...

Saturday, October 24, 2009

Monday, October 19, 2009

Xen and the Art of Virtualization

Modern computers are sufficiently powerful to use virtualization to present the illusion of many smaller virtual machines (VMs), each running a separate operating system instance. Xen uses paravirtualization for the Virtual Machine monitor (VMM). A VMM is the layer between software and hardware, allowing more than one guest OS to run concurrently. Xen shows good performance and isolation of each VM, while allowing unmodified applications and supporting full operating systems.

When Xen began its development, it could not support a guest OS "as is", and patches to the guest kernel had to be applied. Today, architecture modifications allow unmodified kernels to run over Xen. The IA32 architecture uses a 4 ring privileges model, where the highest privilege (usually the OS, but now the Xen VMM) is running in ring 0, the rings 1 (now running the OS) and 2 are mostly unused, and ring 3 is home for the applications.

The Xen architecture is as follows:
-The hypercall interface allows domains to perform asynchronous software trap into the hypervisor to perform a privileged operation, analogous to the use of system calls in conventional operating systems.
-Xen allows guests to manage their own page tables (PTs), but limits the direct access to it to read-only. Modifying the PT is permitted with a Xen verification of the change.
-The hardware interrupt system is replaced by a lightweight event system.
-Xen contains no device drivers. Each driver request forwarded by a guest OS, is redirected by Xen to the Host OS device driver, allowing the guest OS to interact with the hardware while being hardware-independent.

The scalability provided by Xen's lightweight hypervisor allows for many VMs on each physical machine, which allows for efficient, fine-grained, and dynamic management of VMs. Xen’s network performance is important due to the need for multiple VMs running on separate physical machines. For example, an application’s need for 1000 machines may be serviced by 4000 VMs running on a quarter machine.

Please see a previous post for additional comments:
Virtualization vs. OS Separation

Wednesday, October 7, 2009

Parallel Database Systems

Parallel Database Systems: The Future of High Performance Database Systems
by David DeWitt and Jim Gray

I remember reading this paper years ago. Reading it again next to having read the MapReduce paper put the parallel DB mindset into a modern view. Anyone who has read the MapReduce paper should read this article.
MapReduce summary

These papers give perspective on how the problem can expand and be attacked more than a decade into the future. The Parallel DB paper goes over some barriers to speedup: startup, interference, skew. The scale of the problem and needed parallelism has increased dramatically since the parallel DB article was written (multi-PB vs multi-TB). There was also great foresight in this paper concerning the use of local computation on commodity processors and memory (shared-nothing architecture).

MapReduce has attempted to formalize and attach these problems and remove them from the input given by the requester. MapReduce can be thought of as a type of parallel SQL. MapReduce attacks each of the barriers to speedup: (see MapReduce summary)
Startup: MapReduce has streamlined a general purpose.
Interference: The MapReduce workers are distributed evenly enough and as close as possible to the data that must be operated upon.
Skew: MapReduce deals with the straggler program by restarting slow jobs.

Analysis: The Parallel DB paper tried to look ahead at the evolution of high performance DB systems. The authors did not, and could not have foreseen the exponential expansion of the internet. The scale of demand upon DB systems requires a change in mindset. Traditionally, the speed of RDMS has been increased in the background while maintaining all constraints. Today, we accept that there fundamental tradeoffs (see CAP Theorem). By loosening some of the constraints (such as strict consistency), we can achieve the needed performance provided by modern distributed storage systems.

Monday, October 5, 2009

Lottery Scheduling: Flexible Proportional-Share Resource Management

Lottery Scheduling: Flexible Proportional-Share Resource Management
Carl A. Waldspurger and William E. Weihl
I implemented lottery scheduling in my CS162 project. Lottery scheduling is a really simple and elegant idea. The ability to support a great deal of threads created a need for a better proportional-share/priority scheduling algorithm. Lottery scheduling works as follows: give each job some number of lottery tickets, on each time slice randomly pick a winning ticket, CPU time is proportional to number of tickets given to each job (on average). Lottery scheduling is probabilistically fair and behaves gracefully as load changes. Increasing the number of tickets also reduces the mean waiting times for each thread and gets closer to ideal fairness. There is even a form of priority donation that is accomplished by transferring a number of tickets to a client that that is blocking. The lottery scheduling mechanism achieves this fairness with relatively little overhead. A tree based lottery only needs to generate a random number and perform log(n) additions and comparisons to select a winner among n clients. Lotteries can also be used to manage many diverse resources: processor time, I/O bandwidth, and access locks. Lottery scheduling can even be used for scheduling communication resources.

Analysis: Lottery scheduling offers probabilistic guarantees for throughput and response time. This results in increased expected error (variability) in as the number of allocations increases. The Lottery Scheduling authors responded with Stride Scheduling to limit the variability and make the error independent of the number of allocations. Stride scheduling still uses tickets to determine proportional throughput among clients.

Monday, September 28, 2009

On Optimistic Methods for Concurrency Control

In this paper, the authors present two methods of non-locking concurrency controls for DBMS: serial and parallel validation. These methods are “optimistic” because they rely on the hope that conflicts will not occur. The authors first make it clear that locking approaches to concurrency control have numerous disadvantages (locking overhead, deadlock). The argument for “optimistic” concurrency control is as follows: reads are completely unrestricted since they can never cause a loss of integrity, and writes are restricted. A transaction consists of two or three phases: a read phase, a validation phase, and a possible write phase. These methods may be superior to locking methods for systems where transaction conflict is highly unlikely, such as query-dominant systems and very large tree indexes. Such an optimistic system would be inefficient where the transaction conflict is not rare. The paper suggests that a system should vary the amount of locking versus optimistic approaches as the likelihood of transaction conflict in the system varies. However, the authors do not delve into such an analysis.

Granularity of Locks and Degrees of Consistency in a Shared Data Base

This paper discusses the granularity of locks in a data base management system: small granularity means high concurrency but high overhead, large granularity means low overhead but low concurrency. A fine lockable system is preferable for a simple transaction which accesses few records and a coarse lockable system is preferable for a complex transaction which accesses many records. Therefore, it is favorable to have lockable units of different granularities coexisting in the same system. This can be accomplished with the use of hierarchical locks, which allow one to lock a node and all of its descendants. Various access modes can be used to ensure that two lock requests by two different transactions are compatible (can be granted concurrently). These ideas can be generalized to work for directed acyclic graphs (DAG), which are really just more general than trees. We also have to remember that indices and files are created and destroyed continually and so the lock graphs are dynamic. While there are issues that must be considered to ensure correctness of this added complication to a DBMS, it is worthwhile due to the need for the performance gain that will result.

Tuesday, September 22, 2009

Experience With Processes and Monitors in Mesa

This paper discusses how to deal with threads (processes running concurrently within a process). This is the period 30 years ago when concurrency was increasing in programs. This paper tried to tackle the issues involved with using monitors to synchronize these threads. The authors chose to synchronize processes with preemptive scheduling of light-weight processes and monitors: need to do I/O, allow for multiprocessors, allow information hiding between modules, more structured locking.

This paper made an attempt to work out the subtle issues involved in implementing a system running many light-weight processes concurrently. Today, the layer of the system that allows for such concurrency functions effectively and allows us to develop many complex systems with very large concurrency.

Wednesday, September 16, 2009

Lightweight Recoverable Virtual Memory

Goal: How simple can a transactional facility be, while remaining a potent tool for fault-tolerance? The authors omitted what they could without crippling RVM.
RVM is intended for Unix applications with persistent data and the paper is presented in three parts: rationale, architecture, and implementation.

Rational: Existing solutions, such as Camelot, were too heavy-weight. The solution was a light version that provides only recoverable virtual memory. The central principle the authors adopted in designing RVM was to value simplicity over generality. The authors followed to systems principle of building simple components that do things well.

Architecture: LRVM is managed in segments as opposed to pages. Regions of the segments are mapped to Unix virtual memory. There is no aliasing: mappings cannot overlap in virtual memory and no region of a segment may be mapped more than once by the same process. Regions can be unmapped at any time, as long as they have no uncommitted transactions outstanding. The designers focused on using meta-data to minimize the disk space that recoverable memory would occupy. Many other small features were also omitted.

Log Management: No undo/redo since value logging strategy because it never reflects uncommitted changes to an external data segment. Crash recovery is idempotent by being done last. Log truncation is also handled.
Optimizations: Intra-transaction optimizations arise when set-range calls specifying identical, overlapping, or adjacent memory addresses are issued within a single transaction. Inter-transaction optimizations occur only in the context of no-flush transactions.

Result: RVM handles more transaction/sec than Camelot in all cases. However, the performance significantly drops when recoverable memory size approaches physical memory size.

Analysis: A lightweight RVM can provide the necessary functionally, while improving performance and modularity. This lesson can be applied in general when building a complex system. Additional features can be added later as components built on top of the system. While this paper did not demonstrate this idea, complex distributed systems today, such as Google, are built with layered component blocks providing a given functionality.

Tuesday, September 15, 2009

Segment-Based Recovery: Write-ahead logging revisited

Large-scale distributed systems have challenges that are not adequately addressed by previous recovery systems, or previous architectures based on single systems in general. This paper proposes segment-oriented recovery to enable distributed recovery architectures that are not hindered by the tight coupling of components required by page-oriented recovery. The distributed variations are quite flexible and enable recovery to be a large-scale distributed service.
Page-oriented recovery leads to a tight coupling between the application, the buffer manager and the log manager. The tight coupling might be fine on a traditional single core machine, but it leads to performance issues when distributing the components to different machines and different cores. Segment-oriented recovery enables simpler and looser coupling among components. Write back caching reduces communication between the buffer manager and application, since the communication occurs only on cache eviction. Since there is no shared state, calls to the buffer manager and log manager can be asynchronous. The use of natural layouts for large objects allows DMA and zero-copy I/O in the local case. In the distributed case, this allows application data to be written without copying the data and the LSNs to the same machine. This allows for very flexible large-scale write-ahead logging as a service for cloud computing.
For small transactions, the networked version is roughly ten times slower than the local versions, but approximately 20 times faster than a distributed, page-oriented approach. As transaction sizes increase, segment-based recovery is better able to amortize network round trips due to log and buffer manager requests, and network throughput improves to more than 400 times that of the page-based approach. As above, the local versions of these benchmarks are competitive with local page-oriented approaches, especially for long transactions.

Analysis: The asynchronous approach is necessary for large-scale distributed systems. Also, recovery based on the granularity of application requests is more in line with the transactions in present large-scale Internet systems. However, the focus on large transaction size may not be valid for the cloud computing that this approach targets. The retrieval of a single web page can require communication with hundreds of small sub-services running on remote nodes. Perhaps some sort of lightweight version (we should consider letting go of some constraints if the application does not require them) would be preferable for the short transactions.

This up to date paper looking at modern challenges was well placed in the course
I hope we have more such updated papers throughout the course.

Tuesday, September 8, 2009

ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-ahead Logging

ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-ahead Logging
ARIES attempts to better address the problem of guaranteeing atomicity and durability of transactions despite concurrent executions of multiple transactions and various failures. ARIES uses a log to record the progress of a transaction and its actions that change recoverable data. The method of logging allows partial or total rollbacks of transactions during both normal and restart processing. A lot of issues were discussed such as latches and locks, fine-granularity locking, and buffer management. The paper goes over the data structures used by ARIES, how ARIES processes transactions, restart processing, and recovery.
There are 3 phases in the Aries recovery algorithm:
-Analysis: Scan the log forward from the most recent checkpoint to identify all Xacts that were active, and all dirty pages in the buffer pool at the time of the crash.
-Redo: Redoes all updates to dirty pages in the buffer pool to ensure that all logged updates are carried out and written to disk.
-Undo: Working backwards in the log, the writes of all Xacts that were active at the crash are undone by restoring the before value of the update.
A crash occurring during the recovery process is also handled.
This paper discussed a detailed system for providing atomicity while still allowing us to have multiple transactions at once and recover from inevitable failures.

The HP AutoRAID Hierarchical Storage System

Management of redundant systems can be complicated. The HP AutoRAID attempts to provide a solution to this problem with a two-level storage hierarchy that is abstracted behind a single disk-array controller. Write-active data are mirrored for better performance while write-inactive data are stored in RAID 5 for better cost. The system begins by allocating mirrored space until no more can be stored this way and automatically reallocates some of the storage space to RAID 5. The HP AutoRAID adapts to the workload changes: newly active data are promoted to mirrored storage while data the have become less active are demoted to RAID 5. The HP AutoRAID also allows for easy disk upgrades of disks of different capacity, which allows for easy expansion and upgrade. The system can even have a second controller that it can automatically change over to in the case of failure of the primary controller. The system is simple to administer and setup. The HP AutoRAID system seems to do a good job of providing performance close to that of nonredundant systems for many workloads. At the same time, it provides full data redundancy and can tolerate failures of any single array component.
This system is a classic example of a system that does achieves its function while abstracting away how the function is carried out. Has this system become obsolete with the trend towards clusters of computers networked together? The cost of storage has gotten so cheap that RAID is not needed in such systems. Redundancy can be achieved by replicating data across multiple computers.

A History of Improving File System Performance

A Fast File System for UNIX
This paper begins by mentioning the types of applications, VLSI design and image processing, that do a small amount of processing on a large quantity of data and need to have a high throughput from the system. In the traditional UNIX file system, the inodes are not consecutively ordered which results in many disk accesses. To improve the file system throughput, a larger block size was used. But a larger block size results in a larger percentage of wasted space. The design team tried to counter this waste by putting small files in fragments within blocks. When this new file system was tested, there were order of magnitude increases in read and write bandwidth. The price that is paid is wasted disk space and increased CPU utilization. With CPU speed and disk size increasing exponentially and disk bandwidth lagging far behind, this seems like a desirable tradeoff.

The Design and Implementation of a Log-Structured File System
This paper reiterates the fact that processor speed is increasing exponentially, while memory and disk speeds are being left behind. While there are ways to increase disk throughput, there are no foreseeable improvements for access time. Information is spread around the disk, which causes too many disk accesses. So it seems that we should attempt to minimize disk accesses. A log-structured file system (LFS), which writes all modifications to disk sequentially, attempts to accomplish this goal.
A traditional file system achieves logical locality by assuming certain access patterns and pays extra on writes, to organize information optimally on disk for the assumed read patterns. In contrast, LFS achieves temporal locality: information that is created or modified at the same time will be grouped closely on disk. If temporal locality matches logical locality, as it does for a file that is written sequentially and then read sequentially, then a LFS should have about the same performance on large files as a traditional file system. If temporal locality differs from logical locality then the systems will perform differently: LFS will perform poorly on reads. This speeds up crash recovery because the system only needs to scan the most recent portion of the log, instead of the entire disk.
The main problem with LFS is degraded performance for logical reads, which is what journaling file systems try to address.

Analysis and Evolution of Journaling File Systems
This paper gives an overview of various journaling file systems in use today. Journaling tries to provide the best o both worlds: data is written to the log sequentially, then moved to is logical location. This provides fast crash recovery, while all normal reads occur from the standard blocks. Journaling is how modern commercial file systems work, such as Unix ext3 and Windows NTFS.

System R and Architecture of a DB Systems

System R
This paper talks about the initial origins of System R, goes through
the development of prototypes, and finally evaluates and draws
conclusions. The paper begins by talking about the relational data
model proposed by Codd in 1970. Codd proposed that data should be
represented by data values and never by any connections visible to the
user, and users should not have to specify algorithms when they make a
request. This is the original SQL DBMS. The question was whether the
second item could be achieved well. The paper continues by specifying
the key goals established for System R. The phase zero prototype was
just intended to determine the feasibility of design methods and
functions. Work then began on the phase one prototype. The paper
continues by talking about the details of the phase one prototype.
The evaluation of the phase one prototype was generally good and was
considered to have been successful in its goals: simplicity, power,
and independence of data. System R demonstrated that a production
database system was feasible. There is not much to criticize. This
is the foundation and inspiration for modern database systems. While
not as evolved as today’s production systems, the evolution has been a
progressive march over decades.

Architecture of a DB System
This is actually a good follow-up to the System R paper. System R
gives the original thinking when the first relational DBMS were being
developed. This paper goes over what has been learned/formalized over
the decades: process models, parallel architectures, ACID.
Essentially, this paper is an overview of various design
considerations that have come up in the history of DBMS.

I would be curious if we are going to discuss the implications of
today’s large scale systems. Not just your CAP theorem, but the
responses to these challenges (distributed file systems,
virtualization). Hopefully in a bit more detail than your syllabus
from last year. For example, with virtualization, could we look at
and discuss various approaches being considered to manage and spawn
VMs based upon need. What implications would this have with
distributed storage systems? It is not as simple as spawning VMs of a
stateless application layer.

Computer Systems Discussions

I am currently taking Eric Brewer's graduate computer systems class at Berkeley.
CS262a: Advanced Topics in Computer Systems