Monday, October 19, 2009

Xen and the Art of Virtualization

Modern computers are sufficiently powerful to use virtualization to present the illusion of many smaller virtual machines (VMs), each running a separate operating system instance. Xen uses paravirtualization in its virtual machine monitor (VMM). A VMM is the layer between software and hardware that allows more than one guest OS to run concurrently. Xen shows good performance and isolation for each VM, while supporting unmodified applications and full operating systems.

When Xen began its development, it could not support a guest OS "as is"; patches to the guest kernel had to be applied. Today, hardware virtualization extensions to the architecture allow unmodified kernels to run over Xen. The IA-32 architecture uses a four-ring privilege model: the most privileged code (usually the OS, but now the Xen VMM) runs in ring 0, rings 1 (where the guest OS now runs) and 2 are mostly unused, and ring 3 is home to applications.

The Xen architecture is as follows:
-The hypercall interface allows domains to perform a synchronous software trap into the hypervisor to perform a privileged operation, analogous to the use of system calls in conventional operating systems.
-Xen allows guests to manage their own page tables (PTs), but limits direct access to read-only. Updates must go through Xen, which validates each change before applying it (see the sketch after this list).
-The hardware interrupt system is replaced by a lightweight event system.
-Xen contains no device drivers. Each device request from a guest OS is redirected by Xen to the host OS device driver, allowing the guest OS to interact with the hardware while remaining hardware-independent.
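
To make the page-table point concrete, here is a rough Python sketch of the validation idea; the class and method names are my own invention, not Xen's actual hypercall interface:

from dataclasses import dataclass

@dataclass
class PTE:
    frame: int        # machine frame this entry maps
    writable: bool    # whether the mapping is writable

class PageTable:
    def __init__(self, pt_frames):
        self.pt_frames = set(pt_frames)   # frames that hold page-table pages
        self.entries = {}                 # index -> PTE

class Hypervisor:
    def __init__(self, frames_owned_by_guest):
        self.owned = set(frames_owned_by_guest)

    def hypercall_mmu_update(self, page_table, index, new_entry):
        # Guests read their page tables directly, but writes arrive here
        # so the hypervisor can validate them before they take effect.
        if new_entry.frame not in self.owned:
            raise PermissionError("guest does not own frame %d" % new_entry.frame)
        if new_entry.frame in page_table.pt_frames and new_entry.writable:
            raise PermissionError("page-table frames must be mapped read-only")
        page_table.entries[index] = new_entry

# Example: the guest owns frames 3, 4, and 7; frame 7 holds a page table.
hv = Hypervisor(frames_owned_by_guest={3, 4, 7})
pt = PageTable(pt_frames={7})
hv.hypercall_mmu_update(pt, 0, PTE(frame=4, writable=True))    # accepted
hv.hypercall_mmu_update(pt, 1, PTE(frame=7, writable=False))   # accepted: read-only map of a PT page
# hv.hypercall_mmu_update(pt, 2, PTE(frame=9, writable=False)) # would raise: frame not owned

The point is simply that the guest can read its tables directly, but every write is checked by the hypervisor before it is applied.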

The scalability provided by Xen's lightweight hypervisor allows many VMs on each physical machine, which in turn allows efficient, fine-grained, and dynamic management of VMs. Xen's network performance matters because an application's VMs may be spread across many physical machines. For example, an application's need for 1,000 machines might be serviced by 4,000 VMs, each running on a quarter of a physical machine.

Please see a previous post for additional comments:
Virtualization vs. OS Separation

Wednesday, October 7, 2009

Parallel Database Systems

Parallel Database Systems: The Future of High Performance Database Systems
by David DeWitt and Jim Gray

I remember reading this paper years ago. Reading it again after having read the MapReduce paper puts the parallel DB mindset into a modern perspective. Anyone who has read the MapReduce paper should read this article.
MapReduce summary

Read together, these papers give perspective on how the problem has expanded, and how it has been attacked, in the decade-plus since the parallel DB article was written. The Parallel DB paper goes over some barriers to speedup: startup, interference, and skew. The scale of the problem and the parallelism needed have increased dramatically since then (multi-PB vs. multi-TB). There was also great foresight in this paper concerning the use of local computation on commodity processors and memory (the shared-nothing architecture).

MapReduce has attempted to formalize and attack these problems and hide them from the user who submits the job. MapReduce can be thought of as a type of parallel SQL. MapReduce attacks each of the barriers to speedup (see MapReduce summary; a toy sketch follows this list):
Startup: MapReduce provides a streamlined, general-purpose runtime, so each job does not pay a large per-query cost to get thousands of machines started.
Interference: MapReduce workers are spread evenly and scheduled as close as possible to the data they must operate on, reducing contention for shared resources such as network bandwidth.
Skew: MapReduce deals with the straggler problem by speculatively re-executing slow tasks (backup tasks).
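
For readers who have not seen it, the programming model itself is tiny: the user supplies only a map function and a reduce function, and the runtime handles the parallelism. A toy, single-machine Python sketch of word counting (nothing like Google's actual implementation):

from collections import defaultdict

# The user supplies only these two functions.
def map_fn(doc_name, text):
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    yield (word, sum(counts))

# A sequential driver standing in for the distributed runtime: it runs the
# maps, groups intermediate values by key (the "shuffle"), then runs reduces.
def run_mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for name, text in inputs:
        for key, value in map_fn(name, text):
            groups[key].append(value)
    return [out for key in sorted(groups) for out in reduce_fn(key, groups[key])]

print(run_mapreduce([("d1", "a b a"), ("d2", "b c")], map_fn, reduce_fn))
# [('a', 2), ('b', 2), ('c', 1)]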

Analysis: The Parallel DB paper tried to look ahead at the evolution of high performance DB systems. The authors did not, and could not have, foreseen the exponential expansion of the internet. The scale of demand upon DB systems requires a change in mindset. Traditionally, the speed of the RDBMS was improved behind the scenes while all of its constraints were maintained. Today, we accept that there are fundamental tradeoffs (see the CAP theorem). By loosening some of the constraints (such as strict consistency), we can achieve the performance provided by modern distributed storage systems.

Monday, October 5, 2009

Lottery Scheduling: Flexible Proportional-Share Resource Management

Lottery Scheduling: Flexible Proportional-Share Resource Management
Carl A. Waldspurger and William E. Weihl
I implemented lottery scheduling in my CS162 project. Lottery scheduling is a really simple and elegant idea. The need to support a large number of threads created a need for a better proportional-share/priority scheduling algorithm. Lottery scheduling works as follows: give each job some number of lottery tickets; on each time slice, randomly pick a winning ticket; CPU time is then proportional to the number of tickets given to each job (on average). Lottery scheduling is probabilistically fair and behaves gracefully as load changes. Increasing the number of tickets also reduces the mean waiting time for each thread and gets closer to ideal fairness. There is even a form of priority donation, accomplished by transferring tickets to a client that is blocking progress. The lottery mechanism achieves this fairness with relatively little overhead: a tree-based lottery only needs to generate a random number and perform log(n) additions and comparisons to select a winner among n clients. Lotteries can also be used to manage many diverse resources: processor time, I/O bandwidth, and access to locks. Lottery scheduling can even be used for scheduling communication resources.
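
Here is a minimal Python sketch of the basic lottery draw (a flat list rather than the tree structure the paper describes; the names are mine):

import random

def hold_lottery(clients):
    # clients: dict mapping client name -> number of tickets held.
    # Returns the client whose ticket wins this time slice.
    total = sum(clients.values())
    winner = random.randrange(total)          # draw a winning ticket number
    for name, tickets in clients.items():
        if winner < tickets:
            return name
        winner -= tickets

# Over many slices, CPU share is proportional to ticket count on average.
clients = {"A": 3, "B": 1}
wins = {name: 0 for name in clients}
for _ in range(10000):
    wins[hold_lottery(clients)] += 1
print(wins)   # roughly {'A': 7500, 'B': 2500}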

Analysis: Lottery scheduling offers probabilistic guarantees for throughput and response time. This results in increased expected error (variability) as the number of allocations increases. The lottery scheduling authors responded with stride scheduling, which limits the variability and makes the error independent of the number of allocations. Stride scheduling still uses tickets to determine proportional throughput among clients.
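
For comparison, a rough Python sketch of the core stride scheduling loop as I understand it (the constant and field names are illustrative, not the paper's code): each client advances a "pass" value by its stride, which is inversely proportional to its tickets, and the client with the lowest pass runs next.

STRIDE1 = 1 << 20   # large constant so per-client strides stay integral

class Client:
    def __init__(self, name, tickets):
        self.name = name
        self.stride = STRIDE1 // tickets   # fewer tickets -> larger stride
        self.passval = 0

def schedule(clients, slices):
    counts = {c.name: 0 for c in clients}
    for _ in range(slices):
        current = min(clients, key=lambda c: c.passval)  # lowest pass runs next
        current.passval += current.stride
        counts[current.name] += 1
    return counts

print(schedule([Client("A", 3), Client("B", 1)], 8))
# deterministic 3:1 split, e.g. {'A': 6, 'B': 2}

Unlike the lottery, this allocation is deterministic, which is how stride scheduling keeps the error from growing with the number of allocations.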