Wednesday, May 6, 2009

Virtualization vs. OS separation

There was a discussion about virtualization software in class today:
It was suggested that we should advance OS research rather than rely on virtualization software, because of the overhead involved. However, a return to basics may be in order. A review of the original IBM virtual machine paper may allow us to step back and get a better perspective.
Virtual storage and virtual machine concepts
A virtual machine is simply a machine-sharing system; over time, this idea evolved into the OS we know today.

Virtualization software is the new datacenter OS:
-A strict separation between each cloud client is necessary to provide the illusion and security of running the application in isolation from other applications.
-A virtual machine (VM) isolating each application, which can scale to millions of users over many actual machines, is akin to the OS, which isolates each application for one user.
-Virtualization software, and CPU support for it, should reach the point where there is no more overhead than using an OS to run multiple applications.
-Virtualization software must simply be CPU/machine time/resource-sharing software, but with a true wall between each VM.
-Communication between VMs on the same machine should be the same as communication between machines (see the sketch after this list).
-A lightweight OS within each VM should be used to manage the processes involved with an application.
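
To make the point about communication concrete, here is a minimal Python sketch (my own illustration, not from any of the papers): a VM talks to a peer over ordinary sockets, and the code is identical whether the peer VM shares the physical machine or sits on another host. The hostnames and port are hypothetical.

import socket

# Hypothetical peer addresses: "vm-a.local" might resolve to a VM on the same
# physical host, "vm-b.remote" to a VM on a different machine entirely.
PEERS = {
    "same-host": ("vm-a.local", 9000),
    "other-host": ("vm-b.remote", 9000),
}

def send_message(peer, payload):
    # The calling code does not change based on where the peer VM lives;
    # only the address it connects to does.
    host, port = PEERS[peer]
    with socket.create_connection((host, port), timeout=5.0) as conn:
        conn.sendall(payload)
        return conn.recv(4096)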


Paper comments:


While I agreed with some of their high-level thinking, their comments on communication between VMs stood out:
"hence synchronization and protected control transfer are only necessary when two virtual machines wish to explicitly communicate."
"where VMs do communicate, they may not only be written in separate programming languages, but may also be running completely different operating systems."

The problem with these comments is that communication between VMs running the same application should be the common case. This is necessary to achieve scalability. I wonder if their suggestion to separate control and data paths is sufficient, considering that there may be a great deal of small-scale communication between VMs.

Monday, May 4, 2009

Intel CERN whitepaper

While this paper is marketing material from Intel, it makes some good points about the value of moving to multi-core chips and of using virtualization to consolidate servers with low utilization.

Problem: Most businesses already spend about half as much for the electricity to power and cool their infrastructure as they do for the hardware itself, and this percentage is expected to increase. This challenge is compounded by the design constraints of existing data centers, many of which are already running at or near thermal capacity. Unless energy efficiency is dramatically improved, organizations will be unable to expand their computing infrastructure without the expense and disruption of upgrading their data center, building a new one, or migrating to a co-location facility.
The goal was to maximize total performance per Watt for the computing infrastructure. This can allow datacenters to grow their computing capacity, reduce their costs, and extend the life of existing facilities.

Solution: CERN has found that its return on investment (ROI) is generally highest by optimizing datacenter performance/Watt. Multi-core processors based on the Intel Core microarchitecture deliver about five times more compute power per Watt than single-core processors based on the earlier Intel NetBurst microarchitecture. According to CERN, this move alone has already increased the useful life of its data center by about two years, enabling the organization to avoid the cost and disruption of adding a new facility.

This energy efficiency can be achieved with a basic understanding of circuit design. A small reduction in frequency causes a small reduction in the amount of work performed, but a relatively large drop in the amount of energy consumed. As a result, more cores running at lower frequencies can deliver substantial gains in total performance per Watt.
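
A back-of-the-envelope sketch of that argument (my numbers and assumptions, not the whitepaper's): if dynamic power scales roughly with the cube of frequency (voltage scales roughly linearly with frequency, and power goes as C * V^2 * f) and throughput scales linearly with cores and frequency, then two cores at 80% frequency beat one core at full frequency on performance per Watt.

# Rough model (assumed, not from the whitepaper): power ~ frequency^3,
# throughput ~ cores * frequency in the ideal case.
def perf_per_watt(cores, freq_ratio):
    perf = cores * freq_ratio            # relative throughput
    power = cores * freq_ratio ** 3      # relative power draw
    return perf / power

print(perf_per_watt(1, 1.0))   # baseline single core: 1.0
print(perf_per_watt(2, 0.8))   # two cores at 80% frequency: ~1.56x perf/Watt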

Virtualization can be used to consolidate smaller and infrequently used applications, which reduces the number of servers required for these secondary workloads. Better utilization provides energy efficiency gains, reduces data center footprints, and provides a more flexible and manageable infrastructure.
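
A toy consolidation calculation (hypothetical numbers, purely to illustrate the paper's argument): ten secondary workloads that each keep a dedicated server about 10% busy could, in principle, share two hosts while still leaving headroom.

import math

# Hypothetical numbers: ten secondary workloads, each keeping a server ~10% busy.
utilizations = [0.10] * 10
target_utilization = 0.60    # leave headroom on each consolidated host

hosts_needed = math.ceil(sum(utilizations) / target_utilization)
print(hosts_needed)   # 2 hosts instead of 10, before counting VM overhead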

Conclusion: This paper brings up a couple of good points. While new datacenters may not have to optimize power consumption to the extreme that CERN does, they should be aware of the long-term consequences of their decisions, which may create limitations in the future. Also, while virtualization adds overhead and is not preferred for situations where a particular application already keeps a server highly utilized, it can provide clear benefits when used to consolidate underutilized servers.

Microsoft PUE: Parts 1 & 2

Problem: Given the complexity of datacenter design and operation, energy efficiency changes must be closely monitored for overall effect. You can upgrade items to more energy-efficient equivalents, but unless you look at the big picture and understand how the pieces fit together, you could end up being disappointed with the outcome.

Solution: PUE is suggested as the indicator of whether efficiency actually got better or worse. PUE = (Total Facility Power)/(IT Equipment Power)
PUE is a simple metric for getting a big-picture view of datacenter efficiency, design, and cost effectiveness. Without a metric like PUE, the authors suggest, an engineer could not measure datacenter efficiency to see whether it had improved.
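
A quick worked example of the formula (the power figures are made up, not from the papers): a facility drawing 1.6 MW at the utility meter while delivering 1.0 MW to the IT equipment has a PUE of 1.6; the closer to 1.0, the less overhead.

def pue(total_facility_kw, it_equipment_kw):
    # PUE = total facility power / IT equipment power (from the papers)
    return total_facility_kw / it_equipment_kw

# Made-up readings: 1600 kW at the meter, 1000 kW reaching the IT equipment;
# the remainder goes to cooling, power distribution losses, etc.
print(pue(1600.0, 1000.0))   # 1.6 -- closer to 1.0 means less overhead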

Analysis: Total datacenter costs must be considered: servers, infrastructure, and running costs (energy, management, etc.). In 2001, the sum of infrastructure and energy costs was equal to the cost of a 1U server. In 2004, the infrastructure cost alone was equal to the cost of the server. In 2008, the energy cost alone was equal to the cost of a server.
PUE is a simple metric that captures datacenter efficiency in terms of how much overhead there is for a given set of servers in the datacenter. However, PUE neglects the actual servers themselves; work/Watt is an important metric that cannot be overlooked. It is also likely that servers can be upgraded more easily than datacenter infrastructure.

Conclusion: PUE is a useful metric to analyze the overhead beyond the servers but is not the only necessary metric. Work/Watt is important. The power used by computer equipment to accomplish the work/computation is the central datacenter energy efficiency issue.