Wednesday, April 29, 2009

Ubuntu 9.04 beta out, now with fresh Eucalyptus

Cloud Economics

Problem: Economic allocation of cloud resources within a data center.

Kevin Lai from HP Labs came to discuss his work in this field. Kevin brought up some interesting point and I will summarize two: Optimization should be done across multiple layers and Using a bidding system to optimize provisioning of resources.

Optimization across multiple layers
Kevin argues that optimization cannot merely be done in each component: Application, Distributed storage/computing, Virtualization/Network/OS, Physical/Power. An overarching optimization must be done across these components with each layer coordinating with a general optimizer. However, abstraction has allowed for a reduction of complexity and innovation. Google is an excellent example. Many applications and services for applications (BigTable, Chubby) are built upon GFS. This modular architecture was built up over time as needed. There is also a great deal of innovation occurring in the open source space (Hadoop, etc). While overarching optimization may occur in the future, at this time, it may stifle innovation by preventing changes to a given component.

Bidding System
There is an argument to be made that some sort of bidding system may help mitigate supply and demand issues between cloud providers and customers. However, some customers may want cost guarantees provided by a flat rate per use.
The most interesting aspect of Kevin's proposals are using a bidding system to allocate resources within the data center. Such a bidding system can be used to create a predictability model which can tradeoff a bid for resources, QOS, and a guarantee (probability of completion). On average, this model can allow jobs to complete more work within a given time.
This bidding system can also be used to provision resources. Price inflation for a given resource is an indication that there is under provisioning.

Monday, April 27, 2009


Problem: Large scientific computations require many computers in order to effectively process the data.

Solution: BOINC is a software system that makes it easy for scientists to create and operate public-resource computing projects. This allows researchers to distribute the computation to millions of regular computer users through a client-side platform that runs when the computer is idle. The paper describes features that went into BOINC to make it easy for scientists and attractive for users. Incentives for users are a nice interface and a count of how much they are contributing to the project (Top Alien Hunters).

Influence: I remember running SETI@home 24/7 in the background on my P3 600MHz overclocked to 800MHz.
Main project website:
Listen for ET’s call:
Find gravity waves:


Problem: Distributing a large files to multiple users can overload the resources of a hosting machine.

Solution: BitTorrent is a peer-to-peer protocol that lets peers download pieces of the file from various peers and upload other pieces to them. This redistributes the cost of upload to the downloaders. Some of the challenges include figuring out which peers have what parts of the file and high churn rates. The rarest first technique and the choking and unchoking algorithms allow BitTorrent efficiency to mitigate the challenges.
Rarest First: Replicating the rarest pieces as quickly as possible reduces the risk of them getting completely lost as current peers stop uploading. It also makes sure that pieces which are more common are left for later, so the likelihood that a peer which currently is offering upload will later not have anything of interest is reduced.
Chocking/Unchocking: A good choking algorithm should utilize all available resources, provide reasonably consistent download rates for everyone, and be somewhat resistant to peers only downloading and not uploading. To avoid situations in which resources are wasted by rapidly choking and unchoking peers, BitTorrent peers recalculate who they want to choke once every ten seconds, and then leave the situation as is until the next ten second period is up. Ten seconds is a long enough period of time for TCP to ramp up new transfers to their full capacity.

Analysis: If there are few leechers, BitTorrent works very well. However, if a substantial number of peers tried to download without uploading, there would be problems. Not many users are very abusive today which is a real world demonstration of the prisoner’s dilemma.

Measuring and Evaluating Large-Scale CDNs

Problem: CDNs play a critical and central part of today’s Internet infrastructure. There is a need to measure the performance of large-scale commercial CDNs. Most major websites that use CDNs would be interested in the results. While CDNs have internal performance metrics they present to potential customers, this study is done from the outside, which may give websites more information on which to base their decisions.


Monday, April 20, 2009


Problem: There is a growing set of Internet-based services that are too big, or too important, to run at a single site. Examples include Web services for e-mail, video and image hosting, and social networking. This is the same reasoning behind Facebook’s need to scale out (see below).

Solution: WheelFS is a wide-area distributed storage system intended to help multi-site applications share data and gain fault tolerance. Different distributed applications might need different properties in a storage system: they might need to see the latest copy of some data, and be willing to pay a price in high delay, or they may want data to be stored durably, or have specific preferences for which site stores a document. Thus, in order to be a usable component in many different systems, a distributed storage system needs to expose a level of control to the surrounding application.
WheelFS allows applications to adjust the default semantics and policies with semantic cues, while maintaining the standard POSIX interface. WheelFS offers a small set of cues in four categories: placement, durability, consistency, and large reads.

Tradeoffs: Every file or directory object has a single “primary” WheelFS storage server that is responsible for maintaining the latest contents of that object. WheelFS maintains the POSIX interface even in the background and accessing a single file could result in communication with several servers, since each subdirectory in the path could be served by a different primary.
In addition, a disadvantage of cues is that they may break software that parses pathnames and assumes that a cue is a directory. Another is that links to pathnames that contain cues may trigger unintuitive behavior. However, they have not encountered examples of these problems.

Future Influence: The main insight from WheelFS is that different distributed applications need different properties in a storage system. In order to use have a distributed storage system as a usable component upon which many different applications are built, the storage system must allow the application to control various policies.
It is unclear whether a truly wide-area distributed storage system is the best approach. A middle ground may be found with distributed storage within each datacenter with updates pushed out to the other datacenters. This will create greater isolation between datacenters to prevent cascading failures.

Saturday, April 18, 2009

Scaling Out

Problem: As a website grows it scales up it’s datacenter. However, it will eventually need to scale out to other datacenters for a number of reasons, including space, power, and disaster recovery. With only one datacenter, in the event of a disaster, Facebook could be unusable for extended periods of time. Also, building datacenters in different geographic regions can reduce latency. For websites such as Facebook, this is a great concern because a page is built from a variety of sub-services.

Solution: The first step was to build out the servers and the physical space in Virginia. FB then brought up the intra-datacenter network and a low latency inter-datacenter fiber channel link. With the network and hardware in place FB set up their standard 3 tier architecture: web server, memcache server, and MySQL database. The MySQL databses in Virginia run as slaves of the west coast databases, so FB spent a couple weeks copying streams. After the hardware, network, and basic infrastructure was set up it was time to face the two main application level challenges: cache consistency and traffic routing.

Cache Consistency
FB’s caching model: when a user modifies a data object the infrastructure will write the new value in to a database and delete the old value from memcache (if it was present). The next time a user requests that data object they pull the result from the database and write it to memcache. Subsequent requests will pull the data from memcache until it expires out of the cache or is updated again.
Problem: Now let's say we delete the value from Virginia memcache tier at the time we update the master database in California. A subsequent read from the slave database in Virginia might see the old value instead of the new one because of replication lag. Then Virginia memcache would be updated with the old (incorrect) value and it would be “trapped” there until another delete. In the worst case the Virginia memcache tier would always be one “version” behind of the correct data.
Solution: FB made a small change to MySQL that allows them to tack on extra information in the replication stream that is updating the slave database. They used this feature to append all the data objects that are changing for a given query and then the slave database “sees” these objects and is responsible for deleting the value from cache after it performs the update to the database. The changes to MySQL are only 150 lines of code, so merging with a new version of MySQL is not difficult.

Page Routing
Problem: Only FB’s master databases in California could accept write operations. This fact means FB needed to avoid serving pages that did database writes from Virginia because each one would have to cross the country to the master databases in California. Fortunately, FB’s most frequently accessed pages (home page, profiles, photo pages) don't do any writes under normal operation. When a user makes a request for a page, FB must decide if it is “safe” to send to Virginia or if it must be routed to California.
Solution: One of the first servers a user request to Facebook hits is a load balancer. This load balancer has the capability to make routing decisions based on the user request. This feature meant it was easy to tell the load balancer about “safe” pages and it could decide whether to send the request to Virginia or California based on the page name and the user's location.
Problem with Solution: Going to the edit profile.php page to change your hometown isn't marked as safe, so it gets routed to California and you make the change. Then you go to view your profile and, since it is a safe page, you get sent to Virginia. Because of the replication lag, you might not see the change you just made, which leads to double posting. FB resolves this concern by setting a cookie in your browser with the current time whenever you write something to the databases. The load balancer looks for that cookie and, if it notices that you wrote something within 20 seconds, will unconditionally send you to California.

Trade offs/Remaining Challenges: The main scaling challenge with this architecture is that all write operations must happen in one location. FB feels this approach does scale out to other datacenters, with the obvious limitation that there is a single master to accept writes. FB can just keep buying servers in that datacenter to keep up with the write traffic. The downside is the extra latency paid by users who are “far” from the datacenter on their write operations. In addition, this restriction requiring all writes to happen in one location results in a number of hacks to mitigate the problems, even in normal operation. Writes should be able to occur at any location with the updates being pushed out to other datacenters as they currently are from California.
If the west coast datacenter goes down FB is in very bad shape. FB is working on a disaster recovery procedure where they “upgrade” Virginia to the master but it's a tricky process because of all the one-off services FB has running in California that need to be properly replicated.

Conclusion: FB’s scaling out seems to be effective for the purpose of meeting the requests from their rapidly growing user base. The extra latency paid by users who are “far” from the datacenter on their write operations is an acceptable tradeoff for now, considering the rapid growth of their user base. However, the potential failure of the California datacenter is a frightening scenario. FB is a young company and has never been tested under such a scenario.

Future Influence: FB has given a good roadmap for scaling out quickly with a solution that works in serving user requests. Keeping the writes at a single data center can be sidestepped because they are more limited and can still be scaled up at that datacenter. However, in the future, FB will abandon this anchor holding it back in order to protect itself against the failure of the California datacenter and bring writes closer to where they are more likely to be used (where they were applied), minimizing latency and the need for as many hacks to resolve the issues it creates.

Wednesday, April 15, 2009

Open Clouds

Open Cloud Manifesto, Cloud: commodity or proprietary, Portable Cloud Computing, AppDrop all deal with the same problem.

Problem: The problem is that cloud providers are offering different clouds that differ in basic interfaces and level of access to underlying components. For example, Amazon's services are bottom up (here's a CPU and storage, now install your own software and use the components you need) and Google's is top down (write some code to these APIs and we'll scale it for you). One’s application code becomes dependent on a particular vendor.

Once an application becomes sufficiently complex, moving it from one cloud to another becomes difficult, placing folks at the mercy of their cloud provider. Cloud providers have every incentive to develop proprietary APIs in order to lock folks into their services. Also, moving lots of data between providers may make this difficult or expensive to do in practice (data inertia).

This is in contrast to most web applications today. The LAMP stack allows one to build vendor-neutral applications from free parts and select from a competitive, commodity hosting market.

Solution: The Open Cloud Manifesto goes over several challenges to cloud adoption, such as the portability and interoperability, that must be addressed through open collaboration. It suggests that this is the right time for industry participants to work to ensure that innovations should be guided by the principals of openness so that customers will feel confident in building their applications to run on cloud providers.

AppDrop is a container for applications developed with the Google App Engine SDK, to run entirely on Amazon's EC2 infrastructure, and they work without modification. However, this simple portability comes at the cost of scalability. The App Engine SDK doesn't use BigTable for its datastore, instead relying on a simple flat file on a single server. This means there is no scalabity, but for apps with limited resource needs, something as simple as AppDrop would work (of course, scalability is a primary reason for using a cloud).

Hadoop provides a lot of the building blocks for building cloud services. Its current focus is on batch computing, but several of its components are also key to cloud hosting. HDFS provides a scalable, distributed filesystem. HBase or Couch DB provide a database comparable to Amazon’s Simple DB and Google’s Datastore API.

Future Influence: The open-source community will likely play a significant role in interoperability and migration between clouds. Also, in order to take market share from competitors, cloud providers may build APIs that allow application builders to more easily move. AppDrop is a basic start. Amazon may consider building on this to more effectively tie in the Google App Engine API with Amazon's underlying services in order to provide the same scalability.

Wednesday, April 1, 2009


Problem: There is a need for languages that allow you to write reliable, scalable, production distributed systems.

Solution: What you need is true concurrency. You need lightweight processes, no shared memory, asynchronous message passing and mechanisms to change code on the fly so that programs can evolve and change as they run in non-stop systems.

Erlang’s main idea is to give you a more restrictive programming model provides good primitives for writing distributed applications while limiting features that can cause problems.

Concurrency - Erlang has extremely lightweight processes whose memory requirements can vary dynamically. Processes have no shared memory and communicate by asynchronous message passing. Erlang supports applications with very large numbers of concurrent processes. No requirements for concurrency are placed on the host operating system.

Distribution - Erlang is designed to be run in a distributed environment. An Erlang virtual machine is called an Erlang node. A distributed Erlang system is a network of Erlang nodes. An Erlang node can create parallel processes running on other nodes, which perhaps use other operating systems. Processes residing on different nodes communicate in exactly the same was as processes residing on the same node.

Robustness - Processes in a distributed system can be configured to fail-over to other nodes in case of failures.

Hot code upgrade - Many systems cannot be stopped for software maintenance. Erlang allows program code to be changed in a running system. So it is possible to install bug fixes and upgrades in a running system without disturbing its operation.

External interfaces - Erlang processes communicate with the outside world using the same message passing mechanism as used between Erlang processes.

Erlang’s evolution from a telecom environment means that it was developed a high degree reliability. Telecoms also require upgrades with no down time and handling/isolation of failures.

Criticism: “I think the problem with Erlang is that it's more than 20 years old and for most of this time haven't been exposed enough to developer community at large. It's like raising a child in a cellar for all its childhood and don't let it interact and learn from his/her peers.” This quote from another commenter sums up my central criticism of Erlang.

Tradeoffs: The main trade-off is that one gains the inherent concurrency and parallelization of processes while having to deal with an unwieldy 20 year old language. One should be able to use another programming language with a model of actors communicating by asynchronous messages and prohibiting shared state between actors.

Future Influence: Amazon, Facebook and other companies are now using Erlang for web systems. Erlang is being used because companies select the technology that is best suited to survive while evolving. I expect that other languages will be developed over time that will give us that will give us the best of modern languages and merely prohibit features that are not well handled in distributed systems, such as shared state.