Problem: As a website grows it scales up it’s datacenter. However, it will eventually need to scale out to other datacenters for a number of reasons, including space, power, and disaster recovery. With only one datacenter, in the event of a disaster, Facebook could be unusable for extended periods of time. Also, building datacenters in different geographic regions can reduce latency. For websites such as Facebook, this is a great concern because a page is built from a variety of sub-services.
Solution: The first step was to build out the servers and the physical space in Virginia. FB then brought up the intra-datacenter network and a low latency inter-datacenter fiber channel link. With the network and hardware in place FB set up their standard 3 tier architecture: web server, memcache server, and MySQL database. The MySQL databses in Virginia run as slaves of the west coast databases, so FB spent a couple weeks copying streams. After the hardware, network, and basic infrastructure was set up it was time to face the two main application level challenges: cache consistency and traffic routing.
FB’s caching model: when a user modifies a data object the infrastructure will write the new value in to a database and delete the old value from memcache (if it was present). The next time a user requests that data object they pull the result from the database and write it to memcache. Subsequent requests will pull the data from memcache until it expires out of the cache or is updated again.
Problem: Now let's say we delete the value from Virginia memcache tier at the time we update the master database in California. A subsequent read from the slave database in Virginia might see the old value instead of the new one because of replication lag. Then Virginia memcache would be updated with the old (incorrect) value and it would be “trapped” there until another delete. In the worst case the Virginia memcache tier would always be one “version” behind of the correct data.
Solution: FB made a small change to MySQL that allows them to tack on extra information in the replication stream that is updating the slave database. They used this feature to append all the data objects that are changing for a given query and then the slave database “sees” these objects and is responsible for deleting the value from cache after it performs the update to the database. The changes to MySQL are only 150 lines of code, so merging with a new version of MySQL is not difficult.
Problem: Only FB’s master databases in California could accept write operations. This fact means FB needed to avoid serving pages that did database writes from Virginia because each one would have to cross the country to the master databases in California. Fortunately, FB’s most frequently accessed pages (home page, profiles, photo pages) don't do any writes under normal operation. When a user makes a request for a page, FB must decide if it is “safe” to send to Virginia or if it must be routed to California.
Solution: One of the first servers a user request to Facebook hits is a load balancer. This load balancer has the capability to make routing decisions based on the user request. This feature meant it was easy to tell the load balancer about “safe” pages and it could decide whether to send the request to Virginia or California based on the page name and the user's location.
Problem with Solution: Going to the edit profile.php page to change your hometown isn't marked as safe, so it gets routed to California and you make the change. Then you go to view your profile and, since it is a safe page, you get sent to Virginia. Because of the replication lag, you might not see the change you just made, which leads to double posting. FB resolves this concern by setting a cookie in your browser with the current time whenever you write something to the databases. The load balancer looks for that cookie and, if it notices that you wrote something within 20 seconds, will unconditionally send you to California.
Trade offs/Remaining Challenges: The main scaling challenge with this architecture is that all write operations must happen in one location. FB feels this approach does scale out to other datacenters, with the obvious limitation that there is a single master to accept writes. FB can just keep buying servers in that datacenter to keep up with the write traffic. The downside is the extra latency paid by users who are “far” from the datacenter on their write operations. In addition, this restriction requiring all writes to happen in one location results in a number of hacks to mitigate the problems, even in normal operation. Writes should be able to occur at any location with the updates being pushed out to other datacenters as they currently are from California.
If the west coast datacenter goes down FB is in very bad shape. FB is working on a disaster recovery procedure where they “upgrade” Virginia to the master but it's a tricky process because of all the one-off services FB has running in California that need to be properly replicated.
Conclusion: FB’s scaling out seems to be effective for the purpose of meeting the requests from their rapidly growing user base. The extra latency paid by users who are “far” from the datacenter on their write operations is an acceptable tradeoff for now, considering the rapid growth of their user base. However, the potential failure of the California datacenter is a frightening scenario. FB is a young company and has never been tested under such a scenario.
Future Influence: FB has given a good roadmap for scaling out quickly with a solution that works in serving user requests. Keeping the writes at a single data center can be sidestepped because they are more limited and can still be scaled up at that datacenter. However, in the future, FB will abandon this anchor holding it back in order to protect itself against the failure of the California datacenter and bring writes closer to where they are more likely to be used (where they were applied), minimizing latency and the need for as many hacks to resolve the issues it creates.