Tuesday, February 10, 2009

GFS: The Google File System

Problem: Google required a scalable distributed file system for large distributed data-intensive applications. This file system had to provide fault tolerance while running on inexpensive commodity hardware, and deliver high aggregate performance to a large number of clients.

Solution/Analysis: GFS presents clients with an interface of familiar file operations: open, create, read, write and close. The team also included two specialized operations: record append and snapshot. Record append lets a client (or several clients concurrently) add data to the end of an existing file without overwriting previously written data. Snapshot creates a low-cost copy of a file or directory tree.
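
Roughly, that client-facing interface might look like the sketch below; the class and method names are my own shorthand for what the paper describes, not the actual GFS client library.

```python
# Illustrative only: a stub of the client-facing operations described above.
class GFSClient:
    def create(self, path: str) -> None: ...
    def open(self, path: str) -> "FileHandle": ...
    def read(self, handle, offset: int, length: int) -> bytes: ...
    def write(self, handle, offset: int, data: bytes) -> None: ...
    def close(self, handle) -> None: ...

    def record_append(self, handle, data: bytes) -> int:
        """GFS picks the offset and guarantees the record is appended
        atomically at least once, so concurrent appenders cannot
        overwrite one another."""

    def snapshot(self, source: str, target: str) -> None:
        """Makes a low-cost copy of a file or directory tree."""
```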

Google organized the GFS into clusters of computers, built from many inexpensive commodity components that often fail. Within GFS clusters there are three kinds of entities: clients, master servers and chunkservers. A "client" refers to any entity that makes a file request. Requests can range from retrieving and manipulating existing files to creating new files on the system. Clients can be other computers or computer applications.

Master server: The master server acts as the coordinator for the cluster. The master's duties include maintaining an operation log, a durable record of the metadata changes made in the master's cluster. The operation log helps keep service interruptions to a minimum -- if the master server crashes, a replacement server that has been following the operation log can take its place. The master server also keeps track of metadata, the information that describes the chunks: which file each chunk belongs to and where it fits within that file. Upon startup, the master polls all the chunkservers in its cluster, and the chunkservers respond by reporting the contents of their inventories. From that moment on, the master server keeps track of the location of chunks within the cluster.
There's only one active master server per cluster at any one time (though each cluster has backup copies of the master's state in case of a hardware failure). GFS avoids a bottleneck at the master by minimizing the messages the master server sends and receives. The master server doesn't actually handle file data at all; that job belongs to the chunkservers.
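
To make the master's bookkeeping concrete, here is a toy sketch of the in-memory state described above. The structure is my guess at a minimal version, not Google's actual implementation.

```python
from collections import defaultdict

class MasterState:
    def __init__(self):
        self.file_chunks = {}                    # file path -> ordered list of chunk handles
        self.chunk_locations = defaultdict(set)  # chunk handle -> set of chunkserver ids
        self.operation_log = []                  # durable record of metadata mutations

    def register_chunkserver_report(self, server_id, chunk_handles):
        # Called when a chunkserver reports its inventory (at startup or in
        # later heartbeats); the master never persists chunk locations.
        for handle in chunk_handles:
            self.chunk_locations[handle].add(server_id)

    def log_mutation(self, record):
        # Every metadata change is appended to the operation log before it is
        # applied, so a replacement master can replay the log after a crash.
        self.operation_log.append(record)
```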

Chunkserver: GFS files tend to be very large, usually in the multi-gigabyte (GB) range. GFS breaks files up into manageable chunks of 64 megabytes (MB) each. Every chunk receives a unique 64-bit identifier called a chunk handle. Chunkservers are responsible for storing the 64-MB chunks and send requested chunk data directly to the client. GFS copies every chunk multiple times and stores the copies on different chunkservers. Each copy is called a replica. By default, GFS makes three replicas per chunk, but users can change the setting and make more or fewer replicas if desired. I wonder if GFS has been updated with some policy for dynamic replication based on need.
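
The chunk arithmetic itself is simple. Here is a sketch of how a byte offset in a file maps onto a 64-MB chunk; the constants come from the paper, the function name is mine.

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks
DEFAULT_REPLICATION = 3         # replicas per chunk unless the user overrides it

def locate(offset: int) -> tuple[int, int]:
    # Which chunk holds this byte, and how far into that chunk it sits.
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE

print(locate(200_000_000))  # -> (2, 65782272): the third chunk, about 63 MB in
```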

File Requests: GFS separates a chunk's replicas into two categories: a primary replica, which coordinates changes to the chunk, and secondary replicas. There does not seem to be a fundamental need for this differentiation.

Read request: The client sends a read request to the master server to find out where it can find a particular chunk of a file on the system. The master responds with the chunk handle and the locations of the replicas holding that chunk, and the client reads the data directly from a chunkserver. This is reminiscent of how the old Napster worked: go to a directory (master server) which tells you where to get the data. Modern distributed file systems can give you the “nearest” copy as well.
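
Put as code, the read path might look like the following. The master and chunkserver objects are hypothetical stubs, and the "closest replica" policy is just a placeholder.

```python
CHUNK_SIZE = 64 * 1024 * 1024

def pick_closest(replicas):
    # Placeholder policy: a real client would prefer a replica on the same
    # rack or machine; here we simply take the first one listed.
    return replicas[0]

def gfs_read(master, filename, offset, length):
    chunk_index = offset // CHUNK_SIZE
    # The master answers only the metadata question: which chunk, held where.
    handle, replica_locations = master.lookup(filename, chunk_index)
    # The file data itself flows straight from a chunkserver to the client,
    # never through the master.
    server = pick_closest(replica_locations)
    return server.read_chunk(handle, offset % CHUNK_SIZE, length)
```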

Write requests: The client sends a request to the master server, which replies with the locations of the primary and secondary replicas of the chunk. If a client issues a write that spans multiple chunks of a particularly large file, GFS breaks the overall write request up into an individual request for each chunk.
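
As a rough illustration of that splitting, a write that crosses chunk boundaries decomposes into one request per chunk; again, the names and structure here are mine.

```python
CHUNK_SIZE = 64 * 1024 * 1024

def split_write(offset: int, data: bytes):
    # Break a write into (chunk_index, offset_within_chunk, bytes) pieces,
    # one per chunk it touches.
    requests = []
    while data:
        room = CHUNK_SIZE - (offset % CHUNK_SIZE)   # bytes left in this chunk
        piece, data = data[:room], data[room:]
        requests.append((offset // CHUNK_SIZE, offset % CHUNK_SIZE, piece))
        offset += len(piece)
    return requests
```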

Stale Chunks: Whenever the master authorizes new changes to a chunk, it increments that chunk's version number. This is how the master server can tell the difference between up-to-date replicas and out-of-date copies known as stale replicas.
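
In miniature, stale-replica detection is just a version comparison; this sketch assumes the master already knows the current version for the chunk.

```python
def find_stale_replicas(current_version: int, reported: dict) -> list:
    # 'reported' maps chunkserver id -> the chunk version that server holds.
    # Anything older than the master's current version is stale and will be
    # ignored and eventually garbage-collected.
    return [server for server, version in reported.items()
            if version < current_version]

print(find_stale_replicas(7, {"cs1": 7, "cs2": 6, "cs3": 7}))  # -> ['cs2']
```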

Reliable system built on unreliable components: The GFS components report their status through periodic messages called heartbeats and handshakes, which allow the master server to stay current with each chunkserver's state. The GFS developers built functions into the system to compensate for the inherent unreliability of individual components. Those functions include master and chunk replication, a streamlined recovery process, rebalancing, stale replica detection, garbage removal and checksumming.
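
Of that list, checksumming is the easiest to show in a few lines: each chunk is checksummed in small blocks (64 KB in the paper), so a corrupted read is caught before it reaches the client. The code below is a toy version using CRC32.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # checksum granularity within a chunk

def checksum_blocks(chunk_data: bytes) -> list[int]:
    # One 32-bit checksum per 64 KB block of the chunk.
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

def verify(chunk_data: bytes, stored_checksums: list[int]) -> bool:
    # On a mismatch, the chunkserver returns an error and the master can
    # re-replicate the chunk from a good copy.
    return checksum_blocks(chunk_data) == stored_checksums
```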

Bandwidth focus: Google developers are greatly concerned with bandwidth because Google applications manipulate very large files and many clients make requests at once. By lagging behind the leading edge of hardware technology, Google can purchase equipment and components at bargain prices. The structure of GFS is such that it's easy to add more machines at any time and rebalance the workload.

Future viability: GFS seems to provide Google the file system it needs for an application such as search. However, this paper was written five years ago, and Google has since expanded into providing many different applications. Google's assumptions about file use will not hold for many of these newer applications, which require frequent updates to data that must remain available for reading. This paper does provide a basic principle that will hold: data can only be made reliably available to many clients through replication across many servers, each of which can independently serve the data to the client.
