Problem: Many large websites have thousands to hundreds of thousands of machines across many geographical dispersed datacenters. These machines need to send log data to a central repository for analysis.
Solution: Scribe is Facebook's scalable logging system. Scribe is a server for aggregating streaming log data. It is designed to scale to a very large number of nodes and be robust to network and node failures. There is a scribe server running on every node in the system, configured to aggregate messages and send them to a central scribe server (or servers) in larger groups. If the central scribe server isn't available the local scribe server writes the messages to a file on local disk and sends them when the central server recovers. The central scribe server(s) can write the messages to the files that are their final destination.
The next step is to do something useful with the data from the logging system. Identifying log messages with a category effectively presorts the data, allowing it to be logged to separate files based on that category.
Hive is Facebook's dataharehouse and the output of Scribe. Hive makes it easy for business analysts to ask ad hoc questions of terabytes worth of logfile data by abstracting MapReduce into a SQL like dialect.