Solution: The authors developed Chukwa on top of Hadoop as an implementation of such a data collection system. The goal was to be able to monitor Hadoop clusters of 2000 nodes, outputting 5 to 6 MB of data per second, and to have collected data available for processing within ten minutes.
At the heart of any data collection system is a pipeline to pump data from where it is generated to where it is stored. Unfortunately, HDFS is not designed for the sort of workloads associated with monitoring. HDFS aims to handle large files and high write rates from comparatively small numbers of writers. It is not designed for thousands of concurrent low-rate writers and millions of small files.
Much of the Chukwa design was driven by the need to reconcile our many sporadic data sources with HDFS’s performance characteristics and semantics. The data sources are called adaptors. Rather than have each adaptor write directly to HDFS, data is sent across the network to a collector process that performs the HDFS writes. Each collector receives data from several hundred hosts and writes all of it to a single sink file. Collectors thus drastically reduce the number of HDFS files generated by Chukwa, from one per machine or adaptor per unit time to a handful per cluster. The decision to put collectors between data sources and the data store has other benefits: collectors hide the details of the HDFS file system in use, such as its Hadoop version, from the adaptors.
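The fan-in described above can be sketched in miniature. The class and method names below are illustrative only, not the actual Chukwa API, and an in-memory buffer stands in for an HDFS sink file; the point is that many adaptors funnel through one collector, which produces a single tagged sink stream rather than one file per source.

```python
import io

class Collector:
    """Hypothetical sketch: aggregates chunks from many adaptors into
    one sink file, so the file system sees a handful of writers
    instead of thousands of low-rate ones."""

    def __init__(self, sink):
        self.sink = sink  # a single writable sink file

    def receive(self, host, data):
        # Tag each chunk with its origin and length so downstream
        # processing can demultiplex the shared sink file.
        self.sink.write(f"{host}\t{len(data)}\t{data}\n")

# Usage: several hosts share one sink file via the collector.
sink = io.StringIO()  # stands in for an HDFS sink file
collector = Collector(sink)
for host in ("node1", "node2", "node3"):
    collector.receive(host, "cpu=0.42")

# The sink now holds one tagged record per chunk received.
print(sink.getvalue().count("\n"))
```

Because each collector appends everything it receives to one sink file, the number of open HDFS files scales with the number of collectors, not with the number of monitored hosts.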