Pete's Thoughts on Cloud Computing: Parallel Database Systems

Wednesday, October 7, 2009

Parallel Database Systems

Parallel Database Systems: The Future of High Performance Database Systems
by David DeWitt and Jim Gray

I remember reading this paper years ago. Reading it again next to having read the MapReduce paper put the parallel DB mindset into a modern view. Anyone who has read the MapReduce paper should read this article.
MapReduce summary

These papers give perspective on how the problem can expand and be attacked more than a decade into the future. The Parallel DB paper goes over some barriers to speedup: startup, interference, skew. The scale of the problem and needed parallelism has increased dramatically since the parallel DB article was written (multi-PB vs multi-TB). There was also great foresight in this paper concerning the use of local computation on commodity processors and memory (shared-nothing architecture).

MapReduce has attempted to formalize and attach these problems and remove them from the input given by the requester. MapReduce can be thought of as a type of parallel SQL. MapReduce attacks each of the barriers to speedup: (see MapReduce summary)
Startup: MapReduce has streamlined a general purpose.
Interference: The MapReduce workers are distributed evenly enough and as close as possible to the data that must be operated upon.
Skew: MapReduce deals with the straggler program by restarting slow jobs.

Analysis: The Parallel DB paper tried to look ahead at the evolution of high performance DB systems. The authors did not, and could not have foreseen the exponential expansion of the internet. The scale of demand upon DB systems requires a change in mindset. Traditionally, the speed of RDMS has been increased in the background while maintaining all constraints. Today, we accept that there fundamental tradeoffs (see CAP Theorem). By loosening some of the constraints (such as strict consistency), we can achieve the needed performance provided by modern distributed storage systems.