No Hadoop: The Future of the Hadoop/HDFS Stack

by Michael Stonebraker, MIT CSAIL

There has been a collection of recent announcements about DBMS capabilities in the so-called Hadoop stack.  To be clear, this is a three-tier architecture with HDFS (a file system) at the bottom, Hadoop (the open source version of MapReduce) above that, and a Hive or Pig executor at the top level.  This stack originated with Yahoo, which wrote an open-source version of Google’s MapReduce and layered it on top of HDFS, an open-source version of the Google File System (GFS).  Later Pig (and especially Hive) were added as a query specification layer.

It’s well known in DBMS circles that this architecture offers poor performance, except where the application is “embarrassingly parallel.”  The reasons are many and include the lack of indexing, very inefficient joins in the Hadoop layer, sending the data to the query in the HDFS layer rather than sending the query to the data, and poor compression.  The proponents of this stack (Hortonworks, Cloudera, Facebook) have all realized this, and all have announced or are shipping a “No Hadoop” DBMS.  Here, the MapReduce layer is eliminated, and Hive (or SQL) is processed by a fairly conventional parallel DBMS running directly on HDFS.

Put differently, parallel SQL DBMSs have been on the market for a couple of decades and process SQL against a file system storing the data.  The Hadoop stack first expanded to include an SQL layer, and has now evolved to remove the MapReduce layer in favor of “innards” that closely resemble the current commercial parallel DBMSs.  The reason is straightforward:  the MapReduce layer is too inefficient for production use.

In my opinion, this “performance transition” is now in mid-flight, and a second phase will occur next.  The Hadoop crowd is learning DBMS technology “on the job” and will discover the second performance culprit in short order, namely HDFS.  Since it hides data location from the DBMS, there is no way to reliably “send the query to the data” as is done in all current parallel DBMSs.  Instead you end up bringing the data to the query, which is much less efficient.  If one is executing a file system read, then remote I/O on our configuration is a factor of 50 slower than local I/O.  Of course, DBMSs use substantial computer resources, so the performance difference will be smaller.
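The gap between shipping the query and shipping the data is easy to see with a back-of-the-envelope sketch.  The table size, selectivity, and result size below are illustrative assumptions, not measurements from this post:

```python
# Illustrative, assumed numbers -- not measurements from the post.
TABLE_BYTES = 10 * 1024**3      # a 10 GB table
SELECTIVITY = 0.01              # assume the filter keeps 1% of the bytes
AGG_RESULT_BYTES = 1024         # a tiny aggregate result

def bytes_moved_query_to_data():
    """The scan runs on the node holding the data; only the final
    aggregate crosses the network."""
    return AGG_RESULT_BYTES

def bytes_moved_data_to_query():
    """At best the filtered bytes (at worst the whole table) must
    cross the network to reach the node running the query."""
    return int(TABLE_BYTES * SELECTIVITY)

ratio = bytes_moved_data_to_query() / bytes_moved_query_to_data()
```

Even under this generous 1% selectivity assumption, shipping the data moves roughly five orders of magnitude more bytes over the network than shipping the query.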

A Simple Experiment

To quantify the performance difference in a DBMS context, we ran the following simple experiment using the Vertica DBMS.  Specifically, we executed two queries on a pair of tables, one about 10 Gbytes and the other 50 Mbytes.  The first filtered the larger table and ran an aggregate; the second filtered the larger table, joined it to the smaller table, and ran an aggregate on the result.  Both are typical of the workload in decision support systems.  With a cold cache, the pair of queries was a factor of 12 slower when run against a remote disk versus a local disk.  Of course, if the cache is warmer and the tables are smaller, the performance difference decreases, but in no case was it smaller than a factor of 2.
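The post does not give the actual schemas or SQL; a minimal sketch of the two query shapes, run on toy tables in SQLite rather than Vertica (table and column names are assumptions), might look like:

```python
import sqlite3

# Toy stand-ins for the 10 GB fact table and the 50 MB dimension table.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE big   (id INTEGER, dim_id INTEGER, amount REAL);
    CREATE TABLE small (dim_id INTEGER, region TEXT);
""")
con.executemany("INSERT INTO big VALUES (?, ?, ?)",
                [(i, i % 3, float(i)) for i in range(1000)])
con.executemany("INSERT INTO small VALUES (?, ?)",
                [(0, "east"), (1, "west"), (2, "north")])

# Query 1: filter the larger table, then aggregate.
(q1,) = con.execute(
    "SELECT SUM(amount) FROM big WHERE amount < 100").fetchone()

# Query 2: filter the larger table, join it to the smaller table,
# and aggregate the result.
rows = con.execute("""
    SELECT s.region, SUM(b.amount)
    FROM big b JOIN small s ON b.dim_id = s.dim_id
    WHERE b.amount < 100
    GROUP BY s.region
""").fetchall()
```

In a parallel DBMS both shapes fan out across nodes; the point of the experiment is how much their cost depends on whether each node reads its partition from a local or a remote disk.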

The second aspect of HDFS concerns replication.  Current HDFS-based DBMSs typically use HDFS replication for high availability and perhaps increased performance.  However, HDFS does bit-wise replication of data objects.  In contrast, parallel DBMSs (Vertica, for example) often keep replicas in different sort orders or with different storage organizations, so that the query optimizer can choose the replica that offers the best performance for a given query.  A moment’s reflection should convince the reader that N logical replicas offer up to another factor of N performance gain relative to physical replicas.  Returning to our simple benchmark, Vertica ran it 1.8 times faster with two replicas in different sort orders than with two replicas in the same sort order.
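As a toy illustration of the idea (the data and column layout are assumptions, and real column stores do this far more elaborately): with the same rows kept in two sort orders, a range predicate gets an index-like binary-search access path on whichever replica is sorted on its column, while a bit-wise copy could only be scanned.

```python
import bisect

# The same 1000 rows stored twice, each replica sorted on a different column.
rows = [(i * 7 % 1000, i) for i in range(1000)]        # (a, b) pairs
replica_by_a = sorted(rows, key=lambda r: r[0])
replica_by_b = sorted(rows, key=lambda r: r[1])

def range_scan(replica, key_index, lo, hi):
    """Binary-search a replica sorted on key_index for lo <= key < hi,
    instead of scanning all rows."""
    keys = [r[key_index] for r in replica]
    return replica[bisect.bisect_left(keys, lo):bisect.bisect_left(keys, hi)]

# The optimizer routes a predicate on column a to replica_by_a, and a
# predicate on column b to replica_by_b.
hits_a = range_scan(replica_by_a, 0, 10, 20)
hits_b = range_scan(replica_by_b, 1, 10, 20)
```

With N replicas in N different sort orders, N different predicate columns each get such an access path, which is the source of the up-to-factor-of-N argument.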

Additional Performance Opportunities

In sum, another order-of-magnitude performance gain can be obtained from logical replication and DBMS-aware partitioning of the data.  This is too large a number for production DBMS applications to ignore.  Hence, the second phase of Hadoop DBMS improvement will be to replace the distributed HDFS file system with one that allows higher-level software (including DBMSs) to take advantage of data location.  In addition, logical replication will replace the HDFS-supplied physical replication.
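A minimal sketch of what DBMS-aware partitioning buys (table shapes and names are assumptions): hash both tables on the join key so that matching rows land on the same node, making the join purely local.  This is exactly the property that HDFS's location-hiding takes away.

```python
# Hash-partition both sides of a join on the join key so each node can
# join its own partitions with no cross-node data movement.
NODES = 4

def partition(rows, key):
    parts = [[] for _ in range(NODES)]
    for row in rows:
        parts[hash(row[key]) % NODES].append(row)
    return parts

big   = [{"dim_id": i % 10, "amount": float(i)} for i in range(100)]
small = [{"dim_id": d, "region": f"r{d}"} for d in range(10)]

big_parts   = partition(big, "dim_id")
small_parts = partition(small, "dim_id")

# Each node joins only its local partition pair; because both tables were
# hashed on dim_id, no matching rows live on different nodes.
joined = [
    (b["amount"], s["region"])
    for bp, sp in zip(big_parts, small_parts)
    for b in bp for s in sp
    if b["dim_id"] == s["dim_id"]
]
```

When the file system hides which node holds which block, the DBMS cannot guarantee this co-location and must fall back to shuffling one side of the join across the network.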

At this point, Hadoop DBMSs will have morphed to exactly mimic current parallel data warehouse products.  Of course, these phases will take a while to play out and will probably cause considerable angst for large-scale users of the current three-tier stack.


This entry was posted in Big Data Architecture, Databases and Analytics, DBMS, and the ISTC for Big Data Blog.
