By Barzan Mozafari, University of Michigan, Ann Arbor*
Today, if you have a few terabytes of data stored on disk, even computing a simple average can take days. Of course, parallelism helps, but only to a point. Say you have 7.5 TB of data partitioned across a cluster of 100 Amazon EC2 machines and use Hive/Hadoop to compute your average. This cuts your run-time down to about 1.5 hours. See the red bars in Figure 1, where we have measured Hadoop’s run-time for this query on two different data sizes.
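To see why parallelism helps at all, note that an average decomposes cleanly into per-partition partial aggregates. The sketch below (plain Python, with small synthetic in-memory lists standing in for HDFS partitions, and with made-up helper names) mimics the map/reduce shape of that computation; in Hive/Hadoop the same pattern runs across machines rather than in one process.

```python
# Illustrative sketch only: a distributed average computed MapReduce-style.
# Each "worker" reduces its partition to a (sum, count) pair, and the final
# average is assembled from those pairs.

def partial_aggregate(partition):
    """Map phase: collapse one partition to a (sum, count) pair."""
    return sum(partition), len(partition)

def combine(partials):
    """Reduce phase: merge the partial pairs into the global average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

# 100 "machines", each holding a slice of the values 0..999.
partitions = [list(range(i, i + 10)) for i in range(0, 1000, 10)]
partials = [partial_aggregate(p) for p in partitions]
print(combine(partials))  # average of 0..999 -> 499.5
```

Because each partial pair is tiny, the only work that scales with data size is the per-partition scan, which is exactly what gets parallelized.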
But waiting 1.5 hours for the result of a simple average query is no fun. Exploratory data analytics calls for a more interactive experience: an analyst computes different statistics along different dimensions of the data and, depending on each result, slices and dices the data differently, modifying the next query accordingly. This is why waiting an hour for a simple average is unacceptable for many applications. Imagine a root-cause analysis scenario, for example: it is very difficult to establish the real cause if you have to wait an hour every time you want to test a new hypothesis. But can you do better than this?
Keeping Your Data In Memory?
In data analytics with Hadoop, it turns out that a good portion of your total time is spent reading data from disk. Obviously, if you cached your data in memory in advance, your latency would improve. This is the idea behind Spark, which is essentially an in-memory implementation of MapReduce. Of course, this only helps if your data fits in the aggregate memory of your cluster. With 100 EC2 machines, you get 6 TB of memory, so Spark would cut your processing time by an order of magnitude if you had, say, 2.5 TB of data. But it also means that Spark wouldn’t help much with your 7.5 TB of data: the simple query would still take 1.5 hours (see Figure 2).
So are you on your own if your data exceeds your memory size?
BlinkDB: Querying Very Large Data Sets in a “Blink-Time”
No, there’s still hope for you! This is why we’ve been building a new query processing system, called BlinkDB: Wouldn’t it be great if you could query tens of terabytes (or even petabytes) of data and still get your answer back in a “blink-time?” This was the goal that we set out for ourselves when we started the BlinkDB project.
The key observation that we make in BlinkDB is quite simple: in real life, there are many situations in which one can make perfectly reasonable decisions without having access to perfectly accurate answers!
For instance, in root-cause analysis, as long as you can “confidently compare” two possible causes, you can still reliably establish the actual cause without computing every statistic down to its last digit of precision. BlinkDB builds a “carefully chosen sample” of your original data and runs your query only on that sample. The result is obviously an approximate one, accompanied by a confidence interval. For example, when BlinkDB gives you an approximate answer of 20, it also guarantees that, with 95% confidence, the actual answer lies within ±1% of the approximate answer, namely in [19.8, 20.2].
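BlinkDB’s actual error-estimation machinery is described in our paper; to illustrate where such an interval comes from, here is a minimal sketch (not BlinkDB code, and with a made-up synthetic population) that estimates an average from a 1% uniform random sample and attaches a 95% confidence interval using the standard normal approximation (mean ± 1.96 standard errors):

```python
import math
import random

# Illustrative sketch only: attach a 95% confidence interval to a
# sample-based average. The population is synthetic (true mean ~ 20).
random.seed(0)
population = [random.gauss(20, 2) for _ in range(1_000_000)]

sample = random.sample(population, 10_000)  # look at only 1% of the data
n = len(sample)
mean = sum(sample) / n
var = sum((x - mean) ** 2 for x in sample) / (n - 1)  # sample variance
stderr = math.sqrt(var / n)                           # standard error of the mean

lo, hi = mean - 1.96 * stderr, mean + 1.96 * stderr
print(f"approximate answer: {mean:.2f}, 95% CI: [{lo:.2f}, {hi:.2f}]")
```

The key point is that the width of the interval is driven by the sample size, not by the size of the full data set.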
The real performance gain in BlinkDB comes from the law of large numbers. If you haven’t heard of this law in statistics, here is an informal version: most of the time, you can get a “highly accurate” estimate of a statistic by looking at only a “very small portion” of your entire data.
You can find more technical details in our paper, but to give you an idea of what this means in practice, let’s go back to our simple query above: the average. BlinkDB can give you an answer that’s 99% accurate (in other words, off by only 1%) within only 2 seconds! (See the yellow bars in Figure 3.)
Of course, you might say that I’m comparing apples and oranges: Hadoop and Spark give you exact answers, while BlinkDB only gives you an approximation. And you’re absolutely right. However, for many applications, trading 1% accuracy to make your query run 12x faster (compared to Spark, when your data fits in memory) or 200x faster (compared to both Hadoop and Spark, when your data is larger than your memory) is quite attractive!** Also, in many situations your original data itself comes with an intrinsic error (say, due to sensor readings, human mistakes, or other sources of error) larger than 1%.
There are a number of challenges that BlinkDB overcomes to deliver this performance: how to decide which parts of the original data to include in the samples, how to support a wide range of queries without the user having to declare them in advance, and how to know when an approximation is unreliable. If you’re an academic, or simply want to see under the hood, you can find the answers to these questions in our paper, which won this year’s EuroSys best paper award. But if you’re a practitioner with tons of data, and you don’t mind running your queries 200x faster (or using two orders of magnitude fewer machines for the same queries), then go ahead, install BlinkDB, and give it a try. We’ve released BlinkDB under an open-source license, so you don’t have to pay for it either.
BlinkDB is a collaboration between MIT CSAIL and UC Berkeley’s AMPLab (and soon, the University of Michigan, Ann Arbor, as I’ll be joining them as an assistant professor in the fall!). BlinkDB is developed by Sameer Agarwal (UC Berkeley), Barzan Mozafari (MIT CSAIL), Aurojit Panda (UC Berkeley), Henry Milner (UC Berkeley), Samuel Madden (MIT CSAIL), and Ion Stoica (UC Berkeley).
**Accounting and billing are obvious exceptions here.
*Barzan Mozafari is an assistant professor at the University of Michigan (Ann Arbor). Previously, he was a postdoctoral associate at the Massachusetts Institute of Technology. He earned his Ph.D. in Computer Science from the University of California, Los Angeles. He is passionate about building large-scale data-intensive systems, with a particular interest in database-as-a-service clouds, distributed systems, and crowdsourcing. In his research, he draws on advanced mathematical models to deliver practical database solutions. He has won several awards and fellowships, including the SIGMOD 2012 and EuroSys 2013 best paper awards.