Memory Wall? What Memory Wall?

by Holger Pirk, MIT CSAIL

Background

With current RAM prices at less than $10K for a Terabyte, in-memory data management is viable, even for “big” data – especially for big data. Given enough Variety and Veracity, however, analyzing a Terabyte of “big” data can be challenging. Often, you have to spend considerable time “exploring” your data before you can throw your sophisticated machine learning solution at it. This exploration is mostly interactive, making low response times essential. Luckily, modern CPUs can access memory-resident data at amazing speed: transferring your entire Terabyte of data from the RAM chips to the CPU takes about 4 (four!) seconds. The promise of response times like these is why you invest in an in-memory data management solution.
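(A quick sanity check on that number: moving 1 TB in about 4 seconds works out to roughly 250 GB/s, which is in the ballpark of the aggregate memory bandwidth of a modern multi-socket server.)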

However, you are unlikely to get this performance out of “classic” data management systems because they are… just not designed for it. They were designed around the concept of the Memory Wall. Data management researchers like to talk about the Memory Wall: we believe(d) that the performance of anything we do is limited by the time it takes to get data from primary storage (disk, flash, RAM, …) into the CPU. Unfortunately, it is not quite that simple when your data is memory-resident.

Pivoted Partitioning

[Figure: pivoted partitioning of a memory-resident array; values less than the pivot are clustered at the beginning.]

A little bit more than a year ago I had to learn that the hard way: I listened to a (former) colleague of mine give a talk about her tuning efforts on a technique called “Database Cracking”. The operation she was working on was what I’d call “pivoted partitioning”: take a pivot element from a memory-resident array and partition the array such that all values less than the pivot are clustered at the beginning (see the figure above). This operation is the core of Database Cracking (and, not entirely coincidentally, of quicksort).
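To make this concrete, here is a minimal sketch in C++ of what such a partition step looks like. This is my own illustration, not the implementation from the paper; it is essentially quicksort’s in-place partition step:

```cpp
#include <cstddef>
#include <utility>

// Minimal sketch of pivoted partitioning: reorder data[0..n) in place so
// that all values less than `pivot` come first. Returns the index of the
// first element >= pivot.
std::size_t pivoted_partition(int* data, std::size_t n, int pivot) {
    std::size_t lo = 0, hi = n;
    while (lo < hi) {
        if (data[lo] < pivot) {
            ++lo;                             // already on the correct side
        } else {
            std::swap(data[lo], data[--hi]);  // move it to the upper side
        }
    }
    return lo;  // boundary: data[0..lo) < pivot, data[lo..n) >= pivot
}
```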

Problem

She presented preliminary results of her efforts to parallelize the algorithm. She told us that she got a 10x speedup by parallelizing it using 16 threads (on a machine with 16 hardware threads). I immediately told her that this could not be right: an operation as simple as pivoted partitioning is (memory) bandwidth-bound and, being bandwidth-bound, it cannot benefit from parallelization. Parallelization increases CPU utilization but doesn’t increase memory bandwidth. I told her that, properly implemented and tuned, this problem could be (close to) bandwidth-bound with a single thread: a good performance engineer should look at it. Since all good performance engineers were busy at the time, I offered my help instead.

Approach

We spent a couple of weeks shooting an arsenal of tuning techniques at the problem: predicated selections, SIMD, vectorized processing, and subword processing. While some of these helped a lot, we were still far from saturating the bandwidth specified by the respective CPU vendors. So, I grudgingly had to accept the fact that I had been wrong: I couldn’t get this operation to be bandwidth-bound in a single thread – not even close. Not even on a cheap AMD 2-core machine. So, we had to parallelize.
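To give a flavor of one of these techniques, here is a sketch of a predicated (branch-free) partition step. It is my own simplified illustration, not the tuned code from the paper: the idea is to replace the hard-to-predict branch on the comparison with index arithmetic, trading a few extra stores for a pipeline that never stalls on mispredictions.

```cpp
#include <cstddef>

// Predicated, out-of-place partition: instead of branching on the
// comparison, turn it into an index update. Values < pivot fill `out`
// from the front, values >= pivot fill it from the back. Every element
// is written to both candidate slots; exactly one write survives.
std::size_t predicated_partition(const int* in, std::size_t n,
                                 int pivot, int* out) {
    if (n == 0) return 0;
    std::size_t lo = 0, hi = n - 1;
    for (std::size_t i = 0; i < n; ++i) {
        int v = in[i];
        bool less = v < pivot;  // 0 or 1, no branch required
        out[lo] = v;            // unconditional stores replace the branch
        out[hi] = v;
        lo += less;             // advance only the matching cursor
        hi -= !less;
    }
    return lo;  // out[0..lo) < pivot, out[lo..n) >= pivot
}
```

Note that this variant writes out-of-place: eliminating the branch is much easier when each value can be stored unconditionally to both candidate positions.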

The straightforward approach to parallelization for data-parallel operations is to partition the input and process each partition in a separate thread. We did exactly that and, in doing so, ended up saturating the memory bandwidth.
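In sketch form (again my own simplification, reusing the pivoted_partition routine from above): split the array into one chunk per thread, partition each chunk independently, and record each chunk’s local split point for the merge phase discussed next.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Data-parallel partitioning sketch: each thread partitions its own chunk.
// Afterwards every chunk is locally partitioned around `pivot`, but the
// less-than regions are still interleaved across chunks -- hence the need
// for a subsequent merge phase.
std::vector<std::size_t> parallel_partition(int* data, std::size_t n,
                                            int pivot,
                                            unsigned num_threads) {
    std::vector<std::size_t> boundaries(num_threads);
    std::vector<std::thread> threads;
    std::size_t chunk = n / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = (t + 1 == num_threads) ? n : begin + chunk;
        threads.emplace_back([&, t, begin, end] {
            boundaries[t] =
                begin + pivoted_partition(data + begin, end - begin, pivot);
        });
    }
    for (auto& th : threads) th.join();
    return boundaries;  // per-chunk split points, input to the merge
}
```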

Unfortunately, partitioning generally creates the need for a subsequent merge phase. For some problems, such as aggregation with distributive aggregation functions, the resulting merge is small. For others, such as sorting or Database Cracking, the merge can involve an additional scan over the entire data. We managed to significantly reduce the size of the merge by estimating partition sizes a priori, but we could not eliminate it entirely. In the worst case, our solution incurs an extraneous scan. This scan can be parallelized and is, therefore, usually bandwidth-bound itself. However, the whole pivoted partitioning process then takes two scans (one for partitioning, one for merging). That is one scan more than strictly necessary which, in my opinion, violates the definition of bandwidth-boundedness.
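To see where the merge cost comes from, consider the bookkeeping (a hypothetical helper of mine, not the algorithm from the paper): summing the per-chunk less-than counts yields the global boundary, and every element that a thread left on the wrong side of that boundary still has to be moved, in the worst case at the cost of another full scan.

```cpp
#include <cstddef>
#include <vector>

// Merge bookkeeping sketch: given each chunk's begin offset and the
// absolute split points returned by parallel_partition above, the total
// number of less-than elements tells us where the global boundary must
// end up after merging.
std::size_t global_boundary(const std::vector<std::size_t>& chunk_begins,
                            const std::vector<std::size_t>& boundaries) {
    std::size_t less_total = 0;
    for (std::size_t t = 0; t < chunk_begins.size(); ++t)
        less_total += boundaries[t] - chunk_begins[t];  // per-chunk count
    return less_total;  // after merging: data[0..less_total) < pivot
}
```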

The Takeaway

So what does all of that mean? Is there a Memory Wall or not? Well, there is, but you need to go multi-threaded to even see it on the horizon. It is possible to actually hit it once you parallelize and have a CPU-efficient implementation of your solution. Unfortunately, the world isn’t made up solely of embarrassingly parallel problems. Even in the realm of in-memory analytics, one runs into problems that aren’t trivial to parallelize efficiently. Coming up with efficient parallelization strategies for these is challenging. One is likely to hit the “Brain Wall.” This Brain Wall is what stands between us and truly interactive data management. Luckily, there are a number of smart people poking holes in the Brain Wall.

I believe a big (data) challenge will be to make sure that sub-second query response times actually translate into an interactive user experience. Currently, you have to choose: on the one hand, systems like D3 provide a very high level of interactivity but cannot deal with large amounts of data. On the other hand, systems like Spark scale to big data but can hardly be called interactive. We are currently looking into means to cross-breed them to make big data interactive.

If you are interested in the detailed results of our tuning efforts, check out the paper: Database Cracking: Fancy Scan, not Poor Man’s Sort!

Holger Pirk is a Postdoc in the Database Group at MIT CSAIL. He just finished writing his PhD thesis in the Database Architectures group at CWI in Amsterdam and will defend it on May 1st, 2015 in Amsterdam. He received his master’s degree (Diplom) in computer science at Humboldt-Universität zu Berlin in 2010. His research interests lie in analytical query processing on memory-resident data. In particular, he studies storage schemes and processing models for modern hardware.

