Building a New Application-to-Hardware Management Stack for Big Data

The partners in the Intel Science and Technology Center for Big Data – Intel Labs and its seven participating academic research institutions – recently agreed unanimously to focus their research on a pioneering effort: creating a new application-to-hardware big-data management stack. Unique in industry and academia, the stack will combine several tools created by individual ISTC data management research projects with hardware innovations from Intel Labs and the universities, linked by a new, to-be-developed task management layer. Here, Ted Benson of MIT CSAIL provides some early insight into this ambitious project, called Big Dawg.

At the recent Intel Science and Technology Center for Big Data annual Research Retreat in Hillsboro, Oregon, Michael Stonebraker of MIT CSAIL kicked off a series of talks about Big Dawg, the Big Data Working Group.

The group is beginning to think about the bigger picture: how to integrate the variety of tools being built across the ISTC to address the challenges of big data.

Consider the medical data field, for example, which has both an enormous diversity of data (waveform data, text, relational data) and a mix of real-time data (medical devices) and slow-moving data (genomes, medical histories). Since multiple tools will be necessary to handle this diversity of data, a larger integrative layer is needed to bring them together.

The group suggests approaching the problem with what might be called the LLVM of Big Data: a system stack with a “narrow waist” that acts as the universal, integrative layer. At the bottom of this stack sit hardware, analytic libraries, and databases; at the top sit programming languages, visualization, and presentation tools. The narrow waist in the middle is a future Big Dawg Query Language (BQL) and compiler, which provides a representational format for data tasks and can translate those tasks into work within the bottom half of the stack.

Big Dawg is a new application-to-hardware big-data management stack. It will combine several tools created by individual ISTC data management research projects with hardware innovations from Intel Labs and the universities, linked by a new, to-be-developed task management layer. (Source: MIT CSAIL.)
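
To make the narrow-waist idea more concrete, here is a minimal Python sketch of the dispatch pattern such a layer implies. BQL and its compiler do not exist yet, so every name below (Task, SqlEngine, ArrayEngine, dispatch, and the catalog contents) is a hypothetical illustration, not the actual design:

```python
# A minimal, hypothetical sketch of the "narrow waist": one backend-neutral
# task representation that a compiler/dispatcher routes to different engines.
# All names and datasets here are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class Task:
    """One unit of work expressed in a backend-neutral form."""
    operation: str   # e.g. "filter"
    dataset: str     # logical name; the waist maps it to a backend
    params: dict


class SqlEngine:
    def run(self, task: Task) -> str:
        # Translate the neutral task into SQL for a relational store.
        return f"SELECT * FROM {task.dataset} WHERE {task.params['predicate']}"


class ArrayEngine:
    def run(self, task: Task) -> str:
        # Translate the same task into an array-database expression.
        return f"filter({task.dataset}, {task.params['predicate']})"


# The "waist": a registry that knows which engine owns which dataset.
CATALOG = {"patients": SqlEngine(), "waveforms": ArrayEngine()}


def dispatch(task: Task) -> str:
    """Compile a neutral task into backend-specific work."""
    return CATALOG[task.dataset].run(task)


if __name__ == "__main__":
    print(dispatch(Task("filter", "patients", {"predicate": "age > 65"})))
    print(dispatch(Task("filter", "waveforms", {"predicate": "hr > 120"})))
```

The point of the pattern is that applications above the waist express work once, in one representation, while everything below the waist stays free to use whatever engine fits the data.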

This approach has architectural allure, but it raises several non-trivial problems, which the ISTC group is currently investigating:

  1. How can BQL unify (or paper over) the sometimes conflicting semantics of array-based and table-based systems? For example, deleting a cell in an array simply fills it with an N/A value, leaving the array’s cardinality unchanged, whereas deleting a row in a table does change the cardinality. If the two systems are interoperating, this difference can produce data inconsistency between them. (A toy illustration follows this list.)

  2. How do we deal with real-time systems? The mechanisms for handling streaming data tend to produce system designs very different from those of warehouse-style systems. Can the two be brought under the same unifying layer?

  3. How can analytic packages be integrated? On the surface, analytic tools might look similar to databases like SciDB: both use arrays organized into tiles, for example. But peel back the layers and you find they operate at entirely different scales: one uses small, dense, in-memory formats, while the other uses large, compressed, sparse formats. So even though the two appear to have compatible representations on the surface, naive integration can kill performance. (The second sketch below gives a rough sense of the mismatch.)
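
To see the delete-semantics mismatch from item 1 concretely, here is a toy Python illustration. It models the generic array-versus-table behavior described above, not the specifics of SciDB or any particular engine:

```python
# Toy illustration of diverging delete semantics (assumed generic behavior).

NA = None  # stand-in for an array system's N/A value


def delete_array_cell(array, index):
    """Array-style delete: mark the cell N/A; cardinality is unchanged."""
    result = list(array)
    result[index] = NA
    return result


def delete_table_row(rows, index):
    """Table-style delete: remove the row; cardinality drops by one."""
    return rows[:index] + rows[index + 1:]


array = [10, 20, 30]
table = [(1, 10), (2, 20), (3, 30)]

after_array = delete_array_cell(array, 1)
after_table = delete_table_row(table, 1)

print(after_array, "-> cardinality still", len(after_array))
# [10, None, 30] -> cardinality still 3
print(after_table, "-> cardinality now", len(after_table))
# [(1, 10), (3, 30)] -> cardinality now 2
```

A unifying layer has to decide which of these two behaviors a cross-system delete means, and keep both engines consistent with that choice.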
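
And here is a rough back-of-the-envelope sketch of the scale mismatch from item 3. The tile dimensions, occupancy, and per-cell byte costs are illustrative assumptions, not measurements of any real system:

```python
# Back-of-the-envelope sketch: naively moving a sparse tile into a dense
# in-memory format inflates its footprint by roughly 1/density.


def dense_bytes(n_rows, n_cols, cell_bytes=8):
    """Dense layout stores every cell, occupied or not."""
    return n_rows * n_cols * cell_bytes


def sparse_bytes(n_nonzero, cell_bytes=8, index_bytes=8):
    """Coordinate-style sparse layout stores only occupied cells plus indices."""
    return n_nonzero * (cell_bytes + 2 * index_bytes)


# Hypothetical tile: 1M x 1M cells, 0.001% occupied.
rows = cols = 1_000_000
nonzero = int(rows * cols * 1e-5)

print(f"dense:  {dense_bytes(rows, cols) / 1e12:.1f} TB")  # ~8.0 TB
print(f"sparse: {sparse_bytes(nonzero) / 1e9:.1f} GB")     # ~0.2 GB
```

Numbers like these are why a representation that looks compatible on paper can still be ruinous to exchange across the waist without conversion-aware planning.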

Watch the ISTC for Big Data blog for updates on the progress of Big Dawg.
