Our Research | Focus Tasks

To accelerate our research, we’ve identified eight research problems as priorities.  Each problem addresses one or more of our research themes.

We’ve designated each problem as a Focus Task for Year One.  For each task, we have assigned a group of investigators, identified and allocated additional resources, and put the task under the direction of either Mike Stonebraker or Sam Madden.

Focus Task | Research Theme
Proofs of Concept | Big Data Databases and Analytics
Scalable Visual Interfaces | Big Data Visualization
Graphs in DBMSes | Big Data Architecture
ScaLAPACK and DBMS | Big Data Math and Algorithms; Big Data Architecture
DBMS Functionality | Big Data Architecture
Main Memory DBMS Issues | Big Data Architecture
Moving Database Functionality into Hardware | Big Data Architecture
Scalable Algorithms | Big Data Math and Algorithms

Focus Task:  Proofs of Concept

The Goal:   Enable predictive analytics and other sophisticated kinds of analyses on large-scale, very complex data sets.

The Plan:  We are developing proofs of concept (PoCs) for application needs in three vertical industries, including assembling test data and challenge problems for each application.  For each PoC, we will bring up the data and application on the project Big Data server.

The data:  Anonymized patient record data for patient care and research

Clinical researchers, physicians and institutions need to be able to collect, correlate and analyze patient data at scale, so they can better predict patient outcomes, spot inefficiencies in the medical system, and identify more effective therapies.  However, the complexity of the data outstrips the capabilities and capacity of traditional relational databases.  The data is usually highly heterogeneous, involving text data (doctors’ or nurses’ notes), signal data (lab data or telemetry from patient monitoring equipment), and imagery (x-rays or scans).  Computation involves predictive analytics using complicated correlations of data.  By better understanding this multi-modal data and its demands, we hope to enable tools for faster, easier analysis of patient data that can save money, time and lives.

The data:  Telescope/satellite imagery data

We plan to bring up a year’s worth of satellite imagery (e.g., from NASA’s MODIS instrument) for the whole globe.  Our goal is to develop new tools for creating derived data products from the raw data, rather than requiring scientists to use existing data exports that NASA provides.   We plan to add to this data a collection of other kinds of data (such as ground-based sensor data, ocean-based sensor data, and predictive simulation model output). We hope to empower scientists to correlate multi-modal data to perform better science at scale with much less effort.

We chose these three application areas because they represent different kinds of Big Data problems, and because we have expertise in these areas.

Task Leaders:  Mike Stonebraker, Sam Madden

Task Team:  James Frew, Bill Howe

Focus Task:  Scalable Visual Interfaces

The Goal:   Develop scalable visualization systems that make it easier for people to view and manipulate Big Data applications, ideally without expensive specialty hardware.

The Plan:  We assume that the application is a visualization system connected to a database system.  Visualization systems may analyze a combination of data types:  numerical, categorical, time, geophysical, matrix and network.  They are also interactive, both in how users specify searches and in how they explore the data (zoom, linked selection, etc.).  Hence, the user interface will submit queries that may return very large amounts of information.

The challenge is three-fold: (1) figure out how to use screen real estate more effectively; (2) figure out how to reduce the amount of data returned from a query when it would overwhelm the rendering system; and (3) develop pre-fetch and cache management strategies to decrease response time.  We will then implement prototypes on the proof-of-concept (PoC) databases.
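One common way to approach challenge (2) is to aggregate query results into screen-sized bins before they reach the renderer, so the payload is bounded by the number of pixels or cells rather than by the number of rows. The following is a minimal sketch of that idea only; it is not the project's method, and the function name bin_points and its parameters are hypothetical. It assumes 2-D points and a viewport with non-zero width and height.

```python
from collections import Counter


def bin_points(points, x_range, y_range, width, height):
    """Aggregate (x, y) points into a width x height grid of counts.

    Hypothetical helper for illustration. The result never holds more than
    width * height entries, however many input points there are.
    Assumes x_range and y_range each span a non-zero interval.
    """
    x_min, x_max = x_range
    y_min, y_max = y_range
    counts = Counter()
    for x, y in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # point lies outside the visible viewport
        # Map each coordinate to a cell index; clamp the upper edge
        # into the last cell so x == x_max does not fall off the grid.
        col = min(int((x - x_min) / (x_max - x_min) * width), width - 1)
        row = min(int((y - y_min) / (y_max - y_min) * height), height - 1)
        counts[(col, row)] += 1
    return dict(counts)


# Four input points, one of them off-screen, reduced to two grid cells.
sample = [(0.1, 0.1), (0.15, 0.12), (0.9, 0.9), (2.0, 2.0)]
print(bin_points(sample, (0, 1), (0, 1), 2, 2))  # {(0, 0): 2, (1, 1): 1}
```

In a real system the binning would more plausibly be pushed into the database as a grouped aggregate, so that only the per-cell counts cross the network.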

Task Leader:  Mike Stonebraker

Task Team:  James Frew, Jeffrey Heer, Bill Howe, David Laidlaw, Stan Zdonik

Focus Task:  Graphs in DBMSes

The Goal:  Explore DBMS support for graph-structured problems, which continuously correlate and analyze the changing state of many related items.

The Plan:  We want to come up with a very efficient system for analyzing large graphs, or multi-dimensional collections of related data.  Classic examples of graph-structured problems include analyzing a set of web pages and the links between them to determine PageRank, or analyzing a user’s social network to determine which friends, products, or ads to show to that user.  Such tasks involve running complex algorithms on top of graphs.

The challenge is that these kinds of analyses, while extremely powerful and valuable, do not fit easily into traditional table-oriented relational DBMSes.  We are looking at several possible representations of graphs, with the goal of developing faster, more scalable ways to analyze them.

Specific efforts include a sparse-matrix representation of graphs (based on work by Jeremy Kepner) and GraphLab (based on work by Carlos Guestrin), which decomposes iterative graph algorithms into programs that run asynchronously at each vertex, reading the state of their neighbors and updating their own state.  Specific projects include integrating such sparse-matrix representations into GraphLab, as well as extensions to those frameworks to build applications that scale to massive data sets that exceed the RAM of even a cluster of machines.
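To make the vertex-program idea concrete, here is a minimal sketch of PageRank written as a per-vertex update that reads the state of a vertex's in-neighbors and writes its own state. It does not use the GraphLab API, and it runs the updates synchronously in a single process rather than asynchronously across machines; the function name pagerank, the damping factor of 0.85 and the fixed iteration count are assumptions made for illustration.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Compute PageRank for a graph given as {vertex: [linked vertices]}.

    Illustrative only: a synchronous, single-process loop, not GraphLab.
    """
    # Collect every vertex, including ones that only appear as link targets.
    vertices = set(out_links)
    for targets in out_links.values():
        vertices.update(targets)

    # Build the reverse adjacency so each vertex can read its in-neighbors.
    in_links = {v: [] for v in vertices}
    for src, targets in out_links.items():
        for dst in targets:
            in_links[dst].append(src)
    out_degree = {v: len(out_links.get(v, [])) for v in vertices}

    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Rank held by vertices with no out-links is spread evenly over all.
        dangling = sum(rank[v] for v in vertices if out_degree[v] == 0)
        new_rank = {}
        for v in vertices:
            # The "vertex program": read neighbor state, update own state.
            incoming = sum(rank[u] / out_degree[u] for u in in_links[v])
            new_rank[v] = (1 - damping) / n + damping * (incoming + dangling / n)
        rank = new_rank
    return rank


if __name__ == "__main__":
    web = {"a": ["b", "c"], "b": ["c"]}  # "c" has no outgoing links
    for page, score in sorted(pagerank(web).items()):
        print(page, round(score, 3))
```

The same update can be read as a repeated sparse matrix-vector product over the link matrix, which is the connection to the sparse-matrix representation mentioned above.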

Task Leader:  Sam Madden

Task Team:  Carlos Guestrin, Jeremy Kepner, Dave Maier, Sam Madden, Mike Stonebraker, Stan Zdonik

Focus Task:  ScaLAPACK and DBMS

The Goal:   Investigate tight integration of array DBMS with ScaLAPACK, to create a faster, more efficient way to analyze very large, complex data sets.

The Plan:   Array-model DBMSes store data in vectors rather than tables, which enables them to efficiently handle complex analyses on large data sets.  ScaLAPACK is a library of high-performance linear algebra routines for parallel distributed-memory machines.  The two systems are a natural combination for Big Data; however, they take different approaches to storage management, persistence, crash recovery, threading, and communication.  We’ll investigate how to resolve these differences and, if possible, build a tightly integrated pilot system that’s greater than the sum of its two parts.  The two systems also have different ideas about dynamic resource allocation.  We will try to resolve those differences, build a composite resource management system and, if possible, integrate it with other software systems such as Hadoop.
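One concrete instance of the storage-management mismatch is data layout. ScaLAPACK distributes matrices across processes in a block-cyclic pattern, whereas an array store might keep each region of an array together on one node. The sketch below contrasts the two mappings in one dimension. The function names are hypothetical, the contiguous scheme is an assumed stand-in for an array DBMS layout rather than a description of any specific system, and the block-cyclic mapping assumes the distribution starts at process 0.

```python
def block_cyclic_owner(i, block, nprocs):
    """Return (process, local_index) for global index i under a
    1-D block-cyclic layout with the given block size."""
    block_id = i // block
    # Blocks are dealt out round-robin; each process stacks its blocks locally.
    return block_id % nprocs, (block_id // nprocs) * block + i % block


def contiguous_owner(i, n, nprocs):
    """Return (process, local_index) when n elements are split into
    nprocs contiguous chunks (assumed array-store style layout)."""
    chunk = -(-n // nprocs)  # ceiling division
    return i // chunk, i % chunk


# Eight elements on two processes: the same global index lands in
# different places under the two schemes.
for i in range(8):
    print(i, block_cyclic_owner(i, block=2, nprocs=2), contiguous_owner(i, 8, 2))
```

Moving data between two such layouts is a redistribution step, which is one reason a tight integration has to reconcile storage decisions rather than simply call one system from the other.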

Task Leader:  Mike Stonebraker

Task Team:   Magda Balazinska, Donghui Zhang, Jack Dongarra, Alan Edelman, Jeremy Kepner

Focus Task: DBMS Functionality

The Goal:   Evolve array databases for Big Data, by overcoming limitations in storage management and query languages.

The Plan:  We are investigating storage management for array databases, specifically how to “chunk” an array into storage blocks.  We will look at both fixed-size and variable-size schemes. We will also look at schemes to deal with overlapping chunks, as well as cases with extreme skew between the various regions of an array.
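As an illustration of the fixed-size scheme with overlap, the sketch below computes, for each chunk of an n-dimensional array, the core region the chunk owns and the larger stored region that includes a border of cells copied from neighboring chunks. The function name chunk_bounds and its parameters are hypothetical, and a single overlap width shared by all dimensions is an assumption made to keep the example short.

```python
from itertools import product


def chunk_bounds(shape, chunk, overlap=0):
    """Yield (core, stored) boxes for every fixed-size chunk of an array.

    Each box is a tuple of (start, stop) pairs, one per dimension, with
    stop exclusive. 'core' is the region the chunk owns; 'stored' extends
    it by 'overlap' cells on each side, clipped to the array bounds.
    """
    per_dim = []
    for size, step in zip(shape, chunk):
        ranges = []
        for start in range(0, size, step):
            stop = min(start + step, size)  # last chunk may be short
            stored = (max(start - overlap, 0), min(stop + overlap, size))
            ranges.append(((start, stop), stored))
        per_dim.append(ranges)

    # Combine the per-dimension ranges into n-dimensional boxes.
    for combo in product(*per_dim):
        core = tuple(c for c, _ in combo)
        stored = tuple(s for _, s in combo)
        yield core, stored


# A 5 x 4 array cut into 2 x 2 chunks with a one-cell overlap: 6 chunks.
for core, stored in chunk_bounds((5, 4), (2, 2), overlap=1):
    print("core", core, "stored", stored)
```

With overlap, an operation that needs a cell's near neighbors can run on one chunk without fetching adjacent chunks, at the cost of storing border cells more than once. A variable-size scheme would replace the fixed step with split points chosen from the data density, which is what the skewed case calls for.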

We are also investigating query languages for arrays and looking for standards that can be leveraged among the various array implementations.  We will investigate the useful primitives that should be present in any array DBMS.  We will look into sophisticated aggregation and windowing primitives, and investigate high-performance implementations of common array operations such as rank and median.
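To show what a windowing primitive computes, here is a deliberately naive sketch of a sliding-window aggregate over a one-dimensional array, using median as the default aggregate. It is a reference for the semantics only, not a high-performance implementation: it re-sorts every window, and the function name window_aggregate and the choice to truncate windows at the array edges are assumptions.

```python
from statistics import median


def window_aggregate(values, radius, agg=median):
    """Apply agg to the window [i - radius, i + radius] around each cell.

    Windows are truncated at the array edges. 'values' must support
    slicing (for example, a list).
    """
    n = len(values)
    return [
        agg(values[max(i - radius, 0):min(i + radius + 1, n)])
        for i in range(n)
    ]


readings = [1, 9, 2, 8, 3]
print(window_aggregate(readings, 1))           # [5.0, 2, 8, 3, 5.5]
print(window_aggregate(readings, 1, agg=max))  # [9, 9, 9, 8, 8]
```

A faster implementation would update the window incrementally as it slides instead of recomputing each window from scratch.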

Task Leader:  Mike Stonebraker

Task Team:  Magda Balazinska,  Ugur Cetintemel, Sam Madden, Dave Maier, Andy Pavlo, Stan Zdonik

Focus Task:  Main Memory DBMS Issues

The Goal:   Identify ways to analyze data quickly, even if it doesn’t fit into memory.

The Plan:  This task will start with the ongoing anti-cache project involving VoltDB or H-Store.  Here, we run a main-memory DBMS and evict very cold records to an archive, which is currently disk.  We will extend this work to evict data to block-structured flash devices (e.g., Fusion-io cards) and/or simulators for Phase Change Memory (PCM) and other non-volatile storage technologies on the horizon.  We will compare the results to a main-memory-only system and to a conventional disk-based system such as MySQL.  We plan to extend this work further by comparing it with a main-memory-only system extended with flash and by trying it on a PCM simulator.  Other issues we plan to address include how transactional memory could be used to help with DBMS recovery and how to take advantage of upcoming Manycore processors in main-memory database applications.
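The basic shape of the approach, in which memory is the primary store and cold records are pushed out to an archive, can be sketched as follows. This is not code from VoltDB or H-Store. The class name AntiCacheStore is hypothetical, least-recently-used order is an assumed stand-in for whatever coldness policy the project uses, and a plain dictionary stands in for the disk or flash archive.

```python
from collections import OrderedDict


class AntiCacheStore:
    """Toy main-memory store that evicts its coldest records to an archive."""

    def __init__(self, capacity, archive=None):
        self.capacity = capacity  # maximum number of records kept in memory
        self.hot = OrderedDict()  # in-memory records, coldest first
        # Stand-in for disk or flash; any dict-like object would do.
        self.archive = {} if archive is None else archive

    def put(self, key, record):
        self.archive.pop(key, None)  # a new write supersedes an archived copy
        self.hot[key] = record
        self.hot.move_to_end(key)    # mark as most recently used
        self._evict_cold()

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key]
        # Not in memory: bring the record back from the archive.
        record = self.archive.pop(key)  # raises KeyError for unknown keys
        self.hot[key] = record
        self._evict_cold()
        return record

    def _evict_cold(self):
        # Move least-recently-used records out until memory fits again.
        while len(self.hot) > self.capacity:
            cold_key, cold_record = self.hot.popitem(last=False)
            self.archive[cold_key] = cold_record


store = AntiCacheStore(capacity=2)
store.put("a", 1)
store.put("b", 2)
store.get("a")     # "a" is now hotter than "b"
store.put("c", 3)  # memory is full, so "b" moves to the archive
print(sorted(store.hot), sorted(store.archive))  # ['a', 'c'] ['b']
```

Unlike a conventional buffer pool, where disk is the primary copy and memory is a cache, here a record lives in exactly one place at a time, either in memory or in the archive.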

Task Leader:  Mike Stonebraker

Task Team:   Ugur Cetintemel, Pradeep Dubey, Andy Pavlo, Stan Zdonik; Justin DeBrabant, Steven Hu

Focus Task:  Moving Database Functionality into Hardware

The Goal:  Investigate how various DBMS-specific computations can be pushed onto hardware, for faster queries and analyses of Big Data.

The Plan:  We are investigating how next-generation trends in Intel hardware will accommodate advanced Big Data algorithms and applications.  The task is two-fold.  First, we want to understand how to use new Intel hardware in building new Big Data systems.  Second, we want to understand what features can most help Intel hardware process Big Data faster.

We will be using advancements in Intel hardware, including Manycore devices and graphics processors, to speed up and simplify data-set generation, compression, record evaluation, filtering, thread movement, and other database operations for Big Data.

Task Leader:  Sam Madden

Task Team:  Arvind, Srini Devadas, Pradeep Dubey, Sam Madden, Andy Pavlo, MIT post-doc

Focus Task:  Scalable Algorithms

The Goal:   Identify new algorithms that are fast enough and deep enough for analyzing Big Data.

The Plan:  We are exploring parallel and scalable algorithms in a DBMS context, looking at such aspects as:

Task Leader:  Sam Madden

Task Team:  Piotr Indyk, Tommi Jaakkola, Sam Madden