ISTC Releases Open Source Code for BigDAWG Polystore System

By Dr. Tim Mattson, Intel and Dr. Vijay Gadepally and Kyle O’Brien, MIT Lincoln Laboratory

Today, the ISTC for Big Data released the first version of BigDAWG, our polystore system for simplifying integration and analytics of disparate data at scale. BigDAWG is open-source software and available for download at bigdawg.mit.edu.

BigDAWG should be of interest to anyone seeking a simpler way to use data that spans multiple data models and data stores, such as research analysts, data scientists and database administrators. We’ve also worked hard to make it easy for anyone to try BigDAWG by releasing the code in a set of Docker containers that automatically run a cluster of three different database engines.

BigDAWG, which stands for “Big Data Working Group,” is the culmination of several years of intense collaborative work by researchers from Intel and several academic institutions. Many of these researchers had previously pioneered concepts and technologies fundamental to the problem we’re trying to solve, giving us a big head start on executing our polystore vision.

Analyzing a complex ocean metagenomics dataset is simplified with BigDAWG

While we’re very excited about where we are, we should note that the concept of a polystore system isn’t necessarily new. We have, however, seen a recent resurgence of interest in polystores among academic and industry researchers, both at numerous conferences and at a workshop on polystores that we organized at the IEEE Big Data 2016 conference in Washington, DC. We were impressed and encouraged by the wave of work going on in this area, both in the US and internationally.

The concept of polystores relies on two fundamental observations: (1) there is no “one-size-fits-all” in databases, making specialty database engines for different types of data (relational, array, text, streaming, etc.) a necessary reality in many applications and (2) complete functionality and performance of underlying database systems must be supported (it’s counterproductive to add layers of middleware that get in the way of core database operations).

Polystore systems tackle the challenges of integrating and accessing heterogeneous data at scale by supporting: (1) multiple storage engines based on different data models and (2) middleware to interact with storage engines through a common interface.

The BigDAWG system is our prototype implementation of our polystore concept. While we have made a start with this version, there is plenty of room for innovation. Please contact us if you’re interested in contributing or have polystore research ideas!

Inside BigDAWG

Writing connectors across multiple disparate engines can quickly lead to an N² problem: you need on the order of N² connectors to integrate N disparate systems. BigDAWG dramatically reduces the need to write a separate connector between each pair of databases that you want to connect. In the course of creating BigDAWG, we worked with two real-life use cases involving complex, multimodal datasets:

  • MIMIC II, an openly available data set developed by the MIT Lab for Computational Physiology, comprising de-identified health data associated with ~40,000 critical care patients. It includes demographics, vital signs, laboratory tests, medications, and more.
  • Ocean metagenomics data from The Chisholm Lab at MIT.

We found that polystore solutions greatly simplify writing complex analytics and that in many cases, there are performance gains to be made by matching the storage engine to the data.
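To make the connector-count problem described at the start of this section concrete, here is a minimal Python sketch of the arithmetic. It assumes directed, point-to-point connectors (so the exact pairwise count is N(N-1), which grows like N²) and a single shared island that every engine connects to once. The function names are hypothetical and are not part of BigDAWG.

```python
def pairwise_connectors(n: int) -> int:
    """Directed connectors needed when every engine talks directly to every other engine."""
    return n * (n - 1)


def island_connectors(n: int) -> int:
    """Connectors needed when each engine connects once to a shared island (an assumption)."""
    return n


# Compare the two approaches for a few cluster sizes.
for n in (3, 10, 50):
    print(f"{n} engines: {pairwise_connectors(n)} pairwise vs {island_connectors(n)} via an island")
```

In practice an engine may belong to more than one island, so the second count is a lower bound under this simplification rather than a guarantee.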

The BigDAWG architecture consists of four distinct layers: database and storage engines; islands; middleware and API; and applications. The initial release of the BigDAWG system supports three open-source database engines: PostgreSQL (SQL), Apache Accumulo (NoSQL), and SciDB (NewSQL) along with support for relational, array and text islands (see Figure 1).

Figure 1: The BigDAWG Polystore Architecture

Islands essentially provide users with an abstraction of a data model and a query language along with a set of candidate database engines. The middleware receives a query and passes it on to the appropriate island or islands for execution. Writing a connector to an island should allow you to communicate with other systems connected to that island. Our initial release supports a relational island, an array island and a text island. Look for more islands, such as a streaming island, in the future.
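The routing idea described above can be sketched in a few lines of Python. This is a toy model only: the `Island` and `Middleware` classes, the `route` method, and the one-engine-per-island pairing are all assumptions made for illustration and do not reflect BigDAWG's actual classes or API. The engines are stand-in functions that echo the query rather than real database connections.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical names throughout; this is not BigDAWG's actual API.
Executor = Callable[[str], str]


@dataclass
class Island:
    """A data model plus the candidate engines that can serve it."""
    name: str
    engines: Dict[str, Executor] = field(default_factory=dict)

    def register(self, engine_name: str, executor: Executor) -> None:
        self.engines[engine_name] = executor

    def execute(self, query: str) -> str:
        if not self.engines:
            raise LookupError(f"no engine registered for island {self.name!r}")
        # Pick the first registered engine; a real planner would choose more carefully.
        executor = next(iter(self.engines.values()))
        return executor(query)


class Middleware:
    """Receives a query and passes it to the named island."""

    def __init__(self) -> None:
        self.islands: Dict[str, Island] = {}

    def add_island(self, island: Island) -> None:
        self.islands[island.name] = island

    def route(self, island_name: str, query: str) -> str:
        if island_name not in self.islands:
            raise LookupError(f"unknown island {island_name!r}")
        return self.islands[island_name].execute(query)


def fake_engine(label: str) -> Executor:
    """Stand-in for a database connection: echoes the query it was given."""
    return lambda query: f"{label} ran: {query}"


middleware = Middleware()
# Assumed pairing of the three islands and three engines in the initial release.
for island_name, engine_name in [
    ("relational", "PostgreSQL"),
    ("array", "SciDB"),
    ("text", "Accumulo"),
]:
    island = Island(island_name)
    island.register(engine_name, fake_engine(engine_name))
    middleware.add_island(island)

print(middleware.route("relational", "SELECT 1"))
```

Adding a new engine in this model means registering it with an island once, which is the property that keeps the connector count from growing quadratically.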

For a more detailed overview of the components in the first BigDAWG release, download this short paper.

Figure 2 shows a user’s-eye view of BigDAWG software components.

Figure 2: The BigDAWG Polystore System Overview

Users primarily interact with the Query Endpoint, which accepts queries, routes them to the Middleware, and responds with results. The Catalog is a PostgreSQL engine containing metadata about the other engines, datasets, islands and connectors managed by the Middleware. While our initial release relies on Docker for simplifying the installation and startup experience, the Middleware can also run on a server and connect to existing database engines. For a detailed description of the Middleware subcomponents, see the paper “The BigDAWG Polystore System and Architecture.”
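As a rough illustration of what talking to a query endpoint might look like from a client, the Python sketch below builds an HTTP POST whose body is the query text. The endpoint URL, the use of a plain-text POST body, the `bdrel(...)` island-scoping wrapper, and the table name are all assumptions made for this example; check the documentation at bigdawg.mit.edu for the actual interface before relying on any of them.

```python
import urllib.request

# Assumed placeholder; check your deployment for the real host, port, and path.
QUERY_ENDPOINT = "http://localhost:8080/bigdawg/query/"


def build_query_request(query: str, endpoint: str = QUERY_ENDPOINT) -> urllib.request.Request:
    """Build (but do not send) an HTTP POST carrying the query text as its body."""
    return urllib.request.Request(endpoint, data=query.encode("utf-8"), method="POST")


def run_query(query: str, endpoint: str = QUERY_ENDPOINT, timeout: float = 30.0) -> str:
    """Send the query to a running endpoint and return the response body as text."""
    request = build_query_request(query, endpoint)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Hypothetical island-scoped query; the wrapper syntax and table name are illustrative only.
request = build_query_request("bdrel(select * from patients limit 4)")
print(request.get_method(), request.full_url)
```

Separating request construction from sending keeps the sketch inspectable without a running cluster; `run_query` only works once the containers are up and the assumed URL matches your setup.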

Trying BigDAWG

To demonstrate how BigDAWG works on real data, the initial release includes scripts that let you download publicly available parts of the MIMIC II medical dataset and load them into suitable engines. Patient history data is inserted into PostgreSQL, physiologic waveform data is inserted into SciDB, and free-form text data is inserted into Accumulo.

You can launch the Middleware and database engines and issue cross-engine queries – the entire process is automated. We’ve included a number of example queries and an administrative interface to start, stop and view the status of a BigDAWG cluster. In a few minutes, you can have three databases running in containers and issue queries to them without having to install the databases permanently.

We hope that you will try BigDAWG with MIMIC II and let us know what you think – including suggestions for adding modules to connect to another database system. Our goal is to make things easy and as automated as possible.

We also hope that BigDAWG will be a platform that stimulates our fellow database researchers to explore further questions such as:

  • What are the appropriate abstractions in query languages, data transformations, and data representations behind a polystore system?
  • What does an API that speaks across different languages look like?
  • Can we leverage engine-specific capabilities to optimize queries when data is distributed across heterogeneous engines?

Please visit our project page at bigdawg.mit.edu for more details, graphics, and a full list of links to papers, presentations, and other resources.

Finally, we’d like to acknowledge all of our collaborators on BigDAWG, including:

  • Professor Magdalena Balazinska, University of Washington
  • Professor Ugur Cetintemel, Brown University
  • Peinan Chen, MIT CSAIL
  • Adam Dziedzic, University of Chicago
  • Professor Aaron J. Elmore, University of Chicago
  • Brandon Haynes, University of Washington
  • Professor Jeffrey Heer, University of Washington
  • Professor Bill Howe, University of Washington
  • Dr. Jeremy Kepner, MIT Lincoln Laboratory
  • Professor Tim Kraska, Brown University
  • Professor Samuel Madden, MIT CSAIL
  • Professor David Maier, Portland State University
  • Dr. Stavros Papadopoulos, Intel
  • Dr. Jeff Parkhurst, Intel
  • Surabhi Ravishankar, Northwestern University
  • Professor Jennie Rogers, Northwestern University
  • Zuohao She, Northwestern University
  • Professor Michael Stonebraker, MIT CSAIL
  • Dr. Nesime Tatbul, Intel
  • Dr. Kristin Tufte, Portland State University
  • Manasi Vartak, MIT CSAIL
  • Professor Stan Zdonik, Brown University

Additional Reading:

Methods to Manage Heterogeneous Big Data and Polystore Databases. Workshop at IEEE Big Data 2016.

Genomics Data, Analytics and the Future of Climate Change. ISTC for Big Data Blog, August 12, 2016.

Stonebraker, M., & Cetintemel, U. (2005, April). “One size fits all”: An idea whose time has come and gone. In 21st International Conference on Data Engineering (ICDE’05) (pp. 2-11). IEEE.

J. Duggan, A. J. Elmore, M. Stonebraker, M. Balazinska, B. Howe, J. Kepner, S. Madden, D. Maier, T. Mattson, S. Zdonik. “The BigDAWG Polystore System,” ACM Sigmod Record, 44(3), 2015.

Gadepally, V., Chen, P., Duggan, J., Elmore, A., Haynes, B., Kepner, J., Madden, S., Mattson, T. & Stonebraker, M. (2016, December). The BigDAWG polystore system and architecture. In High Performance Extreme Computing Conference (HPEC), 2016 IEEE (pp. 1-6). IEEE.

Chen, P., Gadepally, V., & Stonebraker, M. (2016, December). The BigDAWG monitoring framework. In High Performance Extreme Computing Conference (HPEC), 2016 IEEE (pp. 1-6). IEEE.

Gupta, A. M., Gadepally, V., & Stonebraker, M. (2016, December). Crossengine query execution in federated database systems. In High Performance Extreme Computing Conference (HPEC), 2016 IEEE (pp. 1-6). IEEE.

She, Z., Ravishankar, S., & Duggan, J. (2016, December). BigDAWG polystore query optimization through semantic equivalences. In High Performance Extreme Computing Conference (HPEC), 2016 IEEE (pp. 1-6). IEEE.

Dziedzic, A., Elmore, A. J., & Stonebraker, M. (2016, December). Data transformation and migration in polystores. In High Performance Extreme Computing Conference (HPEC), 2016 IEEE (pp. 1-6). IEEE.

Elmore, A., Duggan, J., Stonebraker, M., Balazinska, M., Cetintemel, U., Gadepally, V., Heer, J., Howe, B., Kraska, T., Madden, S., Maier, D., Mattson, T., Papadopoulos, S., Parkhurst, J., Tatbul, N., Vartak, M. & Zdonik, S. (2015). A Demonstration of the BigDAWG polystore system. Proceedings of the VLDB Endowment, 8(12), 1908-1911.

Saeed, M., Villarroel, M., Reisner, A. T., Clifford, G., Lehman, L. W., Moody, G., & Mark, R. G. (2011). Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II): a public-access intensive care unit database. Critical Care Medicine, 39(5), 952.
