The Big Data ISTC: A Retrospection by Michael Stonebraker, Samuel Madden and Timothy Mattson


The Big Data ISTC was a research project sponsored by Intel that ran for five years (August 2012 to August 2017). This blog post highlights some of the accomplishments and lessons learned during this period.

Big data problems are usually categorized into one of three areas (the so-called three Vs):

  • Volume: A user has a big data problem because he has too much of it and is having trouble managing it all.
  • Velocity: A user has a big data problem because the data is coming at him too fast and his software cannot keep up.
  • Variety: A user has a big data problem because the data is coming at him from too many places and he can’t integrate it effectively.

At the beginning of the ISTC project, we elected to focus on the first two Vs, leaving the variety problem for later research.

Our work can be broken down into the following themes:

  • Innovative big-volume storage engines
  • Big velocity engines and optimizations
  • Big data visualization
  • Polystores

Below we discuss each in turn.

Innovative Big-Volume Storage Engines

One of the key drivers of the Big Volume problem is the gradual replacement of business intelligence (BI) interfaces by data science interfaces. Data science applications revolve around the following paradigm:

Until (tired) {
    Data management
    Complex analytics
}

Data management entails finding the actual data set of interest, typically using a mix of SQL, extract-transform-load (ETL), and data cleaning routines. To a first approximation, this is SQL with enhancements. Complex analytics revolve primarily around linear algebra codes (linear regression, k-means clustering, singular value decomposition, etc.). Relational DBMSs, however, are not particularly good at array computation and management. Hence, in our ISTC, we leveraged two main ideas:

Array storage and management: If one is performing primarily array computations, it would make sense to use arrays (and not tables) as the basic building block.

Tight coupling: Performance is almost always optimized by tightly coupling storage and computation. In the relational world, user-defined functions (UDFs) and stored procedures do exactly this. Our thought was to do a similar thing for arrays.
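
To make this paradigm concrete, here is a minimal sketch of the loop, with the data management half in SQL and the complex analytics half as linear algebra on an array. The database file, table, and columns are hypothetical placeholders; the point is only that the second half of each pass is array computation, which is exactly where a relational engine is weakest.

    # A minimal sketch of the "until (tired)" loop. The database file and the
    # table readings(temp, pressure) are hypothetical.
    import sqlite3
    import numpy as np

    conn = sqlite3.connect("readings.db")     # hypothetical data set

    for iteration in range(10):               # "until (tired)" -- here, a fixed budget
        # Data management: find and shape the data set of interest (SQL plus cleaning).
        rows = conn.execute(
            "SELECT temp, pressure FROM readings WHERE temp IS NOT NULL"
        ).fetchall()
        X = np.array(rows, dtype=float)       # hand the result off to array-land

        # Complex analytics: linear algebra on the array (here, an SVD of the centered data).
        U, s, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
        print(f"iteration {iteration}: leading singular value = {s[0]:.3f}")

        # In practice the analyst inspects the result and refines the SQL, the
        # cleaning rules, or the model before the next pass.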

At the start of the project, we had just finished building SciDB, an embodiment of these ideas, and we used it widely. In addition, we built another array implementation, TileDB, which is presently being commercialized. Lastly, we built a very-high-performance, low-level storage manager, Tupleware, which uses compilation techniques to achieve high performance on primitive operations. All three systems are innovative contributions toward addressing the Big Volume problem.

Finally, data science entails constructing prediction models. Each model comes with training data and tuning parameters. Keeping track of all this information is known as model management, and we built an innovative system, ModelDB, to address this challenge.
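
As a rough illustration of the bookkeeping involved, the sketch below shows the kind of metadata a model-management system has to capture for every trained model: the provenance of the training data, the tuning parameters, and the resulting metrics. This is a generic sketch, not ModelDB's actual API.

    # Illustrative only: the metadata a model-management system records per model.
    # A generic sketch, not ModelDB's actual client API; all names are made up.
    from dataclasses import dataclass, field
    from datetime import datetime

    @dataclass
    class ModelRecord:
        model_id: str          # stable identifier for the trained artifact
        training_data: str     # provenance: the query or file that produced the data
        hyperparameters: dict  # tuning parameters used for this run
        metrics: dict          # evaluation results (accuracy, RMSE, ...)
        created_at: datetime = field(default_factory=datetime.utcnow)

    registry = []              # a real system persists and indexes these records
    registry.append(ModelRecord(
        model_id="churn-logreg-v3",
        training_data="SELECT * FROM customers WHERE signup_date < '2017-01-01'",
        hyperparameters={"penalty": "l2", "C": 0.5},
        metrics={"auc": 0.87},
    ))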


Big Velocity Engines and Optimizations

To solve the velocity problem, one makes an assumption from the get-go: either one ensures no messages are lost and applies “exactly-once” semantics to message processing, or one is willing to “wing it.” The first case is appropriate for applications with important messages (trading messages, money movement, etc.), where applications want a guarantee to “never lose my messages.” The second is acceptable when losses are tolerable; if one is tagging wildlife in the woods, losing messages just means that the current location of an object is less precise. From the beginning, we chose to focus on the “important message” case and wrote a very innovative extension to a high-performance On-Line Transaction Processing (OLTP) system, called S-Store. It guarantees there is no message loss and simultaneously runs at higher performance than the current “wing-it” solutions. In effect, you can have your cake and eat it too.
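
The guarantee can be illustrated with a toy sketch in which dequeueing a message, applying it to state, and recording that it was processed all commit in a single transaction, so a crash leaves the message either still queued or fully applied, never lost or double-counted. This stands in for the exactly-once semantics only; it is not how S-Store is implemented.

    # Toy exactly-once processing: dequeue, state update, and progress marker
    # commit together. Table names and the "balance" workload are made up.
    import sqlite3

    db = sqlite3.connect("stream.db", isolation_level=None)   # autocommit; we manage transactions
    db.executescript("""
        CREATE TABLE IF NOT EXISTS queue(msg_id INTEGER PRIMARY KEY, amount REAL);
        CREATE TABLE IF NOT EXISTS balance(total REAL);
        CREATE TABLE IF NOT EXISTS processed(msg_id INTEGER PRIMARY KEY);
        INSERT INTO balance SELECT 0 WHERE NOT EXISTS (SELECT 1 FROM balance);
    """)

    def process_next():
        """Apply the oldest queued message exactly once; return False if the queue is empty."""
        db.execute("BEGIN IMMEDIATE")
        row = db.execute("SELECT msg_id, amount FROM queue ORDER BY msg_id LIMIT 1").fetchone()
        if row is None:
            db.execute("COMMIT")
            return False
        msg_id, amount = row
        db.execute("UPDATE balance SET total = total + ?", (amount,))  # apply the message
        db.execute("INSERT INTO processed VALUES (?)", (msg_id,))      # remember that we did
        db.execute("DELETE FROM queue WHERE msg_id = ?", (msg_id,))    # and dequeue it
        db.execute("COMMIT")   # all three effects become durable together, or none do
        return True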

In addition, at the request of Intel, we explored the applicability of NVRAM in high-velocity applications. We built several extensions to H-Store, a previously written high-performance OLTP system. We quickly learned that NVRAM is too slow to replace main memory and too expensive to replace disk, so it is expected to become another level in the storage hierarchy. In addition, we invented innovative crash recovery schemes that are enabled by NVRAM technology.

Finally, we began an investigation into coupling a data warehouse system (volume) with a main-memory message processing system (velocity). One option is to run a “best of breed” system in each category and put them in tandem architecturally. A second option is to design a single system capable of running a mixed workload. This project, Peloton, is in its early stages.

Big Data Visualization

An obvious challenge in applications with large data volume is how to present information to a human user. Historically, information from data warehouses was rendered by business intelligence systems, which would construct simple SQL aggregates. However, such BI interfaces are not very useful in data science applications, where the user is searching for much more complex interactions. In this world, a “detail-on-demand” paradigm seems more appropriate: a human can browse data at scale, zooming into regions of interest. Of course, optimizing detail-on-demand systems is crucial, and we built ScalaR to do exactly this. It performs move prediction, cache optimization, and just-in-time “data cooking.”
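
A rough sketch of the idea: the client only ever asks for an aggregated tile covering the current viewport, and the server predicts the next move and warms its cache accordingly. The tiling scheme and the one-step prediction rule below are hypothetical placeholders, not ScalaR's actual algorithms.

    # Toy detail-on-demand browsing: serve an aggregated tile for the current
    # viewport, then prefetch one tile ahead of the last pan direction.
    from functools import lru_cache

    TILE = 256   # data units per tile edge (hypothetical)

    @lru_cache(maxsize=1024)                   # cache optimization: reuse "cooked" tiles
    def cook_tile(zoom, tx, ty):
        """Aggregate raw data into a fixed-size summary for tile (tx, ty)."""
        # Placeholder: a real system would run an aggregation query here.
        return {"zoom": zoom, "tile": (tx, ty), "summary": f"aggregate of {TILE}x{TILE} cells"}

    def on_pan(zoom, tx, ty, dx, dy):
        """Serve the requested tile, then warm the cache in the direction of the pan."""
        visible = cook_tile(zoom, tx, ty)
        cook_tile(zoom, tx + dx, ty + dy)      # move prediction: one step ahead
        return visible

    print(on_pan(zoom=3, tx=10, ty=7, dx=1, dy=0))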

In addition, in cases where custom visualizations are required, one wants a library of building blocks to facilitate easy construction of custom renderings. We built a toolkit, D4M, as well as a higher-level component library, Vega, that runs on top of D4M.

Polystores

There is an adage, coined by one of us in 2005, that “one size does not fit all.”  In other words, in any complex problem, a user will require multiple DBMSs with different characteristics.  Supporting an application with multiple storage engines, multiple data models, and multiple query languages is a required feature.  One option is to use the relational model as a kind of Esperanto, and try to map other query languages and data models to it. This was the approach largely followed by the Myria project.

The other approach is to assume that there are some data storage systems that are unmappable to a common framework and that multiple data models and query languages are required. This was the approach taken by BigDAWG.

Both systems are operational and represent innovative experimental systems. Both are also available as open source prototypes.
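
To make the second approach concrete, here is a toy polystore router in which each sub-query stays in the native language of its engine and the middleware merely dispatches the pieces and hands the results back for stitching. The engine names and query strings are hypothetical; this is not the BigDAWG or Myria API.

    # A toy polystore router: sub-queries remain in each engine's native language,
    # and the middleware only dispatches them. Everything here is illustrative.

    def run_relational(sql):
        # Stand-in for a relational engine executing SQL.
        return [("sensor_7", 21.4)]

    def run_array(expr):
        # Stand-in for an array engine evaluating an array expression.
        return [[0.1, 0.9], [0.8, 0.2]]

    ENGINES = {"relational": run_relational, "array": run_array}

    def polystore_query(plan):
        """plan: list of (engine_name, native_query) pairs; the caller stitches the results."""
        return {engine: ENGINES[engine](query) for engine, query in plan}

    results = polystore_query([
        ("relational", "SELECT sensor_id, avg(temp) FROM readings GROUP BY sensor_id"),
        ("array",      "covariance(readings_matrix)"),
    ])
    print(results)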

Summary

We thank Intel for the financial support, contributions by Intel researchers, and the freedom to develop and advance these technologies, including getting many into production (open source). We also sincerely thank all the Principal Investigators and their students at our participating institutions: Brown University, Carnegie Mellon, CSAIL at MIT, Northwestern University, Portland State University, University of Chicago, University of Tennessee, and University of Washington. With their expertise, energy, and imagination, we were able to accomplish a lot in a short amount of time (for academic research). Finally, we’d like to thank our Advisory Board. It’s been a productive and enjoyable collaboration all around.

The projects we mentioned continue. We hope you will follow those of interest and use and/or contribute to the source code.

Links

BigDAWG
D4M
H-Store
ModelDB
Myria
NVRAM
Peloton
ScalaR
SciDB
S-Store
TileDB
Tupleware
Vega

Author Biographies:

Michael Stonebraker, PhD, is co-founder of the ISTC for Big Data. He is an adjunct professor of computer science at CSAIL MIT and winner of the 2014 Turing Award for his fundamental contributions to the concepts and practices underlying modern database systems. A pioneer in database systems architecture, he is credited with bringing relational database systems from concept to commercialization and setting the research agenda for the multibillion-dollar database field for decades.

Samuel Madden, PhD, is co-founder of the ISTC for Big Data. He is a professor of computer science at CSAIL MIT. His research is in the area of database systems, including main memory databases, data warehousing/analytics, database-as-a-service, and querying data streams and networks of distributed devices such as wireless sensor networks.

Timothy Mattson, PhD, is the Intel Principal Investigator for the ISTC for Big Data and a senior principal engineer in the Intel Parallel Computing Lab. His research focuses on technologies that help programmers write parallel applications, including programming languages (OpenMP and OpenCL), parallel design patterns, and parallel math libraries. Dr. Mattson was lead author on the CIDR 2017 paper “Demonstrating the BigDAWG Polystore System for Ocean Metagenomic Analysis.”
