In 2016, ISTC for Big Data principal investigators, researchers and their students continued to break down the barriers to data analytics at scale, with creative new approaches and infrastructure software. Developments are being integrated into BigDAWG, the next-generation polystore architecture being built by our multi-institution team. BigDAWG will be featured in papers presented next week (January 8-11) at the biennial Conference on Innovative Data Systems Research (CIDR) in Chaminade, California.
Polystores are a modern approach to sharing heterogeneous data that addresses big data's volume, variety, and velocity demands. Polystore systems are database federations designed to support many disparate data models, enabling people to ask and answer complex questions that span diverse data sets. We unveiled BigDAWG in August 2015, demonstrating promising early results in analyzing heterogeneous, multi-modal medical data (MIMIC II).
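To make the polystore idea concrete, here is a minimal, hypothetical sketch (this is not BigDAWG's actual API; the mediator, store names, and query are illustrative only): a single question spans two engines with different data models, and a thin mediator combines the partial results.

```python
# Hypothetical polystore sketch: a relational engine and a key-value
# engine each hold part of the data; a mediator answers a question
# that neither engine can answer alone.
import sqlite3

# Engine 1: a relational store holding patient records.
rel = sqlite3.connect(":memory:")
rel.execute("CREATE TABLE patients (id INTEGER, name TEXT)")
rel.executemany("INSERT INTO patients VALUES (?, ?)",
                [(1, "Ada"), (2, "Grace")])

# Engine 2: a key-value store holding per-patient time-series summaries.
kv = {1: {"max_hr": 142}, 2: {"max_hr": 118}}

def cross_store_query(min_hr):
    """Names of patients whose max heart rate exceeds min_hr --
    requires data from both engines."""
    # Subquery 1: filter in the key-value store.
    ids = [pid for pid, stats in kv.items() if stats["max_hr"] > min_hr]
    # Subquery 2: resolve names in the relational store.
    rows = rel.execute(
        "SELECT name FROM patients WHERE id IN (%s)" % ",".join("?" * len(ids)),
        ids,
    ).fetchall()
    return [name for (name,) in rows]

print(cross_store_query(130))  # ['Ada']
```

A real polystore like BigDAWG goes far beyond this sketch: it plans where each subquery should run, migrates and casts data between engines, and exposes the whole federation behind one query interface.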
Polystores Meet the Real World
In 2016, we continued to evolve and refine BigDAWG, demonstrating its potential to enable data federation at scale.
We teamed up with MIT’s Chisholm Lab to test BigDAWG on complex scientific data: in this case, ocean metagenomics data for predicting climate change and other events. (The research team will present its paper on this application at CIDR 2017 on January 9.)
We worked on optimizing query modeling and processing in BigDAWG; speeding up data migration and transformation in polystore systems; integrating real-time data analytics using stream processing; increasing the sophistication and utility of cloud models (e.g., analytics-as-a-service); enabling analytic monitoring for the Internet of Things; and improving visual analysis of large data sets.
Other researchers worked (often with their counterparts at Intel Labs) on harnessing the latest hardware and system-level software for big data analytics. Innovations included write-behind logging and other new approaches for using non-volatile memory (NVM); larger-than-memory data management on modern storage hardware for in-memory OLTP databases; and a scalable, high-performance concurrency control algorithm for future multi- and many-core systems.
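The write-behind logging idea can be sketched in a few lines. This is a hypothetical, simplified simulation (the `NVMStore` class and function names are illustrative, not the researchers' implementation): with byte-addressable NVM, a transaction writes its changes durably in place first, and the log records only a small commit marker afterward, so recovery needs no redo pass.

```python
# Hypothetical sketch of write-behind logging on NVM-like storage.
# Contrast with write-ahead logging, where log records must be made
# durable BEFORE the data pages are touched.

class NVMStore:
    """Simulates byte-addressable non-volatile memory."""
    def __init__(self):
        self.data = {}        # durable table: key -> (txn_id, value)
        self.commit_log = []  # tiny log: commit markers written AFTER the data

def wbl_write(store, txn_id, key, value):
    # Write-behind: mutate the durable table directly; no before/after
    # images are logged ahead of the change.
    store.data[key] = (txn_id, value)

def wbl_commit(store, txn_id):
    # The commit marker is written only after the data is already durable.
    store.commit_log.append(txn_id)

def wbl_recover(store):
    # No redo needed: committed writes are already in place. Recovery
    # simply discards writes from transactions with no commit marker.
    committed = set(store.commit_log)
    return {k: v for k, (t, v) in store.data.items() if t in committed}

store = NVMStore()
wbl_write(store, txn_id=1, key="a", value=10)
wbl_commit(store, 1)
wbl_write(store, txn_id=2, key="b", value=20)  # crash before commit
recovered = wbl_recover(store)
print(recovered)  # {'a': 10}
```

The payoff in the real systems work is that the log shrinks to near-constant size and restart becomes almost instantaneous, because NVM makes in-place updates durable without a buffer-pool flush.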
Open-Source Software Available
Researchers continued to make software for BigDAWG component technologies available in open source.
In April, we announced the open-source availability of TileDB, a novel, efficient data management system for scientific data such as graphs, DNA sequences, matrices, maps, and imaging; and GenomicsDB, an adaptation of TileDB for genomics, developed in collaboration with Intel Health and Life Sciences. The Broad Institute, one of the world's leading genomics research institutes, is currently using TileDB in its production pipeline.
Download links to other component technologies, including H-Store, Julia, Myria, and PipeGen, are available on our Software page.
Pushing the Edge
In 2016, our researchers also developed creative approaches to big data management, including hybrid analytics; interactive search and exploration over multidimensional data; automated management of machine learning models; and collaborative data management.
Still other researchers explored different paths, including an alternative approach to polystore optimization; rethinking distributed DBMS design in light of next-generation networks; and bringing ad hoc querying to high-performance computing (HPC) languages.
For a deeper look at our work and to keep up with 2017 research, visit our research blog.