ISTC for Big Data Researchers Present Work at NEDB Day 2015

ISTC for Big Data principal investigators (PIs) and researchers presented a broad base of research work at New England Database Day 2015, which was sponsored by Microsoft and held at the Stata Center at MIT in Cambridge, Mass., on Friday, January 30, 2015:

  • Kristin Tufte (Portland State University) presented “Adventures in Transportation from a (Big) Data Perspective.” She gave a tour through transportation data sources from a data management perspective, discussing variations in transportation data and the need to combine data from varied sources to provide a complete picture of a transportation system. She explained the value that computer science research contributes to transportation. Read more about Dr. Tufte’s work here.

  • Carsten Binnig (on sabbatical at Brown University), on behalf of himself and five co-authors, presented “I-Store: Data Management for Fast Networks.” He explained that system designers still assume the network is a bottleneck and therefore try to avoid remote data transfers. However, modern RDMA-capable networks such as InfiniBand FDR/EDR make remote data transfers almost as fast as transfers from CPU to memory. He then discussed the effect of this development on design decisions for distributed data management systems for OLTP and OLAP workloads.

  • Andy Pavlo (Carnegie Mellon University), speaking for himself and four co-authors, presented “Staring into the Abyss: An Evaluation of Concurrency Control with One Thousand Cores.” He observed that today’s DBMSs were not designed for this degree of parallelism: the complexity of coordinating competing accesses to data will likely diminish the gains from increased core counts. A recent evaluation of concurrency control schemes for OLTP workloads showed that all the algorithms tested fail to scale to 1,024 cores. Many-core chips may require a redesigned DBMS architecture, built from the ground up and tightly coupled with the hardware. Read more about this work here.
  • Jeremy Kepner (MIT), on behalf of himself and three co-authors, presented “Associative Arrays: Unified Mathematics for Spreadsheets, Databases, Matrices, and Graphs.” He gave this background: Associative arrays unify and simplify different approaches for representing and manipulating data into two-dimensional views. Specifically, associative arrays (1) facilitate passing data between steps, (2) allow steps to be safely interchanged, and (3) help simplify or eliminate steps. He explained that most database systems naturally support associative arrays via their tabular interfaces and that the D4M implementation of associative arrays uses this feature to provide a common interface across SQL, NoSQL, and NewSQL databases.
  • Alekh Jindal (MIT), speaking for himself and co-author Sam Madden, presented “Preparing Data for the Data Lake.” He warned that data preparation is increasingly becoming one of the biggest challenges in processing big data. While recent tools such as Tamr and Trifacta address the problem of integrating and cleaning datasets as they come in, preparing these datasets for efficient processing over a variety of query workloads is still challenging. He said a new tool allows for fine-grained data preparation, via a data preparation plan, and efficiently runs this plan while uploading the data to HDFS. Read more about this work here.
  • Aaron Elmore (MIT CSAIL, now of the University of Chicago), on behalf of himself and five co-authors, presented “The BigDawg Architecture and Reference Implementation,” a new architecture for future big-data applications. Such applications require “big analytics,” real-time streaming support, real-time analytics, data visualization, and cross-storage queries. Because “one size does not fit all,” the implementation builds on top of three storage engines, each designed for a specialized use case, plus novel support for querying across multiple storage engines and new approaches to data visualization.
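Kepner’s associative arrays can be pictured as sparse two-dimensional tables whose rows and columns are keyed by strings, with element-wise algebra for composing analysis steps. The toy Python class below is a minimal sketch of that idea, not D4M’s actual API; the class name, methods, and sample data are illustrative assumptions. It shows how results from two pipeline steps over the same key space combine by simple addition:

```python
# Toy illustration of an associative array: a mapping from
# (row key, column key) pairs to values, giving a sparse
# two-dimensional "spreadsheet" view over the data.
# Hypothetical sketch only -- not D4M's real interface.

class Assoc:
    def __init__(self, triples):
        # triples: iterable of (row, col, value)
        self.data = {(r, c): v for r, c, v in triples}

    def __add__(self, other):
        # Element-wise merge: values on matching (row, col) keys add,
        # so analysis steps can be composed or safely interchanged.
        merged = dict(self.data)
        for key, v in other.data.items():
            merged[key] = merged.get(key, 0) + v
        return Assoc((r, c, v) for (r, c), v in merged.items())

    def row(self, r):
        # Select one row, as a plain {column: value} dict.
        return {c: v for (rr, c), v in self.data.items() if rr == r}

# Two analysis "steps" produce arrays over the same key space...
a = Assoc([("doc1", "apple", 1), ("doc1", "pear", 2)])
b = Assoc([("doc1", "apple", 3), ("doc2", "pear", 1)])
# ...and combining their results is just element-wise addition.
print((a + b).row("doc1"))  # {'apple': 4, 'pear': 2}
```

In D4M-style systems, this same tabular abstraction maps naturally onto database tables, which is what lets one interface span SQL, NoSQL, and NewSQL stores.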

In addition to the presenters, ISTC Big Data researchers and teams displayed posters that described their current research projects:

  • Leilani Battle (MIT): Making Sense of Temporal Queries with Fine-Grained Provenance
  • Vijay Gadepally (MIT), Sherwin Wu (Quora), Jeremy Kepner (MIT), Samuel Madden (MIT): Sifter: A Generalized, Efficient, and Scalable Big Data Corpus Generator
  • Ashley Conard (MIT Lincoln Laboratory), Stephanie Dodson (Brown University), Jeremy Kepner (MIT), Darrell Ricke (MIT Lincoln Laboratory): Using a Big Data Database to Identify Pathogens in Protein Data Space
  • Kayhan Dursun (Brown University), Ugur Cetintemel (Brown University), Tim Kraska (Brown University), Carsten Binnig (on sabbatical at Brown University), Stan Zdonik (Brown University): HashStash: An Abstraction to Share and Reuse Intermediate Hash Tables for In-Memory Analytics
  • Aaron Elmore (University of Chicago, MIT CSAIL): DataHub: Collaborative Data Science & Dataset Version Management at Scale
  • Holger Pirk (MIT CSAIL): Matching in Lockstep: Latch-Free Parallelism through Instruction Sharing
  • Carsten Binnig (on sabbatical at Brown University), Ugur Cetintemel (Brown University), Tim Kraska (Brown University), Stan Zdonik (Brown University): Human-in-the-Loop Data Management
  • John Meehan (Brown University), Nesime Tatbul (Intel, MIT), Stan Zdonik (Brown University), Hawk Wang (MIT), Cansu Aslantas (Brown University), Andy Pavlo (Carnegie Mellon University), Michael Stonebraker (MIT), Sam Madden (MIT), Ugur Cetintemel (Brown University), Tim Kraska (Brown University): S-Store: Streaming Meets Transaction Processing

For short abstracts on these posters, go here.

