An Interview with Jeremy Kepner of MIT Lincoln Laboratory
MIT recently announced that the MIT Lincoln Laboratory Supercomputing Center (LLSC) system, housed in Holyoke, Massachusetts, has been ranked the most powerful supercomputer in New England. We caught up with ISTC for Big Data Principal Investigator and Lincoln Laboratory fellow Dr. Jeremy Kepner, who heads the LLSC, to learn more about the supercomputer, how it’s helping ISTC research, and his work for the ISTC.
What role have you and MIT Lincoln Laboratory been playing in the ISTC for Big Data?
My team has been providing the big data computing resources, data sets, and demo integration for the BigDAWG polystore system, the ISTC’s capstone project to radically simplify big data management. Our team works on a lot of big data projects and has many decades of experience integrating large and complex data sets to address challenging scientific problems.
How will the new LLSC supercomputer help with the ISTC work?
With its new Dell EMC petaflop-scale supercomputer, the LLSC has 6 times more processing power and 20 times more bandwidth than its predecessor. This is good news for the more than 1,000 researchers—including ISTC for Big Data researchers—who depend on it in their work.
Researchers use our interactive supercomputing resources to augment their desktop systems—to process large sets of sensor data, create high-fidelity simulations, develop new algorithms and do other compute-intensive work. With the new system, the ISTC can now scale up our BigDAWG architecture to run on much larger platforms.
Can you elaborate a bit on how your own research interests—specifically D4M with its associative arrays—are contributing to the BigDAWG polystore system?
My primary interests are in high-performance computing, parallel algorithms, and computational software. The BigDAWG polystore system is next-generation federation middleware that supports many different data models and databases. It uses a concept called islands of information to unite many, diverse query processing engines, so users can make complex, cross-database queries using their current tools (for example, SQL) and get answers fast and simply.
Associative arrays provide a mathematical model that may encompass many of the diverse databases in BigDAWG. The ability to describe database representations and queries within a single mathematical model is a strong indication that BigDAWG is on the right track.
What’s the latest on your work on BigDAWG?
D4M is one of two cross-system islands implemented in BigDAWG. The other is Myria, from the University of Washington. Each offers a different interface to an overlapping set of back-end database engines.
Myria has adopted a programming model of relational algebra extended with iteration. Among other engines, it includes shims (cross-database translators) to SciDB and Postgres. Myria includes a sophisticated optimizer to efficiently process its query language.
On the other hand, D4M uses a new data model, associative arrays, as an access mechanism for existing data stores. This data model unifies multiple storage abstractions, including spreadsheets, matrices, and graphs. D4M has a query language that includes filtering, subsetting, and linear algebra operations, and it contains shims to Accumulo, SciDB and Postgres.
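To make the associative-array idea concrete, here is a minimal sketch in Python. The class and method names (`Assoc`, `rows`, `filter`, `matmul`) are illustrative assumptions, not the real D4M API (D4M itself is a MATLAB/Python toolbox); the point is only that one sparse mapping from (row key, column key) to value can be subsetted like a table, filtered like a spreadsheet, and multiplied like a matrix, where multiplication on an adjacency array amounts to graph traversal.

```python
class Assoc:
    """Sparse associative array: maps (row key, column key) -> value.
    Hypothetical stand-in for a D4M-style associative array."""

    def __init__(self, triples):
        # triples: iterable of (row, col, value)
        self.data = {(r, c): float(v) for r, c, v in triples}

    def rows(self, keep):
        """Subsetting: keep only the given row keys (like a SQL WHERE on row)."""
        return Assoc((r, c, v) for (r, c), v in self.data.items() if r in keep)

    def filter(self, pred):
        """Filtering: keep entries whose value satisfies a predicate."""
        return Assoc((r, c, v) for (r, c), v in self.data.items() if pred(v))

    def matmul(self, other):
        """Linear algebra: A*B sums products over shared keys.
        On an adjacency array this yields two-hop reachability."""
        out = {}
        for (r, k), v in self.data.items():
            for (k2, c), w in other.data.items():
                if k == k2:
                    out[(r, c)] = out.get((r, c), 0.0) + v * w
        return Assoc((r, c, v) for (r, c), v in out.items())


# The same data read as a table of edges and as a graph adjacency matrix:
edges = Assoc([("alice", "bob", 1), ("bob", "carol", 1), ("bob", "dave", 1)])
print(sorted(edges.matmul(edges).data))
# two-hop neighbors of alice: [('alice', 'carol'), ('alice', 'dave')]
```

Because every operation returns another associative array, the same handful of primitives covers spreadsheet-, matrix-, and graph-style queries, which is the unification D4M relies on.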
BigDAWG currently consists of a "scope-cast facility" with island implementations from Myria and D4M, along with degenerate islands for three production databases (Accumulo, SciDB, and Postgres). Each degenerate island exposes the full functionality of a single storage engine, so the federation offers the union of its underlying engines' capabilities and users don't lose anything their databases can already do. We also incorporate access to other experimental database systems. You can read more about the BigDAWG architecture in our paper, "A Demonstration of the BigDAWG Polystore System."
You’ve got a new book coming out soon from MIT Press: “The Mathematics of Big Data.” What’s it about?
Broadly, it’s about the role of mathematics in taking full advantage of big data. Specifically, it provides a unifying mathematical framework for representing data through all the steps of a machine learning system, one that can reduce the front-end processing that typically consumes 90% of the time and effort in building such systems.
Big data machine learning systems encompass the entire process of parsing, ingesting, querying, and analyzing data to make predictions—the biggest promise of big data. Front-end processing approaches for enabling machine learning systems include data representation, graph construction, graph traversal, and graph centrality metrics. In many cases, well-designed front-end processing can significantly reduce the complexity of the back-end machine learning algorithm and allow a simpler algorithm to be used. “The Mathematics of Big Data” describes the mathematical basis for implementing these approaches.
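As a toy illustration of the front-end steps named above, the sketch below (not from the book; the record data and variable names are invented for the example) parses raw records into a graph and computes degree centrality, a cheap structural feature that a downstream learning algorithm could consume directly instead of the raw records.

```python
from collections import defaultdict

# Raw records, e.g. pairs of entities observed together in sensor data.
records = [
    ("alice", "bob"), ("alice", "carol"),
    ("bob", "carol"), ("carol", "dave"),
]

# Graph construction: parse records into an undirected adjacency structure.
adj = defaultdict(set)
for src, dst in records:
    adj[src].add(dst)
    adj[dst].add(src)

# Graph centrality: degree centrality as a simple per-node feature.
degree = {node: len(nbrs) for node, nbrs in adj.items()}
top = max(degree, key=degree.get)
print(degree)  # carol is connected to three of the four nodes
print(top)     # 'carol'
```

Feeding features like `degree` to a classifier, rather than the raw pairs, is one small instance of front-end processing letting a simpler back-end algorithm do the job.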
1/19/2017 update: Read more about MIT Lincoln Laboratory’s approach to developing algorithms that will keep their users productive as new processing technologies evolve in this article from insideHPC.