At VLDB 2015, the ISTC for Big Data team presented a demo of the BigDAWG polystore system. This blog post highlights some of the research challenges we are exploring as we build this novel system. As we extend the VLDB prototype, we are investigating how to optimize queries and model database performance in a polystore.
Polystore systems are database federations that are designed to support many disparate data models. They unite many, diverse query processing engines using islands of information. Each island is a user-facing abstraction that informs the polystore how to interpret all or part of a query. An island has a data model, query language, and shims to connect the federator to backends that support some or all of the island’s functionality. Examples of islands include ones for graph and array analytics.
Some islands are degenerate, and they represent the complete functionality of one of the federator’s databases. When taken together, a collection of islands offers semantic completeness to the user – this system has the same expressiveness as the union of its underlying engines without miring the user in details about where or how their query will be run. Islands are composable, hence multiple islands may be referenced in a single BigDAWG query. Each island is enclosed in a SCOPE command. For example, a polystore user might write a query
ARRAY(multiply(A, RELATIONAL(SELECT * FROM coords))
to multiply a matrix with tuples from a relational table. For more details on the BigDAWG architecture, please see our blog post on this topic.
Because polystore systems unite many disparate databases, it is simply not tenable to create fine-grained models of each of its engines individually like one would with an analytical cost model in a relational federation. Thus BigDAWG must create black-box models to optimize its query plans. This gives rise to several interesting challenges in building the models that predict query execution performance in a multi-database system without resorting to hand-coded analytical cost models for each island-engine combination.
Polystores efficiently execute workloads comprising diverse queries by identifying “sweet spots” in the performance of the engines that power the federator. We are investigating how to formalize the performance characteristics of many distinctive engines to efficiently identify the strengths and weaknesses of their query processing capabilities. An important first step to solving this problem is finding the minimal set of queries that will enable the BigDAWG’s query optimizer and data placement facility to make decisions that result in high performance for a given query workload.
BigDAWG will monitor the performance of a query workload as it runs on one or more machines. The polystore will use this information to systematically break down a user-provided query workload, assigning queries to classes with a shared expected performance profile across engines. This profile denotes which databases will execute this query class efficiently and the ones to avoid for this part of the workload. To learn a query’s profile, we run it in expansive mode that is akin to the training phase of machine learning applications. This mode executes the query on all of the backends that support its semantics and the federator records its performance on each. Expansive execution may be done all at once – when the query is initially submitted by the user – or opportunistically when slack resources arise in individual databases. This profile is paired with a signature summarizing the query’s structure and data accessed for comparison with other queries it has monitored.
A polystore query optimizer faces novel challenges in enumerating query plans over many disparate backends and identifying the ones that will deliver the highest performance. Like traditional federated databases, BigDAWG’s optimizer will create a directed acyclic graph of operators for its execution plans. Planning polystore queries is more complicated than regular federated optimization because BigDAWG needs to support a group of engines with partially overlapping capabilities. Moreover, individual engines may offer dramatically different performance when they are able to run the same queries.
One key challenge in optimizing polystore queries is enumerating valid execution plans. In many cases it is possible to reformulate a query from one island to another such that the query results are unchanged, but the user realizes higher performance. In order to reap the advantages of the vertically integrated systems in a polystore as often as possible, the optimizer will maintain an equivalence class for each island operator, or a list of operators in other islands that have the same semantics. These equivalents are interchangeable and we use them to minimize data movement in addition to exploiting vertical integration. In our first BigDAWG release these mappings will be supplied by the user. It is an exciting future research direction to come up with techniques for automatically inferring cross-island operators that are equivalent.
When the BigDAWG optimizer receives a query, it first parses the query to extract its signature. The planner then compares the signature to ones it has seen before and assigns it to a predicted performance profile. BigDAWG then uses this profile paired with the presently available hardware resources on each database to assign the query to one or more databases. As the monitor accumulates measurements about the query’s performance, the optimizer incrementally refines its signature classification and performance estimates. To do this, we are investigating generalizing techniques in query performance prediction. The polystore setting is more challenging than the ones targeted in the literature because the optimizer needs to contend with multiple data models and distributed execution.
In conclusion, the BigDAWG polystore system is exposing many research opportunities in black-box system modeling and query optimization. By profiling and modeling polystore workloads we hope to learn general principles for building multi-data model systems.
ACM SIGMOD blog post, July 13, 2015: “The Case for Polystores”
ISTC for Big Data blog post, August 13, 2015: ISTC to Unveil New Big Data Federation Architecture at VLDB 2015
“The BigDAWG Polystore System.” Jennie Duggan, Aaron Elmore, Michael Stonebraker, Magda Balazinska, Bill Howe, Jeremy Kepner, Samuel Madden, Dave Maier, Tim Mattson, and Stan Zdonik. Sigmod Record, 44(3), 2015
“A Demonstration of the BigDAWG Polystore System.” Aaron Elmore, Jennie Duggan, Michael Stonebraker, Magda Balazinska, Ugur Cetintemel, Vijay Gadepally, Jeffrey Heer, Bill Howe, Jeremey Kepner, Tim Kraska, Sam Madden, David Maier, Timothy Mattson, Stavros Papadopoulos, Jeff Parkhurst, Nesime Tatbul, Manasi Vartak, Stan Zdonik. Proceedings of the VLDB Endowment, 8(12), August 2015