By Dylan Hutchison, Bill Howe, Dan Suciu, and Zachary Tatlock, University of Washington

There has been a “cambrian explosion” of systems and languages for large-scale data analytics: Postgres and H-Store accept SQL queries; Datomic and Myria accept Datalog; SciDB accepts AQL and AFL; Lyra and Voyager accept Vega; Grappa and CombBLAS accept C++ API calls; and Accumulo and Spark accept Java API calls.

It’s clear that “one size does not fit all”: no one system will support every data type and every style of computation. On the other hand, individual tasks may increasingly make use of multiple kinds of data and computation. The MIMIC medical data set, for example, includes high-frequency blood pressure time series measurements as well as relatively static patient demographic and dietary data. These data are difficult to process on any single system due to their heterogeneity, yet analyzing these data together is critical to answering questions such as understanding how fluid intake affects blood pressure. *Polystores* are environments with access to multiple systems useful for different kinds of data processing. Insulating application programmers from the heterogeneity of polystores while exploiting the specialization opportunities offered by their underlying systems is a key goal for the ISTC for Big Data.

Insulating application programmers from the heterogeneity of polystores while exploiting the specialization opportunities offered by their underlying systems is a key goal for the ISTC for Big Data

The process of **polystore optimization** is to synthesize an effective multi-system execution plan from a given query.

In this post we consider a representation (Program Expression Graphs, or PEGs) for polystore execution plans that naturally affords an algorithm (Equality Saturation) for searching through equivalent execution plans across data models. ^{[1]} Check out this February blog post for another perspective on polystore optimization, including an approach for learning cost models via black-box techniques. See Surajit Chaudhuri’s PODS paper for an overview of single-system cost-based relational database optimization.

**Why Use Equality Saturation for Polystore Optimization?**

In the context of the RACO polystore system, we have built and deployed a conventional rule-based optimizer. But as we continue to add new backends, new rules and new applications, we are reaching the limits of this initial design. Polystore query processing requires reasoning across multiple data models, which means more operations, more rules and more control flow logic to orchestrate execution (e.g., iteration). As a result of this complexity, the software engineering required to maintain a rule base becomes a significant challenge.

We are experimenting with a more principled approach to optimizer design, borrowing ideas from the compiler community. In particular, database optimizers in practice heavily depend on heuristics (e.g., left-deep plans) and rule order interactions. PEGs offer a compact representation of a set of plans that cleanly separate plan generation from plan selection and avoid rule ordering issues.

We further anticipate the structure behind Equivalence PEGs (E-PEGs) will aid rule debugging, which is important in a polystore environment where many different systems (and programmers) interact with a common rule base. “Why did the optimizer decide on this plan?” and “What other plans did the optimizer consider?” are questions that an E-PEG representation may help answer.

** Representing Equivalent Computation with E-PEGS**

The first step toward polystore optimization is to find a representation of computation within and across the underlying systems. We start by modeling computation as a rooted expression tree. The tree’s leaves are input objects, such as relational tables or files. The tree’s internal nodes are operators whose children are arguments. Program Expression Graphs (PEGs) are a class of expression trees^{[2]} we further explore here; see below for several PEG examples.

Relational database optimizers generate equivalent computations by applying rules to rewrite parts of the tree. An optimizer may recognize a relational select on keys in the group by operator of an aggregation that can be pushed down below the group by, for example, and so the optimizer may choose to switch the order of the select and aggregate operators in an expression tree. Rewriting trees in this way typically depends on the order rules are applied.

We adopt the idea of Equivalence-PEGs (E-PEGs) to avoid rule ordering and related problems. When a rule deems two PEGs equivalent in the sense that they output the same object, we marry their roots into an *equivalence class* maintained inside an E-PEG. By induction we see that E-PEGs encode any number of PEGs, grouping every operation and object with equivalent PEG subtrees inside equivalence classes.

We also appreciate E-PEGs because they admit algorithms for efficient pattern matching (the Rete algorithm) and global cost-based optimization (conversion to integer linear programming). Efficient pattern matching means efficient recognition of the conditions to fire an equivalence rule among the large library of rules polystores require. Global optimization is an alternative to the heuristics that enable local dynamic programming-based optimization. While many optimizers use such heuristics to reduce the search space, we anticipate they will become unmanageable in the context of polystores, and so we are exploring E-PEGs as a cleaner approach to organize a complex optimization process.

The pattern matching and global optimization are part of a procedure called **equality saturation**: to “saturate” an E-PEG with more and more equivalent PEG subtrees by firing equivalence rules while they apply to parts of the E-PEG and extract the best PEG found. See Tate et al’s POPL paper for more details on equality saturation in the context of compilers for general purpose languages.

**An Example of E-PEGs in Action**

Suppose we have data redundantly stored in two systems: a relational table T with attributes ‘row’, ‘col’ and ‘val’ in Postgres and a corresponding matrix M in CombBLAS. A user may write the relational query

select sum(val) from (select sum(val) from T group by row)

and expect a polystore optimizer to find the best way to execute it on either Postgres, CombBLAS, or a combination of the two. A parser interprets the query as the initial plan:

One way an optimizer can rewrite the initial plan is by combining the aggregations:

Another way to rewrite the initial plan is to recognize that each aggregation is equivalent to a one-dimensional matrix reduction, implying that we can obtain the same answer by reducing M through CombBLAS’s Reduce function:

For many database optimizers, the order in which we apply the two rewrites above determines the outcome of optimization. If we apply the rule combining relational aggregation first, then the resulting plan has no clear rewrite to CombBLAS operations because CombBLAS only has functions that reduce matrices one dimension at a time, whereas the combined aggregation reduces two dimensions at once. If we apply the rule rewriting relational aggregation as matrix reduction first, then the resulting plan has no clear rewrite to a combined aggregation.

An optimizer using the equality saturation algorithm generates the more flexible E-PEG shown below. Dashed boxes group subtrees with the same output in equivalence classes. The final optimization step is to select a particular PEG inside the E-PEG for execution by choosing members of equivalence classes and their children. We make these selections relative to a cost model.

While this example may be small and it ignores details such as data format and transfer, it demonstrates how E-PEGs avoid the rule ordering problem by considering all equivalent plans at once through the use of equivalence classes. The decision on which plan to choose among the many options inside an E-PEG depends on cost alone, not rule order engineering.

To recap, equality saturation on E-PEGs is a candidate to solve the “enumerate equivalent plans” portion of polystore optimization. Two remaining challenges are to create a body of equivalence rules covering the landscape of data models in a polystore, and to create a cost model flexible enough to compare operations from any system in a polystore. We discuss the former challenge in more detail below.

**Equivalence Rules**

We encode equivalence between two PEGs via equivalence rules acting like two PEG patterns on each side of an equality symbol. Given an E-PEG, when we recognize a PEG matching the pattern on one side of the equality, we “fire” the rule by adding the PEG on the other side of the equality to the E-PEG.

Equivalence rules have a few flavors:

*Domain rules*within one data model, such as combining aggregation in our example above. Many domain rules come from identities in various branches of mathematics like relational algebra and abstract algebra. Properties such as commutativity, idempotency, and distributivity play a large role.*Translation rules*between two data models, such as equating matrix reduction with relational aggregation in our example above. These rules are generally not one-to-one; a matrix multiply, for example, has the same semantics as a join, apply, and aggregation with group by.*Implementation rules*are a subclass of translation rules that equate physical and logical operators.*Structural rules*for special operators that exist in every data model, such as loop variables and conditional branches. See the equality saturation paper for more detail.

A few more words on translation rules. The set of rules we gather must be *sound*─preserving the semantics of computation across each equivalence─but need not be *complete*─having a mapping to every operation in a data model. As a natural consequence of the specialization of data processing systems, we expect some data models have operations that do not equate to any combination of operations from other data models. Users can still call such an operation by writing it directly in their input scripts. This “trap door” allows the user to call operations the optimizer knows nothing about.

Translation rules may include data transfer and conversion operators. For example, CSV files broker data transfer between relational tables in MyriaX and SpParMat objects in CombBLAS by including a Myria export operator and a CombBLAS ReadDistribute function call. Since most all systems support export to and import from CSV files, the file system is an easy, naive way to transfer data between systems. Other projects in the ISTC for Big Data, including Brandon Haynes’ PipeGen and Adam Dziedzic’s Portage, are exploring more efficient data transfer methods such as synthesizing sockets for direct binary transfer.

**Writing Equivalence Rules**

Writing many equivalence rules is a crucial step toward a viable E-PEG approach because the set of known rules directly determines the space of equivalent computations we consider. One important source of rules is manually writing them. Indeed we plan to manually write many rules, but we recognize that manual methods will not scale to increasingly many data models.

Another source of rules is transitive closure: to compose existing rules into a new rule. In fact, the equality saturation procedure already takes the transitive closure by continuously applying more and more rules. Encoding a “super rule” that is the composition of a million rule applications allows us to realize their final equivalence in a single step. Super rules are not necessary for theoretical correctness, but we predict they will be crucial for engineering efficient saturation. On the other hand, we cannot store every composition of rules because the number of compositions grows combinatorially or even infinitely for rules that add extra operations or constants. Careful control over our rule-set and possibly the rule-application process is essential.

Theorem provers are a third source for rules. Automated theorem provers conclude rules by way of SMT, resolution and other strategies. Interactive theorem provers can prove more complex equalities at the expense of human involvement. Both classes of theorem proving open a door toward verified polystore optimization, a tantalizing proposition we will defer until we’ve pushed manually-written rules as far as we can.

A fourth source for rules is random testing: guess equivalence rules and test them on generated data. If such a rule holds in the face of many random trials and carefully constructed corner cases, then the rule holds for all data with good probability. For more background, see the QuickSpec tool or reach out to Zuohao She, a student at Northwestern University driving the Translating Equivalent Database Implementations (TEDI) project.

**The Role of Logical Data Models**

Writing translation rules between systems’ physical operators is difficult in general. A MyriaSymmetricHashJoin, for example, requires its two arguments to be hashed on joined attributes using the same hash function. What would a rule relating it to the BlasMult_AnXBn_Synch call of CombBLAS look like?

Instead of directly translating physical operators, we decompose their translation into the composition of rules that translate through logical data models. Using a logical approach, we equate MyriaSymmetricHashJoin with relational algebra operators, then relational algebra with a GraphBLAS-style array algebra, then array algebra with CombBLAS function calls. Writing rules for each stage of translation takes less effort than direct physical translation and permits rule reuse─using rules between logical data models to facilitate translation between many related physical systems.

Unfortunately, some rules are difficult to write even between logical data models like the relational algebra, array algebra, and key-value algebra used by many NoSQL databases. To further ease the burden of translation, we are developing a new logical data model called Lara which underlies and unifies relational, array and key-value algebras in order to facilitate translation between them and ultimately between their governed physical systems. We defer Lara’s details to a future blog post. The curious reader may peek at a working draft on Lara’s design in Arxiv.

**All Systems in One and One System for All**

Big data analytics demand multi-system computation. We propose representing computation with Program Expression Graphs and using equivalences to search for and represent different implementations of a given computation. We highlight candidate approaches to generate equivalence rules and emphasize the role of logical data models in facilitating translation between systems. Adding a cost model, perhaps learned through black-box techniques, is the final step to selecting a computation for execution.

Today we traverse the frontier of polystore optimization. Tomorrow we will unite the best data processing systems under a gloriously federated banner.

^{[1]} The term data model refers to an algebra: a set of objects and operations on those objects. The metaphor island from the BigDAWG architecture is also synonymous, though we emphasize islands are not as isolated as the metaphor implies. Below we show how translation rules bridge islands.

^{[2]} PEGs contain loops for iterative computation.

*Acknowledgements*

Thanks to Brandon Myers of the University of Washington and Jennie Duggan of Northwestern University for their feedback on a draft for this blog post.