There has been a “Cambrian explosion” of big data systems proposed and evaluated in the last few years, but relatively little understanding of how these systems or the ideas they represent compare and complement one another.
In enterprise and science situations, “one size is unlikely to fit all.” We see analytics teams running multiple systems simultaneously. However, the highest level of abstraction for interoperability achieved in practice is basically at the file system (e.g., HDFS).
At the same time, there has been some convergence in this area in terms of higher-level data models (relations, arrays) and higher-level computational models (relational algebra, parallel data-flow, iteration, linear algebra).
So the design space seems narrower than the implementation space, suggesting an opportunity to build a common “complexity-hiding” interface to all these seemingly disparate systems to make them easier to compare, easier to use together, and improve overall performance.
We are exploring a common programming model for big data systems, subscribing to three design principles:
- Algebra at the core. We’re less interested in ad hoc engineering solutions that bridge various systems than in identifying and surfacing the primitives, operators, algorithms, and optimization opportunities they share. At the same time, we want to avoid “regressing to the mean” and destroying any competitive advantages or unique capabilities each system may offer.
- Parallel at the core. We’re less interested in general purpose (serial) programming than in “intrinsically parallel” abstractions. For example, “calling out to R” as an intermediate step is not going to work.
- Iteration at the core. We need to support multi-pass algorithms to express analytics tasks; first-order queries aren’t enough.
So what properties should a “good” algebra have? We see three:
- It should be expressive – can we capture the computations needed by the audiences we wish to serve?
- It should preserve intent – mappings from the surface language syntax shouldn’t be so convoluted as to be impossible to recognize or require many person-years to implement.
- It should be analyzable – the operator definitions should be simple enough to expose equivalences and afford transformation rules; we need to reason about expressions, not just implement them directly.
Motivated by these ideas, we are working to expand the University of Washington’s MyriaQ system as a shared algebra, programming model, and optimizer for multiple back-end systems, exploring at least the following extensions:
In addition to support for iterative relational algebra models and support for reasoning about related data models, including graphs, arrays and matrices:
- Support for compilation to other fast back-end systems, supplementing our own back-end system MyriaX with support for SciDB’s general arrays, linear algebra libraries for fast matrix computations, and a parallel C++ platform called Grappa for graph processing and other irregular (i.e., skewed) applications.
- Support for new surface languages (AFL) and language embeddings (R, Python)
- Support for new system-to-system parallel data exchange operators between, in particular, SciDB and Myria
- Support for cross-platform execution and optimization for a single query
Motivation via analogy to LINQ
LINQ represents a comparable point in the design space, and it is perhaps instructive to compare our vision to that of LINQ.
The power of the LINQ architecture is that it is indeed “algebra at the core,” with an algebra that standardizes manipulation of ordered collections. While there are various query syntaxes for particular languages (C#, VB, JS, etc.), a LINQ query is really just a sequence of methods (i.e., operators) against collections. Moreover, execution of those methods in the language runtime merely constructs an algebraic representation of the query (rather than actually doing the evaluation), which can then be shipped to any LINQ provider that implements the set of methods for evaluation.
Good things about LINQ
1. It has a fairly small algebra – around 22 groups of operators, plus some conversion methods to get collections of specific types.
2. It supports “in-line” user-defined functions (UDFs). Many of the operators are higher-order. That is, one or more arguments may be a function. In at least the C# embedding, such a function can be an anonymous method defined in line using essentially lambda notation. You don’t have to add UDFs through some registration mechanism, and they don’t have to be written in the implementation language of the data manager. This facility depends on any LINQ provider having access to the Microsoft CLR for executing functions.
3. The intermediate form is an expression tree, providing a nice structure for optimizers, schedulers, interpreters, etc., to manipulate. You don’t have the awkwardness of constructing and sending language strings (or worse, sequences of commands) to the data manager when writing applications.
4. Since the SQL-like syntax is actually just a language veneer over the actual operators, the semantics of the language are reasonably precise: compositions of collection operators. Contrast this situation to actual SQL where the semantics (or lack of them) are defined against the surface syntax, often with remarkable lack of clarity.
5. Having the collection algebra being the actual standard makes it relatively easy to construct a LINQ-provider API over an existing data manager. There are over two dozen LINQ providers currently.
6. Result of a query is a collection – you don’t need a cursor mechanism to work further with the answer.
7. There is a low impedance mismatch between queries and the host language. At least in the .NET ecosystem, there is a common type system. Thus you can have compile-time checking of LINQ queries, and good IDE support for queries. (The latter may depend on the IDE being able to access the schema of the data source you plan to query.)
Limitations of LINQ for Big Data
1. The algebra is based on an (ordered) collection model. So while it can work with 1-D vectors to some extent, it can’t do anything very sophisticated with multi-D arrays or graphs. A LINQ provider over an array or graph data manager would only be able to expose its contents as collections of elements.
2. The only iteration is inside the operators. If you want repeated execution of some algebra expression, it has to happen in the host language.
3. Data flow = control flow. If you want to use two LINQ providers in an application, any flow of information between them has to happen in the host language. You can’t specify direction transmission or pipelining between different data managers.
4. LINQ is not “parallel at the core.” At the algebra level, there are transformations and optimizations related to parallel processing that transcend any one particular back-end system. In particular, it’s valuable to have a language/algebra layer that can reason about how data is partitioned and the number of shuffle steps required independently of which system(s) will execute the query, and to inform the programmer / algorithm designer of the resulting tradeoffs.
5. No standard way of reasoning about the execution of the query back to the client ─ monitoring, debugging, exception handling, performance profiling. These realities motivate a richer back-end standard API than just “query” and “response.”
6. .NET is of course a proprietary framework and requires use of .NET languages.
This discussion suggests that working on an “algebra at the core” system for big data that overcomes these limitations is a good direction to move in. The UW MyriaQ system is on this path, with progress on 2, 4, 5 and 6. We anticipate that dealing with 1 requires a broader algebra, perhaps with “standard” and “optional” parts. Handling 3 means orchestrating interaction between data managers without running all the data through the host application; here again MyriaQ has recently included support for federated queries and the orchestration primitives these imply.
Myria: Big Data Management as a Service
Myria is a hosted Big Data management and analytics service consisting of three components:
- MyriaX: A big-data execution engine emphasizing asynchronous iterative processing and state-of-the-art algorithms.
- MyriaQ: A language translation layer, shared optimizer, and query execution coordinator supporting a) multiple input languages and b) multiple back-end systems, of which MyriaX is one. We also support a new imperative, iterative, relational language called MyriaL.
- MyriaWeb: A web-based editor and IDE designed for direct use by analysts, with support for collaborative editing through a shared workspace, visual performance analysis and debugging, and interactive algorithm design rather than just “fire and forget” query submission. (Figure 1)
In the context of this effort, we are working first on extending the existing MyriaQ system. Figure 2 illustrates the components we are actively developing.
* Andrew Whitaker is a Postdoc in the Department of Computer Science and Engineering at the University of Washington. He graduated with a BS in Computer Science from Indiana University in 1999, and a Ph.D. in Computer Science from the University of Washington in 2005. His current interests include operating systems, networking, and computer science education. He is also exploring the applications of computer technology for developing regions.