Tupleware: An Inside Look

By Eugene Wu, PhD Candidate, MIT CSAIL

At the recent ISTC for Big Data annual Research Retreat, Tim Kraska of Brown University presented an in-depth look at Tupleware, an in-memory, read-only distributed computation framework that supports multiple languages.

Typical “cloud” data processing frameworks are designed for clusters containing thousands of nodes. In practice, however, enterprise deployments contain tens of nodes. At this scale, traditional cluster problems such as fault tolerance largely disappear. In addition, enterprise users typically work in a variety of languages (e.g., R, C++, Python), so a framework that efficiently supports, and optimizes for, these languages would be desirable.

The Tupleware architecture is built on top of LLVM; thus it naturally supports any language that compiles through LLVM (e.g., C/C++, Python, Julia, R). In addition, there is optimization support for user-defined functions (UDFs).

The programming model is very similar to that of Spark. It consists of the notion of a “tupleset,” which represents a source dataset, and operations over a tupleset that are expressed as algebraic monads. The following code snippet computes a version of K-means in Tupleware:

ts = tupleset("distances.csv")

ts' = ts.map(distance)
        .map(min)
        .combine(reassign)
        .reduce(recompute)
        .loop(converged).eval()
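For readers unfamiliar with the operator names, the same dataflow can be sketched in plain Python. This is an illustrative approximation only: `distance`, `reassign`, and `recompute` here are stand-ins for the UDFs above, not Tupleware's actual API.

```python
# Minimal K-means sketch mirroring the Tupleware pipeline:
# map(distance) -> map(min) -> combine(reassign) -> reduce(recompute),
# looped until convergence. Pure Python; not the real Tupleware API.

def kmeans(points, centroids, max_iters=100, tol=1e-6):
    for _ in range(max_iters):
        # map(distance) + map(min): assign each point to its nearest centroid
        assignments = [
            min(range(len(centroids)), key=lambda j: (p - centroids[j]) ** 2)
            for p in points
        ]
        # combine(reassign) + reduce(recompute): recompute centroid means
        new_centroids = []
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            new_centroids.append(sum(members) / len(members) if members else centroids[j])
        # loop(converged): stop once the centroids stabilize
        if all(abs(a - b) < tol for a, b in zip(centroids, new_centroids)):
            return new_centroids
        centroids = new_centroids
    return centroids
```

For example, `kmeans([1.0, 1.1, 9.0, 9.1], [0.0, 10.0])` converges to centroids near 1.05 and 9.05.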

Functions such as distance and reassign can be arbitrary user-defined functions. The innovation of the framework comes from its suite of four classes of optimizations that co-optimize the general query plan with the UDFs:

  1. Classic query optimizations such as predicate push down and join implementation optimization.
  2. Classic compiler optimization performed by LLVM, such as SIMD vectorization.
  3. Novel compiler optimizations that would in principle be possible in traditional compilers but become tractable only with query-level semantics; for example, determining the appropriate pipelining strategy.
  4. Combined query/compiler optimizations that are only possible given additional statistics about the data (e.g., cardinality estimation).
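As a concrete illustration of the first class, predicate pushdown moves a filter below an earlier operator whenever the predicate does not depend on that operator's output. The toy sketch below is illustrative only, not Tupleware's optimizer; `expensive_map` is a hypothetical transform.

```python
# Toy predicate pushdown: filtering before an expensive map touches
# fewer tuples yet yields the same result, provided the predicate
# reads only fields the map leaves untouched (here, "id").

def expensive_map(t):
    return {"id": t["id"], "score": t["val"] * 2}  # hypothetical transform

data = [{"id": i, "val": i} for i in range(10)]

# Unoptimized plan: map every tuple, then filter on id.
naive = [t for t in map(expensive_map, data) if t["id"] % 2 == 0]

# Pushed-down plan: filter on id first, then map only the survivors.
pushed = list(map(expensive_map, (t for t in data if t["id"] % 2 == 0)))

assert naive == pushed  # identical results, half the map calls
```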

As an example of a combined optimization, consider the following typical execution strategies:

  1. Volcano model: pull-based iterator execution model.
  2. Pipelined model: execute multiple operators over the same vector of data until a blocking operator. This approach disables vectorization.
  3. Operator-based model: execute each operator to completion over all of the data. This approach allows vectorization, but disables pipelining.

Tupleware can select from the above suite of execution models depending on statistics such as whether the query is CPU- or memory-bound.
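The trade-off between the second and third strategies can be sketched in a few lines of Python. This is a schematic contrast of the two execution styles, not Tupleware's actual runtime.

```python
# Operator-at-a-time vs. pipelined execution of the same two
# operators (a map, then a filter). Illustrative sketch only.

def op_at_a_time(data, f, pred):
    # Each operator runs to completion, materializing an intermediate
    # list: amenable to vectorization, but no pipelining.
    mapped = [f(x) for x in data]          # operator 1 over all data
    return [y for y in mapped if pred(y)]  # operator 2 over all data

def pipelined(data, f, pred):
    # Each tuple flows through both operators before the next tuple
    # starts: no intermediate materialization, but no vectorization.
    out = []
    for x in data:
        y = f(x)
        if pred(y):
            out.append(y)
    return out
```

Both produce identical results; they differ only in memory traffic and in which hardware optimizations apply, which is precisely the statistic-driven choice described above.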

Current scaling results have been surprisingly good. In the Brown research team’s experiments, they scale the system from 1 to 100 nodes and test weak scaling (the data grows in proportion with the number of nodes) and strong scaling (the dataset stays fixed). In the former, performance stays flat; in the latter, the speedup scales linearly with the number of nodes.
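To make the two regimes concrete, here is a small numeric illustration with made-up timings (these are not the Brown team's measurements):

```python
# Hypothetical timings illustrating the two scaling regimes.

def speedup(t1, tn):
    # Ratio of single-node runtime to n-node runtime.
    return t1 / tn

# Weak scaling: data grows with the node count, so ideal runtime is
# flat and the ratio stays near 1 (e.g., 100s on 1 node vs. 102s on
# 10 nodes with 10x the data).
weak = speedup(100.0, 102.0)

# Strong scaling: the dataset is fixed, so the ideal speedup on
# n nodes approaches n (e.g., 100s down to 10.5s on 10 nodes).
strong = speedup(100.0, 10.5)
```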

Tim Kraska expects performance to fall off when the system is scaled to much larger clusters; however, the results are promising for the targeted deployment sizes of small to medium clusters.

Additional Reading:

Tupleware Project Page at Brown University

Brown CS Blog

“Tupleware: Redefining Modern Analytics,” ISTC for Big Data Blog, April 24, 2014
