Myria: Big Data Management as a Service

By Magdalena Balazinska, University of Washington

Over the past year, the University of Washington Database Group has developed a new engine for managing Big Data. The system, called Myria, has been tested on 100-node Amazon EC2 deployments and on data from domain sciences including astronomy, as well as from standard benchmarks and from social media sources.

So, why another Big Data management system?

The primary reason for building Myria is our discontent with the performance of Hadoop. We spent the last few years working with the Hadoop stack (see our Nuage project) but, as is widely recognized (see this recent blog post, for example), the core Hadoop engine has
significant performance limitations.

The second reason ― and really the key reason ―  is to build our own platform for Big Data Management research. Our concrete goal is to build a “Big Data Management Service” that meets the needs of today’s users, especially in domain sciences, and to understand what it takes to operate such a service in the Cloud in a manner that is cost-effective and intuitive
for the provider and the users. We plan to deploy a permanent, public Myria service that will enable scientists to analyze and query big data in the browser without installing any software.

What is Myria’s value proposition?

Myria’s core design is similar to many state-of-the-art systems:

  • Myria assumes a structured data model (i.e., records with typed
    fields) and exposes a declarative query language interface. It can
    process data stored in any type of storage layer as long as it is
    given an appropriate wrapper access method. However, Myria favors the
    use of a set of independent relational database management system
    (DBMS) instances for data storage, which provide efficient data
    organization and, most importantly, the ability to index the stored
    data. Myria pushes query processing as much as possible to this
    underlying distributed DBMS layer.
  • On top of this relational storage layer, Myria implements an
    elastically scalable and fault-tolerant dataflow processing engine,
    where compute and storage nodes can be added or removed
    dynamically. Queries take the form of graphs of operators (cycles are
    possible). The engine distributes queries across compute nodes, which
    execute their query fragments and exchange data through distributed
    shuffle operators. A single coordinator monitors the execution of all
    queries in the cluster. An important emphasis of the Myria query
    execution engine is pipelined and non-blocking processing. The engine
    strives to avoid any kind of blocking or synchronization between
    operators. Additionally, similar to recently introduced engines such
    as Spark, Myria enables queries to execute against either
    disk-resident data or data cached in the distributed memory of
    machines in the cluster.

We plan to deploy a permanent, public Myria service that will enable scientists to analyze and query big data in the browser without installing any software.

Myria, however, has several interesting new features that go beyond the basic state-of-the-art:

Query capabilities: Myria’s goal is to provide users with an engine specialized for today’s Big Data management and analytics needs, which include the following fundamental query capabilities: (1) Relational algebra operations, because they capture a large fraction of basic data management needs; (2) User-defined functions (UDFs) that can implement more complex and domain-specific data transformations; (3) Iterative queries necessary for graph algorithms, machine learning applications, data cleaning, and other analysis critical for today’s users. Myria offers these capabilities behind a declarative query interface based on Datalog, since that language makes it easy to compose simple rules into sophisticated iterative programs.

Query optimizer: Through this Datalog-based language, Myria provides a common interface to multiple back-end big data systems, including but not limited to Myria’s execution engine. In particular, Myria’s front-end query compiler can generate either parallel data flows to run on Myria’s query engine or local, in-memory C++ computations. The idea is to use a single system to handle the cross-scale of data processing needs from data that can fit in the memory of one server to data that requires multiple nodes to process.

Core execution engine: Myria’s query execution layer is built on a new design that assumes its resources can either fail or be pre-empted by a global cluster scheduler. Modern Big Data systems execute in clouds shared with a variety of other systems. For example, companies frequently run a parallel database system, Hadoop, a graph processing system, and other engines at the same time. To minimize costs, these engines should effectively share cluster resources rather than requiring separate deployments. Myria embraces this model. It includes lightweight failure-handling capabilities that enable it to scale out when resources are available and scale down when contention arises. All this in the context of efficient, non-blocking processing.

Big Data service: Myria also puts a significant emphasis on its interface to users. For applications, Myria can be programmed and managed completely through a rich REST interface, allowing applications in any language and on any platform to exploit Myria’s capabilities. For end-users, all interactions with the engine are through a Web-based interface. Users need not worry about how many instances they are running. Users only reason about the combination of (1) query capabilities, (2) performance, and (3) hourly price. Similarly, the Myria provider need not worry how many instances to reserve for the service, as the services automatically scale up and down with demand.

Preliminary results demonstrate that Myria is competitive with the best state-of-the-art systems. More information about the Myria engine will be available through our project web site: http://myria.cs.washington.edu

This entry was posted in Big Data Architecture, Data Management, DBMS, ISTC for Big Data Blog and tagged , , , . Bookmark the permalink.

Leave A Reply

Your email address will not be published. Required fields are marked *


4 × = eight