Our Research | Software

Here is a list of software we developed as part of the ISTC for Big Data (a five-year, Intel-backed project that ran 2012-2017). To learn more about any piece of software and download the code, please click on the link.

Featured Software

ModelDB

Get the code and documentation. Watch a video. Read a blog post.


BigDAWG is a polystore system that simplifies integration and analytics of disparate data at scale. The BigDAWG architecture consists of four distinct layers: database and storage engines; islands; middleware and API; and applications. The initial release supports three open-source database engines – PostgreSQL (SQL), Apache Accumulo (NoSQL), and SciDB (NewSQL) – along with support for relational, array, and text islands. It will interest anyone seeking a simpler way to use data that spans multiple data models and data stores, such as research analysts, data scientists, and database administrators. Read the blog post.
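Cross-island queries are submitted to the BigDAWG middleware as plain text, with island scopes (e.g., bdrel for the relational island) routing each subquery to the appropriate engine. A minimal Python sketch, assuming a typical local middleware deployment; the endpoint URL, port, and table names are assumptions:

```python
import urllib.request

def bigdawg_query(query: str,
                  endpoint: str = "http://localhost:8080/bigdawg/query") -> str:
    """POST a query to the BigDAWG middleware and return the raw result.
    The endpoint and port are assumptions for a local deployment."""
    req = urllib.request.Request(
        endpoint,
        data=query.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

# Island scopes tell the middleware which engine class handles each subquery.
# Table and column names below are illustrative.
relational_query = "bdrel(SELECT patient_id, name FROM patients WHERE age > 65)"
text_query = "bdtext(SELECT * FROM notes WHERE keyword = 'sepsis')"
```

Each scope keyword selects an island, and the middleware translates the enclosed query into the native language of the engine backing that island.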


BlinkDB is a massively parallel query engine that enables sub-second approximate queries on very large datasets. It is a collaboration between researchers at the University of California, Berkeley’s AMPLab and MIT CSAIL.


DataHub is an experimental hosted platform (GitHub-like) for organizing, managing, sharing, collaborating on, and making sense of data. It provides easy-to-use tools and interfaces for managing your data (ingestion, curation, sharing, collaboration); using others’ data (discovering, linking); and making sense of data (query, analytics, visualization).


GraphLab is a graph-based, high-performance, distributed computation framework written in C++. While GraphLab was originally developed for machine learning tasks, it has found great success in a broad range of other data-mining tasks, outperforming other abstractions by orders of magnitude.


Graphulo is a Java library for the Apache Accumulo database delivering server-side sparse matrix math primitives that enable higher-level graph algorithms and analytics.
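The core primitive Graphulo provides is sparse matrix multiplication executed inside Accumulo’s tablet servers, on top of which graph algorithms like breadth-first search can be expressed. The sketch below illustrates the math only, in plain Python over dict-of-dicts matrices; it is not Graphulo’s Java API:

```python
def spgemm(A, B):
    """Sparse matrix multiply C = A * B, with matrices stored as
    {row: {col: value}} dicts. Graphulo runs this kind of primitive
    server-side in Accumulo; this is only an illustration of the math."""
    C = {}
    for i, row in A.items():
        for k, a in row.items():
            for j, b in B.get(k, {}).items():
                C.setdefault(i, {})
                C[i][j] = C[i].get(j, 0) + a * b
    return C

# One BFS step: multiply a frontier vector (as a 1-row matrix) by the
# adjacency matrix to find the next frontier.
adj = {"a": {"b": 1, "c": 1}, "b": {"c": 1}}
frontier = {"v": {"a": 1}}
next_frontier = spgemm(frontier, adj)  # reaches b and c from a
```

Expressing graph traversal as matrix math is what lets the library push whole analytics into the database’s server-side iterators instead of shipping the graph to the client.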


H-Store is an experimental main-memory, parallel database management system that is optimized for on-line transaction processing (OLTP) applications. It is a highly distributed, row-store-based relational database that runs on a cluster of shared-nothing, main-memory executor nodes. The H-Store project is a collaboration between MIT, Brown University, Yale University, and HP Labs.

Anti-Caching Extension for H-Store

Anti-caching is a new architecture for on-line transaction processing (OLTP) systems that overcomes the restriction that data in a main-memory data management system must fit in main memory. An anti-caching prototype has been implemented using the H-Store main-memory, parallel database management system. The anti-caching project is a collaboration between Brown University and MIT.
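The mechanism can be sketched as follows: cold tuples are evicted from the in-memory table to a disk-resident block and replaced by tombstones, and a later access to a tombstone fetches the tuple back into memory. A minimal Python sketch; the class and method names are illustrative, not H-Store’s implementation (which also aborts and restarts the transaction after the fetch):

```python
from collections import OrderedDict

class AntiCache:
    """Toy anti-caching store: hot tuples live in an LRU-ordered in-memory
    table; overflow is evicted to a 'disk' block (a plain dict here) and
    marked with a tombstone. Reads hitting a tombstone un-evict the tuple."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = OrderedDict()  # key -> tuple, oldest first
        self.disk = {}               # anti-cached (evicted) tuples
        self.tombstones = set()

    def put(self, key, value):
        self.memory[key] = value
        self.memory.move_to_end(key)
        while len(self.memory) > self.capacity:
            cold_key, cold_val = self.memory.popitem(last=False)
            self.disk[cold_key] = cold_val       # evict coldest tuple
            self.tombstones.add(cold_key)

    def get(self, key):
        if key in self.memory:
            self.memory.move_to_end(key)         # mark as recently used
            return self.memory[key]
        if key in self.tombstones:               # un-evict from disk block
            self.tombstones.discard(key)
            value = self.disk.pop(key)
            self.put(key, value)
            return value
        raise KeyError(key)
```

Unlike a traditional buffer pool, memory here is the primary home of the data and disk holds only the evicted cold tail, which is the inversion the name "anti-caching" refers to.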


Julia is a fresh approach to technical computing that is increasingly used for big data algorithms because of its performance, breadth of functionality, and flexible development environment. Julia was brand new when the ISTC started and has become extremely popular, with its user base doubling roughly every nine months. Download the documentation and keep up with Julia on the JuliaCon Blog.


Myria is a big data stack that provides efficient data management and analytics capabilities using its own MyriaX shared-nothing query execution engine as well as other engines that it federates under a single query optimizer. Myria users express their analysis and data management through a combination of declarative queries and user-defined Python code. Myria is available as a cloud service that users can access through their browsers and Jupyter notebooks.


Companies often build hundreds of machine learning (ML) models a day (e.g., churn, recommendation, credit default). However, there is no practical way to manage all the models that are built over time. This lack of tooling leads to insights being lost, resources wasted on re-generating old results, and difficulty collaborating. ModelDB is an end-to-end system that tracks models as they are built, extracts and stores relevant metadata (e.g., hyperparameters, data sources) for models, and makes this data available for easy querying and visualization.
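The idea can be illustrated with a toy tracker; the class and field names below are illustrative, not ModelDB’s actual API:

```python
import time

class ModelRegistry:
    """Toy sketch of ModelDB-style tracking: record each model's metadata
    (hyperparameters, data source, metrics) so it can be queried later."""

    def __init__(self):
        self.models = []

    def log_model(self, name, hyperparameters, data_source, metrics):
        entry = {
            "name": name,
            "hyperparameters": hyperparameters,
            "data_source": data_source,
            "metrics": metrics,
            "timestamp": time.time(),
        }
        self.models.append(entry)
        return entry

    def best(self, metric):
        """Example query: which logged model scored highest on a metric?"""
        return max(self.models, key=lambda m: m["metrics"][metric])

registry = ModelRegistry()
registry.log_model("churn-v1", {"lr": 0.1}, "train.csv", {"auc": 0.71})
registry.log_model("churn-v2", {"lr": 0.01}, "train.csv", {"auc": 0.78})
```

Once every run is logged this way, questions like "which hyperparameters produced our best model last quarter?" become simple queries instead of archaeology.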


Today’s pricing models and SLAs are described at the level of compute resources (instance-hours or gigabytes processed). This makes it difficult for users to select a service, pick a configuration, and predict the actual analysis cost. To address this challenge, we propose a new abstraction, called a Personalized Service Level Agreement (PSLA), in which users are presented with what they can do with their data in terms of query capabilities, guaranteed query performance, and fixed hourly prices. The source code for generating these PSLAs with our PSLAManager is available.
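One way to picture a PSLA is as a set of tiers, each pairing a fixed hourly price with a set of supported query templates and a per-query runtime guarantee. A hypothetical sketch; the field names are not the PSLAManager’s actual schema:

```python
from dataclasses import dataclass

@dataclass
class PSLATier:
    """One tier of a Personalized Service Level Agreement: a fixed hourly
    price buys a set of query templates, each with a latency guarantee."""
    price_per_hour: float        # fixed hourly price, in dollars
    runtime_guarantee_s: float   # guaranteed per-query runtime
    query_templates: list        # query capabilities offered at this tier

def cheapest_tier(psla, template):
    """Pick the least expensive tier whose capabilities cover a template."""
    eligible = [t for t in psla if template in t.query_templates]
    return min(eligible, key=lambda t: t.price_per_hour) if eligible else None

psla = [
    PSLATier(10.0, 5.0, ["selections", "joins"]),
    PSLATier(25.0, 1.0, ["selections", "joins", "aggregates"]),
]
```

The user reasons only about these tiers ("aggregates in under a second for $25/hour"), while the service maps each tier back onto concrete cluster configurations behind the scenes.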

Perceptual Kernels for Automating Visualization Design

Perceptual kernels are distance matrices derived from aggregate perceptual similarity judgments. The kernels provide a useful operational model for incorporating empirical perception data directly into visualization design tools, enabling the creation of visualizations that better reflect patterned structures (relations) in data. To encourage the integration of perceptual kernels into visualization design tools, the Interactive Data Lab at the University of Washington has made its perceptual kernels and experiment source code publicly available.
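The derivation can be sketched as averaging the collected pairwise judgments into a symmetric distance matrix. The following illustrates only that aggregation step, not the lab’s experiment pipeline:

```python
def perceptual_kernel(judgments, n):
    """Aggregate pairwise dissimilarity judgments - tuples of
    (stimulus_i, stimulus_j, rating in [0, 1]) - into a symmetric
    n x n distance matrix by averaging the ratings for each pair."""
    sums = [[0.0] * n for _ in range(n)]
    counts = [[0] * n for _ in range(n)]
    for i, j, rating in judgments:
        for a, b in ((i, j), (j, i)):    # record symmetrically
            sums[a][b] += rating
            counts[a][b] += 1
    return [[sums[i][j] / counts[i][j] if counts[i][j] else 0.0
             for j in range(n)]
            for i in range(n)]

# Two participants rate the dissimilarity of stimuli 0 and 1.
kernel = perceptual_kernel([(0, 1, 0.2), (0, 1, 0.4)], n=2)
```

A design tool can then use the matrix entries as perceptual distances, for example choosing an encoding palette whose pairwise distances best match the structure of the data.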


PipeGen enables efficient, automatic data transfer between DBMSs for hybrid analytics. PipeGen targets data analytics workloads on shared-nothing engines and supports scenarios where users seek to perform different parts of an analysis in different DBMSs or want to combine and analyze data stored in different systems. Experiments show that PipeGen delivers speedups of up to 3.8x compared with manually exporting and importing data across systems using CSV. Read the latest blog post by the PipeGen team at the University of Washington.

ScaLAPACK for Phi

This is a dense linear algebra library, similar to LAPACK, but targeting heterogeneous/hybrid architectures such as the Intel Xeon Phi.

S-Store: A Streaming OLTP System for Big-Velocity Applications

S-Store is the world’s first streaming OLTP engine, which seeks to seamlessly combine online transaction processing with push-based stream processing for real-time applications. It includes an API based on Java and SQL, which facilitates the creation of dataflow graphs of computations operating over both streaming and stored datasets. It has been tested on real-life use cases, including real-time alert monitoring and streaming ETL.
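The dataflow-graph idea can be sketched in a few lines; the Python below is purely illustrative, since S-Store’s real API is Java- and SQL-based:

```python
class Dataflow:
    """Toy sketch of an S-Store-style dataflow graph: processing stages are
    chained so each tuple batch moves through them in order, with each stage
    standing in for one transactional step (a stored procedure in S-Store)."""

    def __init__(self):
        self.stages = []

    def then(self, procedure):
        self.stages.append(procedure)
        return self

    def run(self, batch):
        state = batch
        for proc in self.stages:  # each stage executes atomically in S-Store
            state = proc(state)
        return state

# A tiny alert-monitoring dataflow: filter hot readings, then emit alerts.
alerts = (Dataflow()
          .then(lambda xs: [x for x in xs if x["temp"] > 100])
          .then(lambda xs: [{"alert": x["sensor"]} for x in xs]))
```

In the real system each stage runs with full transactional guarantees over both the stream and any stored state it touches, which is what distinguishes streaming OLTP from a plain stream processor.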

TileDB and GenomicsDB

TileDB is an efficient data management system for scientific data, such as graphs, DNA sequences, matrices, maps, and imaging. It makes managing scientific data fast and easy. GenomicsDB is an adaptation of TileDB for genomics. Both have been released as open-source code (under the MIT license). TileDB is used by the Broad Institute, one of the world’s leading genomics institutes, as part of its production pipeline. Read the blog post.
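A core idea in TileDB’s design is partitioning arrays into fixed-size tiles, which become the unit of storage and I/O. A toy sketch of that partitioning in plain Python (not TileDB’s on-disk format):

```python
def tile_array(array, tile_rows, tile_cols):
    """Split a dense 2-D array (list of lists) into fixed-size tiles,
    keyed by tile index, with cell values keyed by their local offset.
    A toy illustration of tiling, not TileDB's actual storage layout."""
    tiles = {}
    for i, row in enumerate(array):
        for j, value in enumerate(row):
            tile_key = (i // tile_rows, j // tile_cols)
            local_key = (i % tile_rows, j % tile_cols)
            tiles.setdefault(tile_key, {})[local_key] = value
    return tiles

matrix = [[1, 2, 3, 4],
          [5, 6, 7, 8],
          [9, 10, 11, 12],
          [13, 14, 15, 16]]
tiles = tile_array(matrix, 2, 2)  # four 2x2 tiles
```

Storing and indexing at tile granularity means a query touching a small region of a huge array only reads the tiles that overlap it, which is what makes slicing large scientific arrays fast.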

Vega, A Visualization Grammar

Vega is a declarative format for creating, saving, and sharing visualization designs. With Vega, visualization designs are described in JSON and rendered as interactive views using either HTML5 Canvas or SVG. Vega offers a full declarative visualization grammar, suitable for expressive custom interactive visualization design and programmatic generation. Higher-order visualization tools built on Vega include Vega-Lite, Lyra, and Voyager.
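For illustration, a minimal bar-chart specification in Vega 3 syntax; the data values are made up:

```json
{
  "$schema": "https://vega.github.io/schema/vega/v3.json",
  "width": 300,
  "height": 200,
  "data": [
    {"name": "table",
     "values": [{"category": "A", "amount": 28},
                {"category": "B", "amount": 55}]}
  ],
  "scales": [
    {"name": "xscale", "type": "band", "range": "width", "padding": 0.1,
     "domain": {"data": "table", "field": "category"}},
    {"name": "yscale", "range": "height", "nice": true,
     "domain": {"data": "table", "field": "amount"}}
  ],
  "axes": [
    {"orient": "bottom", "scale": "xscale"},
    {"orient": "left", "scale": "yscale"}
  ],
  "marks": [
    {"type": "rect",
     "from": {"data": "table"},
     "encode": {
       "enter": {
         "x": {"scale": "xscale", "field": "category"},
         "width": {"scale": "xscale", "band": 1},
         "y": {"scale": "yscale", "field": "amount"},
         "y2": {"scale": "yscale", "value": 0}
       }
     }}
  ]
}
```

Everything about the chart, from the data binding to the scales, axes, and mark encodings, lives in this one JSON document, which is what makes Vega designs easy to save, share, and generate programmatically.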