Here is a continually updated list of software we have developed as a part of the ISTC for Big Data. To learn more about any piece of software and download the code, please click on the link.
Get the code.
This is a massively parallel query engine that enables sub-second approximate queries on very large data. This is a collaboration between researchers at the University of California, Berkeley’s AMPLab and MIT CSAIL.
DataHub is an experimental hosted platform (GitHub-like) for organizing, managing, sharing, collaborating on, and making sense of data. It provides easy-to-use tools and interfaces for managing your data (ingestion, curation, sharing, collaboration); using others’ data (discovering, linking); and making sense of data (query, analytics, visualization).
GraphLab is a graph-based, high-performance, distributed computation framework written in C++. While GraphLab was originally developed for machine learning tasks, it has found great success in a broad range of other data-mining tasks, outperforming other abstractions by orders of magnitude.
Graphulo is a Java library for the Apache Accumulo database delivering server-side sparse matrix math primitives that enable higher-level graph algorithms and analytics.
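To illustrate how sparse matrix primitives can express a graph algorithm, here is a minimal sketch of breadth-first search written as repeated sparse vector-matrix products. It is plain Python for illustration only, not Graphulo's Java API and not tied to Accumulo; the dict-based matrix layout and the names `sparse_vec_mat` and `bfs_levels` are assumptions made for this example.

```python
from collections import defaultdict


def sparse_vec_mat(vec, matrix):
    """Multiply a sparse row vector by a sparse matrix.

    vec is {row: value}; matrix is {row: {col: value}} (hypothetical layout).
    """
    result = defaultdict(int)
    for row, v in vec.items():
        for col, m in matrix.get(row, {}).items():
            result[col] += v * m
    return dict(result)


def bfs_levels(adjacency, source):
    """Return {vertex: hop distance from source} via repeated products."""
    levels = {source: 0}
    frontier = {source: 1}
    depth = 0
    while frontier:
        depth += 1
        # One product expands the frontier by exactly one hop.
        reached = sparse_vec_mat(frontier, adjacency)
        # Keep only vertices that have not been assigned a level yet.
        frontier = {v: 1 for v in reached if v not in levels}
        for v in frontier:
            levels[v] = depth
    return levels


# Small directed graph stored as a sparse adjacency matrix.
adjacency = {
    "a": {"b": 1, "c": 1},
    "b": {"d": 1},
    "c": {"d": 1},
    "d": {},
}
levels = bfs_levels(adjacency, "a")
```

The point of the sketch is that the traversal logic lives entirely in one reusable primitive, which is the kind of operation a database-side library can execute close to the data.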
H-Store is an experimental main-memory, parallel database management system that is optimized for on-line transaction processing (OLTP) applications. It is a highly distributed, row-store-based relational database that runs on a cluster of shared-nothing, main-memory executor nodes. The H-Store project is a collaboration between MIT, Brown University, Yale University, and HP Labs.
Anti-caching is a new architecture for on-line transaction processing (OLTP) systems that overcomes the restriction that data in a main-memory data management system must fit in main memory. An anti-caching prototype has been implemented using the H-Store main-memory, parallel database management system. The anti-caching project is a collaboration between Brown University and MIT.
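One way to picture the idea is a store that keeps hot records in memory and moves the coldest ones to disk once memory is full, bringing them back on access. The sketch below is a toy illustration under stated assumptions, not the H-Store prototype: it assumes a simple least-recently-used eviction policy, uses a plain dict as a stand-in for disk, and the class name `AntiCacheStore` is hypothetical.

```python
from collections import OrderedDict


class AntiCacheStore:
    """Toy key-value store: memory is primary, cold records spill to 'disk'."""

    def __init__(self, memory_capacity):
        self.memory_capacity = memory_capacity
        self.memory = OrderedDict()  # hot records, least recently used first
        self.disk = {}               # stand-in for an on-disk block store

    def put(self, key, value):
        self.disk.pop(key, None)     # a record lives in exactly one place
        self.memory[key] = value
        self.memory.move_to_end(key)
        self._evict_if_needed()

    def get(self, key):
        if key in self.memory:
            self.memory.move_to_end(key)   # mark as recently used
            return self.memory[key]
        if key in self.disk:
            value = self.disk.pop(key)     # fetch the cold record back
            self.memory[key] = value
            self._evict_if_needed()
            return value
        raise KeyError(key)

    def _evict_if_needed(self):
        # Move the coldest records out until memory is within capacity.
        while len(self.memory) > self.memory_capacity:
            cold_key, cold_value = self.memory.popitem(last=False)
            self.disk[cold_key] = cold_value


store = AntiCacheStore(memory_capacity=2)
for key, value in [("a", 1), ("b", 2), ("c", 3)]:
    store.put(key, value)
value = store.get("a")  # "a" was evicted earlier and is fetched back
```

Unlike a conventional disk-based system with a buffer cache, memory here is the primary home for data and disk only holds what has been pushed out, so the total data size can exceed the memory capacity.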
Julia is a fresh approach to technical computing that is increasingly used for big data algorithms because of its performance, broad functionality, and flexible development environment. Julia was brand new when the ISTC started and has since become extremely popular, with its number of users doubling every nine months. Download the documentation and keep up with Julia on the JuliaCon Blog.
Myria is a big data stack that provides efficient data management and analytics capabilities using its own MyriaX shared-nothing query execution engine as well as other engines that it federates under a single query optimizer. Myria users express their analysis and data management through a combination of declarative queries and user-defined Python code. Myria is available as a cloud service that users can access through their browsers and Jupyter notebooks.
Today’s pricing models and SLAs are described at the level of compute resources (instance-hours or gigabytes processed). This makes it difficult for users to select a service, pick a configuration, and predict the actual analysis cost. To address this challenge, we propose a new abstraction, called a Personalized Service Level Agreement (PSLA), where users are presented with what they can do with their data in terms of query capabilities, guaranteed query performance, and fixed hourly prices. The source code for generating these PSLAs with our PSLAManager is available here.
Perceptual kernels are distance matrices derived from aggregate perceptual similarity judgments. The kernels provide a useful operational model for incorporating empirical perception data directly into visualization design tools, enabling the creation of visualizations that better reflect patterned structures (relations) in data. To encourage the integration of perceptual kernels into visualization design tools, the Interactive Data Lab at the University of Washington has made its perceptual kernels and experiment source code publicly available.
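As a rough sketch of what "a distance matrix derived from aggregate similarity judgments" can look like, the example below averages pairwise similarity ratings across subjects and converts them to distances. The rating scale, the `1 - similarity / max_rating` conversion, and the function name `perceptual_kernel` are assumptions chosen for illustration; they are not the lab's published procedure.

```python
def perceptual_kernel(items, judgments, max_rating):
    """Build a symmetric distance matrix from pairwise similarity ratings.

    judgments is a list of (item_a, item_b, rating) tuples, where rating
    ranges from 0 (dissimilar) to max_rating (identical). Pairs that were
    never judged are left as None.
    """
    n = len(items)
    index = {item: i for i, item in enumerate(items)}
    totals = [[0.0] * n for _ in range(n)]
    counts = [[0] * n for _ in range(n)]
    for a, b, rating in judgments:
        i, j = index[a], index[b]
        # Record the judgment in both directions to keep the matrix symmetric.
        for r, c in ((i, j), (j, i)):
            totals[r][c] += rating
            counts[r][c] += 1
    kernel = [[0.0 if i == j else None for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and counts[i][j]:
                mean_similarity = totals[i][j] / counts[i][j]
                kernel[i][j] = 1.0 - mean_similarity / max_rating
    return kernel


# Hypothetical ratings of shape pairs on a 0-4 similarity scale.
shapes = ["circle", "square", "triangle"]
ratings = [
    ("circle", "square", 4), ("circle", "square", 2),
    ("circle", "triangle", 1), ("triangle", "circle", 1),
    ("square", "triangle", 2),
]
kernel = perceptual_kernel(shapes, ratings, max_rating=4)
```

A design tool could then consult such a matrix to pick visual encodings (shapes, colors, sizes) that are perceptually far apart for values that should be easy to tell apart.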
PipeGen enables efficient, automatic data transfer between DBMSs for hybrid analytics. PipeGen targets data analytics workloads on shared-nothing engines, and supports scenarios where users seek to perform different parts of an analysis in different DBMSs or want to combine and analyze data stored in different systems. Experiments show that PipeGen delivers speedups of up to 3.8x compared with manually exporting and importing data across systems using CSV. Read the latest blog post by the PipeGen team at the University of Washington.
This is a dense linear algebra library similar to LAPACK, but designed for heterogeneous/hybrid architectures such as the Intel Xeon Phi.
TileDB and GenomicsDB
TileDB is a novel, efficient data management system for scientific data, such as graphs, DNA sequences, matrices, maps, and imaging. It makes managing scientific data fast and easy. GenomicsDB is an adaptation of TileDB for genomics. Both have been released as open source code (under the MIT license). TileDB is used by the Broad Institute, one of the most important genomics institutes in the world, as part of its production pipeline. Read the blog post.