VLDB 2013: ISTC Faculty Members to Present Keynote and Five Papers

ISTC for Big Data faculty members and their students will present five papers at the 39th International Conference on Very Large Data Bases, August 26 to 30, 2013, in Riva del Garda, Trento, Italy. In addition, ISTC for Big Data co-director Sam Madden will deliver a keynote address.

Keynote Address: The DataHub

The keynote address, “The DataHub: A Collaborative Data Analytics and Visualization Platform,” will introduce DataHub, a hosted interactive data processing, sharing, and visualization system for large-scale data analytics that is now being built at MIT. Key features include: flexible ingest and data cleaning tools; a scalable, parallel, SQL-based analytic data processing engine optimized for extremely low-latency operation on large data sets; and an interactive visualization system that is tightly coupled to the data processing and lineage engine. Finally, DataHub is a hosted data platform, designed to eliminate the need for users to manage their own database.

“Hadoop’s Adolescence: An Analysis of Hadoop Usage in Scientific Workloads.” Kai Ren (Carnegie Mellon University); YongChul Kwon, Magdalena Balazinska, and Bill Howe (University of Washington).

The authors analyze Hadoop workloads from three different research clusters from an application-level perspective, with two goals: (1) explore new issues in application patterns and user behavior and (2) understand key performance challenges related to IO and load balance. The authors’ analysis suggests that Hadoop usage is still in its adolescence. Overall, they find significant opportunity for simplifying the use and optimization of Hadoop, and make recommendations for future research. (For a more detailed summary of this paper, see this excellent blog post by Magda Balazinska.)

“Counting with the Crowd.” Adam Marcus, David Karger, Sam Madden, Robert Miller, and Sewoong Oh (MIT CSAIL).

The authors address the problem of selectivity estimation in a crowdsourced database. Specifically, they develop several techniques for using workers on a crowdsourcing platform like Amazon’s Mechanical Turk to estimate the fraction of items in a dataset (e.g., a collection of photos) that satisfy some property or predicate (e.g., photos of trees). The authors find that for images, counting can reduce the amount of work necessary to arrive at an estimate that is within 1% of the true fraction by up to an order of magnitude, with lower worker latency.
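The core idea of batched counting can be illustrated with a small simulation. The sketch below is not the authors' algorithm; it is a minimal illustration of the underlying estimation principle, assuming workers report (noisy) counts over batches of items rather than labeling each item one by one. The function names and the noise model are invented for illustration.

```python
import random

random.seed(42)  # deterministic for the illustration

def simulate_worker_count(batch, predicate, noise=0.1):
    """Simulate a crowd worker reporting how many items in a batch
    satisfy the predicate, with some counting noise."""
    true_count = sum(1 for item in batch if predicate(item))
    # Workers may miscount slightly; clamp to the valid range.
    noisy = true_count + random.gauss(0, noise * len(batch))
    return max(0, min(len(batch), round(noisy)))

def estimate_fraction(items, predicate, batch_size=20, num_batches=30):
    """Estimate the fraction of items satisfying the predicate by
    averaging per-batch counts from (simulated) workers."""
    total_reported = 0
    total_shown = 0
    for _ in range(num_batches):
        batch = random.sample(items, batch_size)
        total_reported += simulate_worker_count(batch, predicate)
        total_shown += batch_size
    return total_reported / total_shown

# Toy dataset: 10,000 "photos", 30% of which contain trees.
photos = [{"has_tree": i < 3000} for i in range(10000)]
est = estimate_fraction(photos, lambda p: p["has_tree"])
```

The point of the batched design is that a worker can glance at 20 photos and report "about 6 have trees" far faster than labeling 20 photos individually, which is where the order-of-magnitude work reduction comes from.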

“Processing Analytical Queries over Encrypted Data.” Stephen Tu, M. Frans Kaashoek, Sam Madden, and Nickolai Zeldovich (MIT CSAIL).

Monomi securely executes analytical workloads over sensitive data on an untrusted database server. It works by encrypting the entire database and running queries over the encrypted data. Monomi introduces split client/server query execution, which can execute arbitrarily complex queries over encrypted data, as well as several techniques that improve performance for such workloads, a designer for choosing an efficient physical design at the server for a given workload, and a planner to choose an efficient execution plan for a given query at runtime.
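The split client/server idea can be sketched concretely. In the toy example below, deterministic encryption lets the untrusted server evaluate an equality filter without seeing plaintext values, and the client finishes the part of the query the server cannot compute. This is a conceptual sketch only: the keyed hash and the additive offset are stand-ins for real encryption schemes, and all names here are invented for illustration, not Monomi's API.

```python
import hashlib

KEY = b"client-only-key"  # never leaves the client

def det_enc(value):
    """Deterministic keyed hash: equal plaintexts map to equal
    ciphertexts, so the server can test equality without learning
    the values. (A stand-in, not real cryptography.)"""
    return hashlib.sha256(KEY + str(value).encode()).hexdigest()

OFFSET = 123456  # toy numeric "cipher", also a stand-in

def enc_num(n):
    return n + OFFSET

def dec_num(c):
    return c - OFFSET

# Client side: encrypt the table before uploading it.
rows = [("alice", 120), ("bob", 80), ("alice", 200)]
server_table = [(det_enc(name), enc_num(sal)) for name, sal in rows]

# Query: SELECT AVG(salary) FROM t WHERE name = 'alice'
# Server half: equality filter over ciphertexts only.
target = det_enc("alice")
server_result = [c for name_c, c in server_table if name_c == target]

# Client half: decrypt and finish the aggregate the server
# cannot compute over ciphertexts.
avg_salary = sum(dec_num(c) for c in server_result) / len(server_result)
# avg_salary == 160.0
```

The design point is the split itself: the server does as much filtering and data reduction as the encryption schemes permit, and the client completes whatever remains over the (much smaller) decrypted result.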

“Scorpion: Explaining Away Outliers in Aggregate Queries.” Eugene Wu and Sam Madden (MIT CSAIL).

Scorpion is a system that takes user-specified outlier points in an aggregate query result as input and finds predicates that explain the outliers in terms of properties of the input tuples used to compute the selected outlier results. This explanation identifies predicates that, when applied to the input data, cause the outliers to disappear from the output. To find such predicates, the authors design algorithms that efficiently search for maximum influence predicates over the input data. The authors show that these algorithms can run orders of magnitude faster than a naive search algorithm while providing comparable quality on a synthetic data set.
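To make the notion of an influential predicate concrete, here is a brute-force sketch of the naive baseline the authors improve upon: score each single-attribute equality predicate by how much removing its matching tuples moves the aggregate toward the value the user expected. Scorpion's contribution is searching this space efficiently; the function names and toy sensor data below are invented for illustration.

```python
def influence(tuples, agg, predicate, expected):
    """How much does removing tuples matching the predicate move
    the aggregate toward the expected (non-outlier) value?"""
    baseline = agg(tuples)
    remaining = [t for t in tuples if not predicate(t)]
    if not remaining:
        return float("-inf")
    return abs(baseline - expected) - abs(agg(remaining) - expected)

def best_predicate(tuples, agg, expected):
    """Naive exhaustive search over single-attribute equality
    predicates; returns the (attribute, value) pair whose removal
    best explains the outlier."""
    best, best_score = None, float("-inf")
    for attr in tuples[0].keys():
        for val in {t[attr] for t in tuples}:
            pred = lambda t, a=attr, v=val: t[a] == v
            score = influence(tuples, agg, pred, expected)
            if score > best_score:
                best, best_score = (attr, val), score
    return best

# Outlier AVG(temp) caused by one faulty sensor.
readings = [
    {"sensor": "s1", "temp": 20}, {"sensor": "s2", "temp": 21},
    {"sensor": "s3", "temp": 95}, {"sensor": "s1", "temp": 19},
    {"sensor": "s3", "temp": 98},
]
avg = lambda ts: sum(t["temp"] for t in ts) / len(ts)
# The user expected roughly 20; which predicate explains the outlier?
print(best_predicate(readings, avg, expected=20))  # ('sensor', 's3')
```

The brute-force version enumerates every candidate predicate and re-evaluates the aggregate each time, which is why a naive search can be orders of magnitude slower than the algorithms in the paper.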

“A Demonstration of Iterative Parallel Array Processing in Support of Telescope Image Analysis.” Matthew Moyers, Emad Soroush, Spencer C. Wallace, Simon Krughoff, Jake Vanderplas, Magdalena Balazinska, and Andrew Connolly (University of Washington).

The authors introduce AscotDB, a new tool for the analysis of telescope image data. AscotDB results from the integration of Ascot, a web-based tool for the collaborative analysis of telescope images and metadata from astronomical telescopes, and SciDB, a parallel array processing engine. The authors demonstrate the novel data exploration supported by this integrated tool on a 1-TB dataset comprising scientifically accurate, simulated telescope images.

About VLDB

VLDB is a premier annual international forum for data management and database researchers, vendors, practitioners, application developers, and users. Data management and databases remain among the main technological cornerstones of emerging applications of the 21st century.
