Big Data New Year’s Resolutions

Five Things to Watch for in 2013


By Sam Madden, Ph.D. 

Happy New Year to the global Big Data community.  Since it’s that time of year for resolutions, here are some of the things we here at the ISTC for Big Data resolve to do in the coming year.

Resolution #1:  Deliver Scalable Algorithms for Big Science

We resolve to deliver scalable algorithms for genomics and big science. Working with biologists at the Broad Institute at MIT, we’re developing a benchmark of important array-based algorithms for analyzing and comparing genomes. This is a first step towards understanding how database systems can be used to make sense of this kind of data. In addition, we are working to parallelize a number of such algorithms, including classical array algorithms like regression and bi-clustering, as well as more recent similarity-detection algorithms like locality-sensitive hashing (LSH).
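To give a flavor of what LSH does, here is a minimal single-machine sketch in Python. It uses MinHash signatures over k-mers with banding, which is one common LSH variant for set similarity; it is not necessarily the variant or the parallel implementation we are building, and the function names, k-mer length, and band size are illustrative assumptions.

```python
import hashlib


def kmers(seq, k=4):
    """Return the set of length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(seed, item):
    """Deterministic 64-bit hash of an item, salted by a seed."""
    digest = hashlib.sha1(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=64):
    """For each salted hash function, keep the minimum hash over the set."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def band_keys(sig, rows_per_band=4):
    """Split a signature into bands; each band acts as a hash-bucket key."""
    return {(start, tuple(sig[start:start + rows_per_band]))
            for start in range(0, len(sig), rows_per_band)}


def is_candidate_pair(sig_a, sig_b, rows_per_band=4):
    """Two items are compared in detail only if they share a bucket."""
    return bool(band_keys(sig_a, rows_per_band) & band_keys(sig_b, rows_per_band))


# Made-up sequences, for illustration only.
sig_a = minhash_signature(kmers("ACGTTGCAAGGCTTACGATC"))
sig_b = minhash_signature(kmers("ACGTTGCAAGGCTTACGTTC"))  # one base changed
sig_c = minhash_signature(kmers("TTTTTTTTTT"))            # shares no k-mers

print(estimated_jaccard(sig_a, sig_b), is_candidate_pair(sig_a, sig_b))
print(estimated_jaccard(sig_a, sig_c), is_candidate_pair(sig_a, sig_c))
```

The appeal for large collections is that only items landing in the same bucket need a detailed comparison, and the bucket keys can be computed independently for each item, which is what makes the approach a natural fit for parallel execution.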

Resolution #2:  Keep Hosted Data More Secure

We resolve to keep your hosted data more secure, while preserving your ability to query it. In particular, we’re building a system that lets you run large analytic SQL queries on top of encrypted data, based on extensions to the CryptDB work from Popa et al. presented at last year’s ACM Symposium on Operating Systems Principles (SOSP). These extensions allow us to perform large sequential scans over encrypted data without dramatically increasing the size of the data, resulting in slowdowns of only about a factor of two versus unencrypted data on TPC-H at scale factors 10 to 100.
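How can a server compute over data it cannot read? One well-known building block is additively homomorphic encryption, such as the Paillier cryptosystem, in which multiplying ciphertexts yields an encryption of the sum of the plaintexts. The toy Python sketch below illustrates that idea only. It is not the system described above, the function names are our own for this example, and the primes are far too small to be secure.

```python
import math
import random


def generate_keys(p, q):
    """Build a toy Paillier key pair from two primes (insecurely small here)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse; requires Python 3.8+
    return n, (lam, mu)


def encrypt(n, message):
    """Encrypt an integer 0 <= message < n, using generator g = n + 1."""
    n_sq = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, message, n_sq) * pow(r, n, n_sq)) % n_sq


def decrypt(n, private_key, ciphertext):
    """Recover the plaintext using the private key (lam, mu)."""
    lam, mu = private_key
    l_value = (pow(ciphertext, lam, n * n) - 1) // n
    return (l_value * mu) % n


def encrypted_sum(n, ciphertexts):
    """Server side: multiplying ciphertexts adds the hidden plaintexts."""
    total = 1  # a valid encryption of 0
    for c in ciphertexts:
        total = (total * c) % (n * n)
    return total


# The client encrypts a column; the server aggregates without the key.
n, private_key = generate_keys(1009, 1013)
column = [encrypt(n, value) for value in [120, 75, 305]]
server_result = encrypted_sum(n, column)
print(decrypt(n, private_key, server_result))  # 500, the plaintext SUM
```

Note that each ciphertext here is much larger than its plaintext, which hints at why scanning encrypted data without dramatically increasing its size is a real engineering problem.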

Resolution #3:  Identify Data Outliers More Easily

We resolve to help you understand where the outliers in your data came from, and why. We’re working on systems that visualize the results of SQL aggregate queries and let you flag outputs with unexpected values as outliers. The system then suggests common properties of the records that make up these outliers, so that you can understand why your data looks the way it does. Figure 1 shows an example of this with some data from a collection of sensors deployed in an indoor lab setting.

Figure 1: In a plot of temperature versus time averaged over many sensors, the user selects some time windows that are outliers (left plot). The system then suggests some common properties of the records that make up those outliers, and allows the user to filter them from the input (right plot). Here the high-variance readings are caused by one malfunctioning sensor (moteid 15). When its values are removed from the input, the high-variance windows disappear.
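The idea behind Figure 1 can be sketched in a few lines of Python. This is a deliberately simplified illustration, not our actual system: it considers only predicates of the form "moteid = x", scores each one by how much removing the matching records reduces the spread in the flagged windows, and runs on made-up readings.

```python
from collections import defaultdict
from statistics import pstdev

# Hypothetical readings: (time_window, moteid, temperature).
readings = [
    (0, 1, 20.1), (0, 2, 20.4), (0, 15, 20.3),
    (1, 1, 20.2), (1, 2, 20.5), (1, 15, 95.0),
    (2, 1, 20.0), (2, 2, 20.3), (2, 15, 110.0),
]


def window_spread(rows, windows):
    """Sum of per-window standard deviations over the selected windows."""
    by_window = defaultdict(list)
    for window, _, temp in rows:
        if window in windows:
            by_window[window].append(temp)
    return sum(pstdev(temps) for temps in by_window.values())


def rank_suspect_sensors(rows, flagged_windows):
    """Rank sensors by how much dropping them calms the flagged windows."""
    baseline = window_spread(rows, flagged_windows)
    scores = {}
    for mote in {m for _, m, _ in rows}:
        kept = [r for r in rows if r[1] != mote]
        scores[mote] = baseline - window_spread(kept, flagged_windows)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# The user flags windows 1 and 2 as outliers.
ranking = rank_suspect_sensors(readings, {1, 2})
print(ranking[0][0])  # 15, the malfunctioning sensor
```

A real system has to search a much larger space of candidate predicates (combinations of attributes, value ranges, and so on), which is where most of the difficulty lies.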

Resolution #4:  Integrate Humans into Big Data Processing Systems

As part of the Qurk project at MIT, we’ve been building systems that make it possible to use people on services like Amazon’s Mechanical Turk to populate, filter, and rank database tables and query results. We’ve got several exciting new research results coming this year, including a study of methods for using humans to process aggregation queries involving averages and counts over collections of images and text, and techniques for using humans to efficiently train classifiers that can assign elements in a database to one or more categories.
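As a rough illustration of a crowd-powered count, the Python sketch below samples a few items, asks several workers a yes/no question about each, resolves disagreements by majority vote, and scales the resulting fraction up to the full collection. This is a simple baseline for illustration, not the method from our study, and the worker answers are made up.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer among the workers for one item."""
    return Counter(answers).most_common(1)[0][0]


def estimate_count(sample_votes, population_size):
    """Estimate how many items in the full collection satisfy the predicate.

    sample_votes holds one list of worker answers (True/False) per sampled
    item. An odd number of workers per item avoids ties.
    """
    labels = [majority_vote(votes) for votes in sample_votes]
    fraction = sum(labels) / len(labels)
    return fraction * population_size


# Hypothetical answers to "does this image contain a person?" for four
# sampled images, three workers each, drawn from a 1,000-image collection.
votes = [
    [True, True, False],
    [False, False, True],
    [True, True, True],
    [False, False, False],
]
print(estimate_count(votes, population_size=1000))  # 500.0
```

The interesting research questions start where this sketch stops: how many items to sample, how many workers to ask per item, and how to account for workers who are careless or biased.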

Resolution #5:  Build Super-Scalable Interactive Visualizations

We resolve to build super-scalable interactive visualizations of big data, using sampling together with the parallel data-processing and rendering capabilities of modern graphics processing units (GPUs). We’re working on systems that systematically subsample data, both to keep visualizations of millions or billions of data points interactive and to make them more readable by showing only the values that matter rather than covering the screen with vast numbers of points. We’re also using GPUs to further improve interactivity, including building a column-oriented database system that pushes certain compute-intensive database operations into hardware.
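One simple way to subsample for a scatter plot is to keep at most one point per screen pixel, since extra points in the same pixel add nothing visible. The Python sketch below shows that idea under stated assumptions: the function name and parameters are hypothetical, it runs on the CPU, and it is not necessarily the sampling strategy our systems use.

```python
def subsample_per_pixel(points, width, height, x_range, y_range):
    """Keep the first point that falls in each cell of a width x height grid."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    kept = {}
    for x, y in points:
        # Map data coordinates to a pixel cell, clamping to the grid edges.
        col = int((x - x_min) / (x_max - x_min) * width)
        row = int((y - y_min) / (y_max - y_min) * height)
        col = max(0, min(col, width - 1))
        row = max(0, min(row, height - 1))
        kept.setdefault((col, row), (x, y))
    return list(kept.values())


# 1,000 synthetic points in the unit square, drawn on a 10 x 10 pixel canvas.
points = [(i / 1000, (i * 7 % 1000) / 1000) for i in range(1000)]
visible = subsample_per_pixel(points, 10, 10, (0.0, 1.0), (0.0, 1.0))
print(len(points), "points reduced to", len(visible))
```

The number of points sent to the renderer is then bounded by the number of pixels rather than by the size of the data set, which is what keeps the visualization interactive as the data grows.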

What do you resolve to do with your data this year?
