Reflections on the First Year of the ISTC for Big Data


By Sam Madden, MIT CSAIL

A year ago, Intel announced that our team had been selected to host the Intel Science and Technology Center for Big Data. This seems like a good opportunity to reflect on some of the awesome things we’ve achieved this year, and to remind our readers of some of our past blog content.

– We are building a great team, including professors and students at six universities plus many awesome collaborators at Intel. At MIT, we just hired research scientist Nesime Tatbul, an Intel researcher embedded at MIT, and have hired several postdocs and researchers, including Todd Mostak, developer of MapD.

– We are working hard to develop benchmarks and collect data sets for users of Big Data.  See our list of data sets and check out some of our work on benchmarking. This year we devised a genomics benchmark focused on array operations over microarray data.

– A big focus is on developing database systems for scientific users.  ISTC Researchers David Maier and Stan Zdonik are developing a new query language standard for array database operations, while ISTC Researcher Magda Balazinska and her team developed new technologies for running iterative operations over arrays and for efficiently accessing historical versions of arrays.

– We made a big push on scalable data visualization, with several of our teams developing innovative technologies in this area. In their work on scalable prefetching, Justin DeBrabant and Leilani Battle built a system for efficiently determining which data to show users next. Todd Mostak built MapD, a system that uses many-core and GPU hardware to efficiently process analytic SQL queries and visualize their results on a map; his system makes it possible to interact with hundreds of millions of data points on a map in real time. Eugene Wu, a graduate student, developed DBWipes, a system for connecting visualizations with the data sets that underlie them, so that outlier data points in graphs and charts can be linked to the underlying data points that contributed to them, and those data points can be ranked using a notion of influence that he developed.

– We’ve built many other cool things. One that I particularly like is called anti-caching. The idea starts from the observation that as main memory sizes increase, it is becoming more and more likely that transactional databases will fit entirely in main memory. As our work on the H-Store project has shown, a transactional database system optimized for main memory operation can be orders of magnitude faster than traditional systems, which were designed on the assumption that data resides on disk and which treat main memory as a cache for the disk pages currently being operated on. Anti-caching reverses this arrangement: memory is the primary store for data, but when a database system needs to access a few more pages than will fit in memory, some pages can be moved out of memory and spilled to disk. The advantage is that this preserves many of the performance advantages of main-memory optimized systems like H-Store while making it possible to operate on data sets that exceed the capacity of RAM.
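To make the anti-caching idea a bit more concrete, here is a toy sketch in Python. It is not H-Store's implementation: the `AntiCacheStore` class, its method names, the use of a plain dictionary as a stand-in for disk, and the least-recently-used eviction policy are all assumptions made for illustration. What it tries to show is the inversion described above: memory is the primary home for records, and only the overflow is spilled.

```python
from collections import OrderedDict


class AntiCacheStore:
    """Toy key-value store where memory is primary and overflow spills to 'disk'.

    Hypothetical illustration only: the names and the LRU policy are
    assumptions, and a dict stands in for an on-disk store.
    """

    def __init__(self, memory_capacity):
        self.memory_capacity = memory_capacity
        self.memory = OrderedDict()  # ordered from coldest to hottest
        self.disk = {}               # stand-in for an on-disk store

    def _spill_if_needed(self):
        # Move the coldest records out of memory until we fit again.
        while len(self.memory) > self.memory_capacity:
            key, value = self.memory.popitem(last=False)
            self.disk[key] = value

    def put(self, key, value):
        # A record lives in exactly one place, so drop any spilled copy.
        self.disk.pop(key, None)
        self.memory[key] = value
        self.memory.move_to_end(key)  # mark as most recently used
        self._spill_if_needed()

    def get(self, key):
        if key in self.memory:
            self.memory.move_to_end(key)
            return self.memory[key]
        # Record was spilled: bring it back into memory (KeyError if unknown).
        value = self.disk.pop(key)
        self.memory[key] = value
        self._spill_if_needed()
        return value


store = AntiCacheStore(memory_capacity=2)
store.put("a", 1)
store.put("b", 2)
store.put("c", 3)        # "a" is the coldest record, so it is spilled
value = store.get("a")   # "a" returns to memory and "b" is spilled instead
```

Note that in this sketch a record is either in memory or on "disk", never both, which is the main difference from a traditional buffer pool, where disk holds everything and memory holds copies of the hot pages.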

For more information about the goals and our work, see our recently published overview paper in SIGMOD Record, and look over the past year’s articles in this blog!

