Intel and the ISTC for Big Data (2012-2017): A Powerful Collaboration

By Jeff Parkhurst, Ph.D. and Timothy G. Mattson, Ph.D., Intel 

The year 2012 was arguably the year that Big Data went mainstream. Data was being hailed as a new class of economic asset, similar to currency or gold, from the stage at Davos and in mainstream media. Big Data played a role in the 2012 Presidential election, and the Obama White House launched its $200 million Big Data Research and Development Initiative. IDC reported that the world’s data, estimated at 2.8 ZB (zettabytes), would continue to double every year, although only a fraction of “digital universe” data was currently being explored for analytics.

In other words: 2012 was a perfect time to fire up our newest Intel Science and Technology Center, part of a program of academic collaborations that help us stay ahead of technology developments that present opportunities for our business.

On May 30, 2012, Intel announced that a proposal from MIT had won a national competition to be the sixth Intel Science and Technology Center, one focusing on Big Data. (The new research center was one of several Big Data initiatives announced that day by MIT, Intel and the Commonwealth of Massachusetts.)

The MIT proposal was selected from a field of 150+ candidates. This speaks to both the high level of academic interest in working with Intel to identify and prototype revolutionary technology for Big Data and the quality of the thinking and talent behind the MIT proposal.

Based at MIT CSAIL and directed by Professors Michael Stonebraker and Sam Madden, the ISTC for Big Data included nearly two dozen top academic researchers from six leading universities around the country.

The mission of the ISTC for Big Data was to identify areas in big data that present opportunities for Intel, both by growing markets for our businesses and by influencing the development of Intel products. Through the ISTC for Big Data, we got a glimpse into future benchmarks and workloads, collaborated closely with visionary academic researchers (including embedding Intel scientists at MIT), and drove development of new Big Data solutions.

TileDB is a stellar example of a new Big Data technology that emerged from the ISTC.

Array data is one of the fundamental data types in Big Data applications. TileDB is a storage engine for array data. Depending on the workload, it dramatically beats the state of the art in managing array data. Intel uses TileDB in our collaborative cancer cloud (CCC) software stack to manage genomic data. Open-source versions of TileDB and of GenomicsDB, a powerful genomics tool built on top of it, are available, in keeping with the policy for all software emanating from Intel-backed research done under the ISTC umbrella.

Projects at the ISTC had an impact on designing Intel hardware. For example, Professor Srini Devadas’ group at MIT helped develop an efficient hardware indirect memory prefetcher (IMP). Developed as part of an internship at Intel, the IMP learns how much of each accessed cache line is used by a core and requests partial lines where appropriate.

Intel’s Parallel Computing Lab (PCL) conducts ground-breaking research based on analysis of full solution stacks—from the workload or benchmark, to the middleware, all the way down to the hardware. Working with the ISTC, PCL researchers developed key benchmarks that helped them identify potential improvements in our Xeon Phi products.

In late March, the ISTC released the first version of BigDAWG, a polystore system for simplifying integration and analytics of disparate data at scale. This was a problem that we watched steadily grow in magnitude over the last five years. BigDAWG, which stands for “Big Data Working Group,” is the culmination of several years of intense collaborative work by researchers from Intel and several academic institutions. BigDAWG is open-source software and available for download at bigdawg.mit.edu.

Along the way, we regularly reported on our work here on the ISTC blog—sometimes with surprising results. Through the blog, for example, a technologist inside Intel’s data center group learned of our work and found a student from MIT to collaborate with on graph-based datasets, which will provide additional test beds for Intel’s analytics software. The ISTC for Big Data blog will continue to be available here as a record of our work, our collaboration and our awesome researchers.

Perhaps the largest benefit from the ISTC was the number of new relationships that we established with MIT—and beyond. We learned about the Myria and Vega projects at the University of Washington. The ISTC connected us with the H-Store group at Brown University, which led to a collaboration on a streaming data management system called S-Store. Our BigDAWG research builds closely on work at Northwestern University and the University of Chicago.

In short, the ISTC has connected us to many of the leading thinkers in database-related technology working in universities today, which is exactly what Intel wants from an ISTC.

***

Jeff Parkhurst, PhD, is Intel program director for the ISTCs for Big Data and Cloud Computing and a contributor to BigDAWG. Previously, he wrote about the pivotal role of academic collaboration in accelerating practical application of research.

Tim Mattson is the Intel Principal Investigator (PI) for the ISTC for Big Data, a senior principal engineer in the Intel Parallel Computing Lab, and a contributor to BigDAWG. His research focuses on technologies that help programmers write parallel applications, including programming languages (OpenMP and OpenCL), parallel design patterns, and parallel math libraries.  Dr. Mattson was lead author on the CIDR 2017 paper “Demonstrating the BigDAWG Polystore System for Ocean Metagenomic Analysis.”

 
