Improving Query Speeds on Vital Industry Big Data Sets

The Intel Science & Technology Center for Big Data is working on many ways to make it easier to access, store, manage and perform analytics on big, gnarly data sets that are vital to major fields of research. One way is to improve query speed. This is where the rubber meets the road: without good query speed, scientists and researchers can't work interactively with the data or fully exploit it.

To improve query speed, we’re taking various popular data sets and working on better ways to access them. We’re testing those accesses and setting benchmarks for those data sets. Each benchmark will answer this question: “If I’m doing this type of access, what should be my query speed?”
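To make the benchmark question concrete, here is a minimal sketch of how one might time a query workload and report a median latency. The function names and the in-memory scan standing in for a real query are illustrative, not part of any of our benchmarks.

```python
import time
import statistics

def benchmark(query_fn, runs=5):
    """Run query_fn several times and return the median latency in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        query_fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Example: a simple in-memory scan standing in for a real query.
data = list(range(1_000_000))
median_s = benchmark(lambda: sum(x for x in data if x % 7 == 0))
print(f"median latency: {median_s:.4f} s")
```

Reporting a median over several runs, rather than a single measurement, damps out warm-up and caching effects.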

Here are three kinds of data sets that we’re working with:

Satellite Imagery. When accessing NASA's multi-terabyte MODIS dataset, researchers often find that the summaries provided are too large for memory. They need low-latency access into the original data set. One established approach is to hide the latency of the back-end data store by prefetching and caching data that will be needed in the near future. We are developing a predictive middleware that will reside between the front-end visualization interface and the back-end data store and will predict, prefetch and cache relevant data. (Read the article.)
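As a rough illustration of the prefetch-and-cache idea, here is a toy middleware sketch: an LRU cache of image tiles with a naive predictor that prefetches adjacent tile IDs on each access. The tile IDs, `fetch_tile` callback, and spatial-locality predictor are hypothetical stand-ins, not our actual middleware design.

```python
from collections import OrderedDict

class PrefetchingCache:
    """Toy middleware: caches tiles and prefetches predicted neighbors."""

    def __init__(self, fetch_tile, capacity=64):
        self.fetch_tile = fetch_tile   # callback into the back-end store
        self.capacity = capacity
        self.cache = OrderedDict()     # LRU order: oldest entries first

    def _put(self, tile_id, data):
        self.cache[tile_id] = data
        self.cache.move_to_end(tile_id)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used

    def get(self, tile_id):
        if tile_id in self.cache:           # cache hit: no back-end call
            self.cache.move_to_end(tile_id)
            data = self.cache[tile_id]
        else:                               # miss: fetch from the back end
            data = self.fetch_tile(tile_id)
            self._put(tile_id, data)
        for nbr in self._predict(tile_id):  # prefetch likely next requests
            if nbr not in self.cache:
                self._put(nbr, self.fetch_tile(nbr))
        return data

    def _predict(self, tile_id):
        # Naive spatial-locality predictor: the adjacent tile IDs.
        return [tile_id - 1, tile_id + 1]
```

After `get(10)`, tiles 9 and 11 are already cached, so a pan to a neighboring tile is served from memory instead of the slow back end.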

Web/Social Media Data. Accessing this type of data involves getting a lot of data into one place and correlating it; for example, gathering micro-transactions on Twitter or Facebook, correlating the data and then visualizing it – millions of bits of highly interrelated, constantly changing information. (Read the article.)
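A minimal sketch of what "correlating" such data can mean in practice: count how often pairs of hashtags co-occur across posts gathered from different feeds. The posts below are made-up toy data, and the co-occurrence count is just one simple correlation among many.

```python
from collections import Counter
from itertools import combinations

# Hypothetical micro-transactions: (user, hashtags) gathered from feeds.
posts = [
    ("alice", {"bigdata", "genomics"}),
    ("bob",   {"bigdata", "modis"}),
    ("carol", {"genomics", "bigdata"}),
]

# Correlate: count how often each pair of hashtags appears together.
pair_counts = Counter()
for _user, tags in posts:
    for pair in combinations(sorted(tags), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(1))  # the most strongly co-occurring pair
```

On real data this computation runs over millions of constantly changing records, which is exactly why getting it all into one place and querying it quickly is hard.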

Genomics Data. Today there is a plethora of raw genomics data; for example, a single gene-sequencing facility can sequence 2000 people per day and produce 6 TB of data per day, at just over $1000 per genome. But the cost of data management and complex analytics is often prohibitive. In collaboration with the Intel Parallel Computing Lab, the Broad Institute and Novartis, the MIT Database Group has designed GenBase, a benchmark that identifies five representative genomics tasks involving both data management and analytics. GenBase was presented in September at the XLDB 2013 conference at Stanford. (Read the article.)
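To give a flavor of the kind of task such a benchmark exercises, here is a tiny sketch of regressing a drug response against a gene's expression level. The data is toy and the schema hypothetical; it is not taken from GenBase itself, which operates on far larger matrices.

```python
def linear_fit(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    return a, my - a * mx

expression = [0.1, 0.4, 0.5, 0.9]   # one gene's expression per sample
response   = [1.2, 1.8, 2.1, 2.9]   # drug response per sample
slope, intercept = linear_fit(expression, response)
print(f"slope={slope:.2f} intercept={intercept:.2f}")
```

The hard part at scale is not this arithmetic but running it across thousands of genes and samples inside the data management system, which is what the benchmark measures.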

Those are three good examples of what we're currently doing. In addition to working on infrastructure, we are working on how to optimize for various types of big data sets: figuring out how to make queries run faster and to enable interactive exploration.

We’re also putting data sets into the public domain so that people can access them, practice on them, and perform test runs on them. Here is the latest list of the data sets that we have assembled or are using in our research at the IS&TC for Big Data.

And here’s the latest list of the software we have developed at the IS&TC for Big Data.

Please be assured that everything we do, including benchmarks, is released as open source.



