GenBase: A Benchmark for the Genomics Era

By Rebecca Taft, MIT CSAIL*

Genomics is quickly becoming the focus of many Big Data scientists due to the seemingly sudden availability of vast amounts of data.  As mentioned in a previous post, a single gene-sequencing facility can sequence 2000 people per day and produce 6 TB of data per day, at the cost of just over $1000 per genome. Despite this plethora of raw data, the cost of data management (e.g., filters, joins) and complex analytics (e.g., regression, statistics) is often prohibitive, making analysis difficult.

In collaboration with the Intel Parallel Computing Lab, the Broad Institute and Novartis, we at the MIT Database Group have designed GenBase, a benchmark that identifies five representative genomics tasks involving both data management and analytics.  We’re presenting GenBase today, September 10, in a lightning talk at the XLDB 2013 conference at Stanford.

Genomics tasks tend to rely heavily on statistics and linear algebra operations such as matrix multiplication, which are O(n³) in runtime.  As datasets increase in size, performance becomes ever more important. By running these common tasks on a number of different data processing and storage systems, we have discovered surprising differences between these systems in their ability to handle these operations.
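To make the cubic cost concrete, here is a minimal sketch of the textbook matrix multiplication algorithm that also counts its scalar multiplications. It is an illustration only, not part of GenBase; the function names `naive_matmul` and `identity` are hypothetical, and production systems use optimized linear algebra libraries rather than loops like these.

```python
def naive_matmul(a, b):
    """Multiply two n x n matrices given as lists of rows.

    Returns the product and the number of scalar multiplications performed.
    """
    n = len(a)
    product = [[0.0] * n for _ in range(n)]
    multiplications = 0
    for i in range(n):
        for j in range(n):
            total = 0.0
            for k in range(n):
                total += a[i][k] * b[k][j]
                multiplications += 1
            product[i][j] = total
    return product, multiplications


def identity(n):
    """Build an n x n identity matrix (used here only as simple test input)."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    # Each doubling of n multiplies the work by eight.
    for n in (10, 20, 40):
        _, count = naive_matmul(identity(n), identity(n))
        print(n, count)
```

The three nested loops each run n times, so the count is exactly n³: doubling the matrix dimension means eight times as much work.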

Our hope is that this benchmark will allow scientists to determine which systems are best suited to their data processing needs, and will allow application developers to create systems specifically optimized for genomics tasks.  If we are successful, biologists and health care professionals will be able to spend less time on data processing and more time solving important biological problems.

The Data

Many people assume genomics data is a series of characters (A, C, T, G) representing a strand of DNA. In fact, sequence data is only one kind of genomics data.  For example, you may have heard that genes are special subsequences of DNA interspersed throughout the genome. In addition to recording the sequence of particular genes, scientists can also measure how often genes are activated to produce RNA and protein, which carry out the function of the gene. The relative amount of RNA or protein produced by a gene is called its “expression level.”

The data we are using in our benchmark is called “microarray” data, which is a matrix of floating-point numbers indicating the expression level of thousands of different genes for thousands of different patients. Each element of the matrix is a measure of how “active” a particular gene is in a particular patient. This data has many applications in biology and healthcare, and we cover some of the most important below.
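As a rough picture of this layout, the sketch below builds a tiny patients-by-genes expression matrix and runs one data management step (a filter on patient metadata) followed by two simple analytics steps (per-gene means and the covariance between two genes). All values, gene names, patient fields, and function names are made up for illustration; this is not one of the benchmark queries.

```python
from statistics import mean

# Hypothetical toy data: one row per patient, one column per gene.
gene_ids = ["gene_a", "gene_b", "gene_c"]
patients = [
    {"id": "p1", "age": 34},
    {"id": "p2", "age": 61},
    {"id": "p3", "age": 58},
]
expression = [
    [0.10, 2.00, 1.00],  # expression levels for p1
    [0.40, 4.00, 3.00],  # expression levels for p2
    [0.60, 6.00, 2.00],  # expression levels for p3
]


def select_rows(matrix, metadata, predicate):
    """Data management: keep the rows whose patient metadata passes the predicate."""
    return [row for row, meta in zip(matrix, metadata) if predicate(meta)]


def column_means(matrix):
    """Analytics: mean expression level of each gene across the given patients."""
    return [mean(column) for column in zip(*matrix)]


def covariance(xs, ys):
    """Analytics: sample covariance between the expression levels of two genes."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


# Filter to patients over 50, then summarize each gene for that group.
older = select_rows(expression, patients, lambda meta: meta["age"] > 50)
older_means = column_means(older)

# Covariance of gene_a and gene_b across all patients.
gene_a = [row[0] for row in expression]
gene_b = [row[1] for row in expression]
cov_ab = covariance(gene_a, gene_b)
```

At real scale the matrix has thousands of rows and columns, which is where the choice of system for the filter step and for the analytics step starts to matter.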

The Queries

Using our microarray dataset, we have identified five classes of queries that are relevant to biology and healthcare.  These queries and their applications are summarized in the table below.

Table 1: GenBase Benchmark Query Classes. (Courtesy of Intel Labs and MIT CSAIL.)

The Systems

We will make our benchmark queries available online this fall, and we hope they will be useful for application developers to analyze the performance of their systems on a genomics workload.

As a starting point, we are working to implement the benchmark on the following systems:

  • R ― A popular scientific programming language with extensive implementations of statistics and linear algebra operations.
  • Postgres ― A popular open source, relational DBMS.
  • Hadoop ― An open source implementation of Google’s MapReduce architecture.
  • SciDB ― An array-based database designed to handle large analytic workloads.
  • SciDB + Intel® Xeon Phi™ ― A new coprocessor used in our benchmark to accelerate operations in SciDB.
  • System X (name changed due to license restrictions) ― A popular columnar RDBMS.

Our final results will be available later this fall, but preliminary results show that SciDB, SciDB + Intel® Xeon Phi™, and System X perform best overall. R performs well with analytics, but poorly on data management operations, especially as the size of the dataset increases.  Postgres performs reasonably well on data management, but is not well suited for analytics.  Hadoop performs poorly overall.

Up Next

We’re planning to publish a paper on this benchmark later this fall and will also make the benchmark available online. You can learn more on our website, and check back for updates.

*This ISTC for Big Data project team includes Pradeep Dubey, Nadathur Satish, and Narayanan Sundaram of the Intel Parallel Computing Lab; and Sam Madden, Mike Stonebraker, Rebecca Taft and Manasi Vartak from MIT CSAIL.




