By Manasi Vartak, MIT CSAIL
Genomics is quickly becoming a major source of Big Data, thanks to advances in sequencing technology that have made it dramatically faster and less expensive. A single sequencing facility can now gene-sequence more than 2,000 people per day; at roughly 3 GB per genome, such a facility produces 6 TB of data every day. However, our ability to analyze this data at scale has not kept pace.
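As a quick sanity check of those data rates, the arithmetic can be sketched in a few lines of Python (the throughput and per-genome figures are the ones quoted above):

```python
# Back-of-the-envelope check of the sequencing data rates above.
GENOMES_PER_DAY = 2000   # throughput of a single sequencing facility
GB_PER_GENOME = 3        # approximate size of one human genome

tb_per_day = GENOMES_PER_DAY * GB_PER_GENOME / 1000  # 1 TB = 1000 GB
print(f"{tb_per_day:.0f} TB/day")  # 6 TB/day
```

At that rate a facility generates over 2 PB of raw genomic data per year.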
In this project, the MIT Database Group is working with collaborators from the Intel Parallel Computing Lab, Novartis and the Broad Institute to identify workloads representative of genomic analyses and compare their performance across various systems.
Genomic computations often involve extensive linear algebra and statistics operations, making traditional SQL analytics inadequate.
The benchmark begins with an initial set of workloads, ranging from singular value decomposition (SVD) to statistical operations, and measures their performance across systems including R, Postgres, SciDB, and Hadoop. The benchmark is currently being polished and will be available this fall.
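As an illustration only, a miniature version of the kind of SVD workload the benchmark measures might look like the following in Python with NumPy; the matrix sizes and the 0/1/2 genotype coding are hypothetical, not taken from the benchmark itself:

```python
import numpy as np

# Hypothetical genotype matrix: rows = individuals, columns = genetic
# variants, with each entry a genotype coded 0, 1, or 2.
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(100, 500)).astype(float)

# Center each column, as is typical before PCA-style population analyses.
centered = genotypes - genotypes.mean(axis=0)

# Thin SVD: columns of U give per-individual scores along principal axes,
# singular values in s give the variance captured by each axis.
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
print(U.shape, s.shape, Vt.shape)  # (100, 100) (100,) (100, 500)
```

Dense linear-algebra kernels like this are exactly the operations that are awkward to express in plain SQL, which is one reason the benchmark compares array systems such as SciDB against R, Postgres, and Hadoop.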
The ISTC for Big Data project team includes Pradeep Dubey, Nadathur Satish, and Narayanan Sundaram of the Intel Parallel Computing Lab, and Mike Stonebraker, Sam Madden, Rebecca Taft, and me from MIT CSAIL.