by Stavros Papadopoulos, Intel Parallel Computing Lab
The ISTC for Big Data has recently released TileDB, a novel, efficient data management system for scientific data such as graphs, DNA sequences, matrices, maps, and imaging. We have also released an adaptation of TileDB for genomics, called GenomicsDB, in collaboration with Intel Health and Life Sciences. Last week, TileDB and GenomicsDB were announced by the Broad Institute, one of the most important genomics institutes in the world, which is currently using TileDB in its production pipeline. TileDB makes scientific data management fast and easy.
TileDB is open-source software, released under the MIT license and written in C++ for Linux and Mac OS X. The current release focuses on the TileDB storage manager module, exposed as a C API library, which makes it easy for programmers to write applications for diverse, complex, parallel scientific data analytics.
TileDB addresses two important issues compared to existing array data management solutions: sparsity (i.e., when an array contains many zero or empty elements) and updates. TileDB uses flexible tiling to efficiently capture both dense and sparse arrays, and introduces a novel batch-write technique to manage updates. Both features lead to impressive performance gains over existing solutions.
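To make the two ideas concrete, here is a toy Python sketch of a sparse array that stores only non-empty cells grouped into tiles, and that handles updates by appending whole batches rather than rewriting data in place. This is not the TileDB API and it does not describe TileDB's internals; the names `tile_of`, `write_batch`, `read_cell`, and `TILE_EXTENT`, as well as the "latest batch wins" rule, are assumptions made purely for illustration.

```python
from collections import defaultdict

TILE_EXTENT = 4  # hypothetical tile side length, chosen for the example


def tile_of(row, col, extent=TILE_EXTENT):
    """Return the id of the tile that contains cell (row, col)."""
    return (row // extent, col // extent)


def write_batch(cells):
    """Pack one batch of (row, col, value) writes into a fragment.

    Only non-empty cells are stored, grouped by the tile they fall in,
    so empty regions of the array cost nothing.
    """
    fragment = defaultdict(dict)
    for row, col, value in cells:
        fragment[tile_of(row, col)][(row, col)] = value
    return dict(fragment)


def read_cell(fragments, row, col, empty=None):
    """Look up a cell, letting the most recently written batch win."""
    tile_id = tile_of(row, col)
    for fragment in reversed(fragments):
        tile = fragment.get(tile_id)
        if tile is not None and (row, col) in tile:
            return tile[(row, col)]
    return empty


# Two batches: the second one updates cell (0, 1) and adds cell (3, 2).
fragments = [
    write_batch([(0, 1, 5.0), (9, 9, 2.5)]),
    write_batch([(0, 1, 7.0), (3, 2, 1.0)]),
]

updated = read_cell(fragments, 0, 1)    # value from the second batch
untouched = read_cell(fragments, 9, 9)  # value from the first batch
missing = read_cell(fragments, 5, 5)    # never written, so empty
```

In this sketch an update never modifies an earlier batch, so writes stay sequential and cheap, and the cost of reconciling batches is paid at read time.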
GenomicsDB models genomics data (whole genome sequences, as well as whole exome sequences) in a novel manner, representing them as sparse matrices, which can be efficiently managed by its underlying TileDB storage manager. This leads to enormous performance gains versus existing solutions.
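As a rough illustration of why a sparse matrix suits this kind of data, the Python sketch below stores a handful of variant calls as the non-empty cells of a very large matrix. The layout (rows as samples, columns as genomic positions), the sample values, and the helper `samples_at` are all assumptions for the example; they are not taken from GenomicsDB's actual schema or API.

```python
# Hypothetical variant calls: (sample index, genomic position, genotype).
calls = [
    (0, 1_000_123, "0/1"),
    (2, 1_000_123, "1/1"),
    (1, 2_500_042, "0/1"),
]

N_SAMPLES = 3
N_POSITIONS = 3_000_000  # toy genome length, far smaller than a real genome

# Sparse matrix: keep only the non-empty cells, keyed by (row, column).
matrix = {(sample, pos): genotype for sample, pos, genotype in calls}


def samples_at(matrix, pos):
    """Return {sample: genotype} for every sample with a call at pos."""
    return {s: gt for (s, p), gt in matrix.items() if p == pos}


# Fraction of cells that actually hold data.
density = len(matrix) / (N_SAMPLES * N_POSITIONS)

at_variant = samples_at(matrix, 1_000_123)
at_empty = samples_at(matrix, 5)
```

Even in this toy case, far fewer than one cell in a million is populated, which is why storing the matrix densely would waste almost all of its space.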
With TileDB, the Broad Institute can perform in minutes certain DNA processing tasks that take days with other tools. TileDB is also one of the key components of Intel’s Collaborative Cancer Cloud, whose main focus is precision medicine for more effective cancer treatment.
TileDB is getting a lot of attention. It was even discussed in the White House Fact Sheet on Precision Medicine.
We have a long agenda for upcoming features. We will soon follow up on this blog post with more information on the internal mechanics of TileDB, as well as detailed benchmark results versus existing solutions.
We hope to build an active open-source community around TileDB, one that will work towards bringing together two traditionally disjoint domains: Big Data Management and High Performance Computing.