Meet Prochlorococcus marinus, a marine cyanobacterium that is intricately linked to the global carbon cycle, is widely present in seawater, and may hold secrets to future climate change. These secrets could be revealed sooner with the help of new big data technologies under development by the ISTC for Big Data in collaboration with the Chisholm Lab at MIT.
The Chisholm Lab specializes in microbial oceanography and cross-scale systems biology. The lab has been archiving frozen seawater samples from across the globe for many years, in anticipation of one day being able to use them to analyze the relationship between the diversity of Prochlorococcus and specific environmental variables. A new, unconventional polystore database architecture being developed by the ISTC, called BigDAWG, may be the key to unlocking those secrets by simplifying and speeding up the analysis of complex ocean metagenomics data.
Finding Clues from a Sea of Genomic Data
Prochlorococcus is the most abundant photosynthetic organism on Earth, and it plays important roles in both the global carbon cycle and the ocean food web. Studies of Prochlorococcus have grown in number over the past decade. In the long term, they may help improve scientists’ ability to predict how climate change will affect the environment at large and how biogeochemically important marine microbial communities may respond to changing environmental pressures.
The Chisholm Lab studies the role of Prochlorococcus in the ocean’s metabolism. The Simons Foundation funds the sequencing and analysis part of the work. Essentially, seawater is collected from various parts of the ocean, then the microbes in each water sample are collected on a filter, frozen and transported to MIT. Back in the lab, the scientists break open the cells and randomly sequence fragments of DNA from those organisms. The dataset contains billions of FASTQ-format sequences along with associated metadata such as the location, date, depth and chemical composition of the water samples.
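To make the shape of this data concrete, here is a minimal sketch of reading FASTQ records, assuming the common layout of four lines per read (an `@` header, the sequence, a `+` separator, and a quality string). The `FastqRecord` type, the `read_fastq` helper, and the two toy reads are illustrative assumptions, not part of the lab's actual pipeline.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:].strip(), sequence, quality)


# Two made-up reads standing in for a real sequencing file.
toy_fastq = io.StringIO(
    "@read1\nACGTACGT\n+\nIIIIIIII\n"
    "@read2\nGGCATT\n+\nIIIIII\n"
)
records = list(read_fastq(toy_fastq))
for record in records:
    print(record.read_id, len(record.sequence))
```

Because the parser is a generator, it never holds more than one record in memory, which matters when a single sample contains millions of reads.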
Chisholm Lab researchers are interested in relating communities of microbes and the presence of certain genes in a particular sample with environmental parameters (for example, light, temperature and the chemical composition of the seawater). The data allow them to look for patterns—for example, whether or not a particular gene is relatively abundant in specific regions of the world.
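One simple way to look for such a pattern is to compute a gene's relative abundance in each sample and correlate it with an environmental parameter. The sketch below does this with a plain Pearson correlation. The field names and all values are made up for illustration, and the lab's actual statistical methods are not described here.

```python
from math import sqrt
from typing import Sequence


def relative_abundance(gene_hits: int, total_reads: int) -> float:
    """Fraction of a sample's reads that matched the gene of interest."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return gene_hits / total_reads


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Toy values only: these are NOT measurements from the lab's dataset.
samples = [
    {"temperature_c": 12.0, "gene_hits": 100, "total_reads": 1_000_000},
    {"temperature_c": 18.0, "gene_hits": 200, "total_reads": 1_000_000},
    {"temperature_c": 24.0, "gene_hits": 300, "total_reads": 1_000_000},
    {"temperature_c": 28.0, "gene_hits": 400, "total_reads": 1_000_000},
]

temperatures = [s["temperature_c"] for s in samples]
abundances = [relative_abundance(s["gene_hits"], s["total_reads"]) for s in samples]
r = pearson(temperatures, abundances)
print(f"correlation between temperature and gene abundance: {r:.3f}")
```

Normalizing by total reads keeps samples with different sequencing depths comparable before any correlation is computed.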
In addition to these samples collected from around the world, the Chisholm Lab is also sequencing samples from two locations in the ocean, collected at monthly intervals for two years. Analyses of these massive sequencing datasets could help lead to new knowledge about seasonality and temporal changes in marine microbial communities and the factors that control these patterns.
Wanted: Faster, More Efficient and Scalable Analytics
With the ability to collect ever-increasing volumes of complex and diverse types of data, microbial ecologists face a number of challenges:
- The time and cost of expeditions to collect samples from the ocean
- The volume and variety of the datasets make it difficult to integrate, explore and/or summarize them
- Each metagenomic dataset contains an incredibly diverse mixture of microorganisms present in the ocean, and extracting the sequences related to known or unknown organisms of interest is a major computational and data management challenge
- Currently, correlating metadata with genomic sequences to extract insights requires one-off solutions that can’t be easily applied to different, but related, problems
In short, the Chisholm Lab has the classic “big data” problem: data volume, variety and velocity that limit data exploration by scientists and thus may delay or prevent important discoveries. This is becoming a common problem for scientific researchers everywhere, across many sub-disciplines.
The data include:
Raw DNA sequence data: The largest component of the data. The sequence data for each sample is split into two primary files (one for the beginning of each sequence and one for the end), with an average of about 20 million unique sequences per sample.
Discrete sample metadata: This data is provided by the GEOTRACES consortium of marine chemists who collected the seawater samples on cruises. This piece of the dataset is a large table containing information about the concentration of different metals in the water at each site as well as information about other chemical and physical properties of the seawater (e.g., macro-nutrient concentration, temperature). Nearly 500 different entities are measured, and the results are stored in a Postgres relational database.
Sensor metadata: This data also comes from the GEOTRACES consortium. This piece of the data contains information about where each of the samples came from and information such as light levels and salinity.
Cruise reports: Free-form text reports written by researchers on board each voyage.
Streaming data: This data comes from the SeaFlow underway flow cytometer system. The ISTC will use this system as a simulator for analyzing future data streams of microbial abundance measurements onboard research vessels. Created at the University of Washington by Jarred Swalwell and the lab of Virginia Armbrust, the SeaFlow System is designed to continuously measure the abundance and composition of microbial populations, making it possible to analyze the equivalent of one sample every three minutes.
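As a rough illustration of what analyzing such a stream could involve, the sketch below keeps a rolling mean over the most recent measurements. The three-minute interval comes from the description above. The `rolling_mean` helper, the one-hour default window, and the sample values are assumptions made for illustration and are not part of SeaFlow or BigDAWG.

```python
from collections import deque
from typing import Deque, Iterable, Iterator

SAMPLE_INTERVAL_MIN = 3  # roughly one sample every three minutes


def rolling_mean(stream: Iterable[float], window_minutes: int = 60) -> Iterator[float]:
    """Yield the mean of the most recent measurements after each new one."""
    window_size = max(1, window_minutes // SAMPLE_INTERVAL_MIN)
    window: Deque[float] = deque(maxlen=window_size)
    total = 0.0
    for value in stream:
        if len(window) == window_size:
            total -= window[0]  # the oldest value is about to be evicted
        window.append(value)
        total += value
        yield total / len(window)


# Made-up abundance readings; a 6-minute window holds two samples.
smoothed = list(rolling_mean([1.0, 2.0, 3.0, 4.0], window_minutes=6))
print(smoothed)  # [1.0, 1.5, 2.5, 3.5]
```

At one sample every three minutes, the default one-hour window covers 20 measurements, so the state kept per stream stays small and constant.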
Overcoming the Challenges with BigDAWG
In tackling the Lab’s challenges, BigDAWG developers are asking questions such as:
- How can we make it easier for scientists to quickly explore the entire dataset?
- How can we improve the efficiency of expeditions by enabling the cruises to make sense of diverse data types, in real time, to inform sampling decisions?
- How can we assign genes to known reference organisms by aligning them against a reference database that’s regularly being updated?
- How can we enable biologists to analyze multiple types of sequence data simultaneously (e.g., single cell genomes vs. whole community metagenomes)?
- How can we accommodate the scale of data along with visualizations so that researchers can look simultaneously across the many different dimensions?
In each case, the “devil is in the diversity”: getting different kinds of data to work together in a way that lets scientists get answers without having to know which databases or tools to use.
The BigDAWG architecture is based on the reality that “one-size-fits-all” databases don’t work. It is well accepted today that purpose-built databases are optimal for the different types of data (streaming, structured, semi-structured, graphical, etc.). It’s also obvious that it’s no longer practical to bring the data to the analytics, for example through repetitive ETL, building data warehouses, or throwing everything into a data lake and letting users fend for themselves. We need to bring the analytics to the data, and do this with simplicity and scale.
However, integrating multiple databases so that users can just get their questions answered is a programming and administrative nightmare.
The BigDAWG polystore system enables:
- Global queries that can be run on any local engine (providing location transparency)
- Programmers to use common programming and data models that can operate efficiently
- Administrators to move data around, for optimization, load balancing and so on
- A repeatable, efficient system for data integration and consolidation
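The first and third capabilities in the list above can be illustrated with a toy model. In this sketch, a catalog records which engine holds each dataset, so callers query by dataset name alone, and an administrator can move data between engines without changing the caller's code. This is not the BigDAWG API: the `Polystore`, `RowEngine` and `ColumnEngine` classes, their methods, and the sample rows are all hypothetical stand-ins.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


class RowEngine:
    """Stand-in for a relational engine: keeps each dataset as a list of rows."""

    def __init__(self) -> None:
        self._tables: Dict[str, List[Row]] = {}

    def store(self, name: str, rows: List[Row]) -> None:
        self._tables[name] = [dict(row) for row in rows]

    def fetch(self, name: str) -> List[Row]:
        return [dict(row) for row in self._tables[name]]

    def drop(self, name: str) -> None:
        del self._tables[name]


class ColumnEngine:
    """Stand-in for an array engine: keeps each dataset column by column.

    Assumes every row of a dataset has the same keys.
    """

    def __init__(self) -> None:
        self._arrays: Dict[str, Dict[str, List[Any]]] = {}

    def store(self, name: str, rows: List[Row]) -> None:
        columns: Dict[str, List[Any]] = {}
        for row in rows:
            for key, value in row.items():
                columns.setdefault(key, []).append(value)
        self._arrays[name] = columns

    def fetch(self, name: str) -> List[Row]:
        columns = self._arrays[name]
        if not columns:
            return []
        length = len(next(iter(columns.values())))
        return [{key: col[i] for key, col in columns.items()} for i in range(length)]

    def drop(self, name: str) -> None:
        del self._arrays[name]


class Polystore:
    """Routes queries by dataset name, hiding which engine holds the data."""

    def __init__(self, engines: Dict[str, Any]) -> None:
        self._engines = engines
        self._catalog: Dict[str, str] = {}  # dataset name -> engine name

    def load(self, name: str, rows: List[Row], engine: str) -> None:
        self._engines[engine].store(name, rows)
        self._catalog[name] = engine

    def location(self, name: str) -> str:
        return self._catalog[name]

    def query(self, name: str, predicate: Callable[[Row], bool] = lambda row: True) -> List[Row]:
        engine = self._engines[self._catalog[name]]
        return [row for row in engine.fetch(name) if predicate(row)]

    def migrate(self, name: str, target: str) -> None:
        source = self._catalog[name]
        if source == target:
            return
        self._engines[target].store(name, self._engines[source].fetch(name))
        self._engines[source].drop(name)
        self._catalog[name] = target


store = Polystore({"relational": RowEngine(), "array": ColumnEngine()})
# Made-up metadata rows for illustration.
store.load(
    "sample_metadata",
    [{"sample": "s1", "depth_m": 50}, {"sample": "s2", "depth_m": 200}],
    engine="relational",
)

def is_shallow(row: Row) -> bool:
    return row["depth_m"] < 100

before = store.query("sample_metadata", is_shallow)
store.migrate("sample_metadata", "array")  # administrator moves the data
after = store.query("sample_metadata", is_shallow)  # caller's query is unchanged
print(before == after, store.location("sample_metadata"))
```

The toy filters rows in Python after fetching them, which a real system would avoid by pushing the work down to the engine that holds the data. The sketch is only meant to show the routing idea.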
Conceptually, BigDAWG follows two principles: 1) The future of data management and analytics will be defined by working with disparate data sources and 2) There is no single language or model that will work efficiently for all such datasets.
The BigDAWG project team will provide an update on this work at the ISTC for Big Data Annual Research Retreat on August 24 and 25 in Hillsboro, Oregon.
BigDAWG team members will demonstrate how Chisholm Lab scientists will be able to:
- Quickly explore the massive dataset
- Visualize real-time and historical data from on board the boats, enabling navigation to be adjusted in real time to find optimal seawater samples
- Analyze DNA sequence data at scale quickly and granularly, including finding potentially important outliers
- See the relationships between four different types of microorganisms, enabling the creation of 3D predictive models
The ISTC previously demonstrated BigDAWG’s ability to handle complex datasets from medicine (MIMIC II).
Watch this blog for more detailed updates on how BigDAWG will enable fast, efficient and scalable analytics of ocean metagenomics data for the Chisholm Lab.