Unleashing NASA MODIS Data for Earth and Ocean Scientists

By Leilani Battle, MIT CSAIL; James Frew, University of California, Santa Barbara; and Bill Howe, University of Washington

With the massive influx of data streaming in from telescopes, satellites, sequencers, imagers and other instruments, research in the physical and biological sciences is becoming more data-driven. At the same time, data is being consolidated in public databases and shared online, giving researchers access to a wider range of data sets across many domains. In this post, we will explore the challenges associated with improving the utility of data from the NASA MODIS project.

The National Aeronautics and Space Administration (NASA) Moderate Resolution Imaging Spectroradiometer (MODIS) is a satellite instrument used to measure 36 different spectral bands, or groups of wavelengths, from the earth’s surface and atmosphere. MODIS is deployed on the NASA Terra and Aqua polar-orbiting satellites, which cover the entire earth in one- to two-day cycles. MODIS easily generates terabytes of high-precision spectral measurements on a weekly basis.

NASA’s MODIS data set is a crucial resource for many scientists across several research domains, as it allows researchers to monitor and learn models for important land/ocean systems and processes. Consider the following two examples.

SeaFlow Underway Flow Cytometry

Figure 1: The SeaFlow environmental flow cytometer

The SeaFlow environmental flow cytometer, developed at the University of Washington by Jarred Swalwell, Francois Ribalet and Ginger Armbrust, passes seawater through a fine capillary and identifies individual particles by the light they scatter and the fluorescence they emit at various wavelengths. This process allows us to continuously measure the abundance and composition of microbial populations, making it possible to analyze the equivalent of one sample per kilometer, a dramatic improvement over conventional sampling techniques.

The ability to issue complex queries against the full-resolution MODIS satellite images is critical for validating this instrument: the population counts measured directly by SeaFlow are compared with estimates of microbial concentrations inferred from chlorophyll measurements, which are in turn inferred from sea color. Without SciDB (the distributed array-based database), only the heavily down-sampled web data products are available for reference, and they are too inaccurate to compare meaningfully with a local, high-resolution in situ instrument like SeaFlow. As the number of SeaFlow-equipped vessels increases (deployments are planned for entire fleets of shipping vessels), the ongoing integration of satellite data with SeaFlow data at scale with SciDB becomes increasingly critical.

Figure 2: Example of flow cytometric signatures of phytoplankton populations in the North Pacific Ocean. (a) Red fluorescence from chlorophyll versus forward light scattering (a proxy of cell size) uniquely identified five distinct phytoplankton populations: large and small elongated phytoplankton, large and small nanoplankton, and ultraplankton. (b) Orange fluorescence from phycoerythrin versus forward light scattering was used to identify the cyanobacteria Synechococcus, cryptophytes and fluorescent microspheres (beads) added as an internal standard.

Fractional Snow Cover

More than a billion people depend on seasonal snow cover or mountain glaciers for their water supplies. Accurate monitoring of the state and extent of mountain snow cover is critical to the management of water supplies in a changing climate. Researchers at UCSB and JPL have developed a method for retrieving fractional snow covered area (fSCA) from MODIS satellite imagery, which delivers greatly improved snow mapping accuracy over traditional thresholding techniques.

Basically, the fSCA method compares the “color” of each satellite image pixel (the “color” in this case being composed of both visible and infrared light, not just the red, green, and blue that humans perceive as color) to a set of reference colors mixed from the colors of known landscape features, like snow, rocks, plants, and soil. The reference color that’s the best match is assumed to indicate the actual mixture of landscape features in the pixel. Thus, we can say which fraction (say, 20%) of the pixel is snow-covered, as opposed to simply saying it does or doesn’t contain snow.

Figure 3 shows the difference between mapping fractional snow cover and simply deciding whether a pixel is or isn’t snow-covered.

The fSCA method is simple to describe but computationally intensive to execute: each pixel in an image must be compared against reference colors built by mixing a library of landscape feature colors in multiple proportions. Several thousand comparisons per pixel are necessary to find the best mixture.
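To make the per-pixel search concrete, here is a minimal Python sketch of the idea described above. It is an illustration under stated assumptions, not the UCSB/JPL implementation: the reflectance values in `ENDMEMBERS` are made up, the four-band layout is hypothetical, and the brute-force search over evenly spaced fractions stands in for whatever optimization the real method uses.

```python
from itertools import product

# Hypothetical reflectances in four bands (visible through infrared).
# These numbers are invented for illustration; they are not real MODIS spectra.
ENDMEMBERS = {
    "snow":       [0.95, 0.90, 0.60, 0.10],
    "rock":       [0.20, 0.25, 0.30, 0.35],
    "vegetation": [0.05, 0.10, 0.50, 0.25],
}


def candidate_mixtures(names, steps):
    """Yield every {name: fraction} dict whose fractions sum to 1,
    using increments of 1/steps."""
    for counts in product(range(steps + 1), repeat=len(names)):
        if sum(counts) == steps:
            yield {name: c / steps for name, c in zip(names, counts)}


def mix(fractions):
    """Return the reference "color" for a linear mixture of endmembers."""
    n_bands = len(next(iter(ENDMEMBERS.values())))
    return [
        sum(fractions[name] * ENDMEMBERS[name][band] for name in fractions)
        for band in range(n_bands)
    ]


def squared_error(pixel, reference):
    """Sum of squared per-band differences between two spectra."""
    return sum((p - r) ** 2 for p, r in zip(pixel, reference))


def best_mixture(pixel, steps=10):
    """Compare the pixel to every candidate mixture and keep the closest one."""
    names = list(ENDMEMBERS)
    return min(
        candidate_mixtures(names, steps),
        key=lambda fractions: squared_error(pixel, mix(fractions)),
    )


def snow_fraction(pixel, steps=10):
    """Fraction of the pixel attributed to snow by the best-matching mixture."""
    return best_mixture(pixel, steps)["snow"]
```

Even this toy version evaluates 66 mixtures per pixel for three endmembers at 10% increments. A larger library or finer increments quickly pushes that into the thousands, which is why the cost adds up over a full image.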

Challenges Using MODIS Data

Given the massive amount of data to process and the complexity of raw low-level MODIS data, researchers are only able to use down-sampled, pre-generated MODIS data products. Examples are the MODIS land products and MODIS ocean color and sea surface temperature products, which are used in the examples above. The MODIS data products have a specific set of variables available for each product, making ad-hoc analysis difficult, as researchers have no way to generate new variables or recompute variables for comparison. Researchers also have no control over the provenance or precision of the pre-computed data, making debugging their workflows and explaining anomalies and errors difficult.

The best-case scenario for researchers would be to have direct access to a massive database of MODIS data, where they can execute queries to perform ad-hoc analysis, and the computations used to generate common variables are easily accessed through stored procedures. Gary Planthaber et al. made a first step in this direction with the EarthDB system, which is a full end-to-end system able to load low-level MODIS land data directly into the distributed array-based database SciDB. The EarthDB system gives researchers the ability to perform fine-tuned analyses on arbitrary regions and properties of low-level MODIS land data. The images below were generated from EarthDB query results, and they only scratch the surface of what is possible with EarthDB.

Figure 4: Red, green and blue (RGB) composite image, with RGB values computed using the EarthDB system (from the EarthDB paper).

Figure 5: Visualization of the Normalized Difference Vegetation Index (NDVI) over southern California and Mexico, with NDVI values calculated using EarthDB (from the EarthDB paper).
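For readers unfamiliar with NDVI, the index is the standard ratio (NIR - Red) / (NIR + Red), computed per pixel from red and near-infrared reflectance. The Python sketch below shows that arithmetic on small nested lists. It is not the EarthDB query: the function names are hypothetical, and it assumes the two bands have already been extracted as same-sized grids of reflectance values.

```python
def ndvi(red, nir):
    """Normalized Difference Vegetation Index for one pixel.

    Returns None when both reflectances are zero, since the ratio
    is undefined in that case.
    """
    denominator = nir + red
    if denominator == 0:
        return None
    return (nir - red) / denominator


def ndvi_grid(red_band, nir_band):
    """Apply ndvi() cell by cell to two same-shaped grids (lists of rows)."""
    return [
        [ndvi(red, nir) for red, nir in zip(red_row, nir_row)]
        for red_row, nir_row in zip(red_band, nir_band)
    ]
```

Values near 1 indicate dense green vegetation, while values near zero or below typically correspond to bare ground, water, or snow.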

The biggest challenge in extending the EarthDB concept, and our current focus, is re-implementing domain-specific variables necessary for research analysis. For example, it is non-trivial to compute chlorophyll measurements on the raw MODIS data, a variable provided directly in the MODIS ocean product.


Leilani Battle is a third-year graduate student at MIT, working with Mike Stonebraker and Sam Madden in the CSAIL Database Research Group. Her current work focuses on producing scalable end-to-end data visualization systems by leveraging the computational power of database systems on the back-end.

James Frew is an Associate Professor of Environmental Informatics in the Bren School of Environmental Science and Management at the University of California at Santa Barbara and a principal investigator in UCSB’s Institute for Computational Earth System Science (ICESS).  His current research focuses on the discovery, provenance, curation, and immersive visualization of geospatial information.

Bill Howe is the Director of Research for Scalable Data Analytics at the University of Washington eScience Institute and holds an Affiliate Assistant Professor appointment in Computer Science & Engineering, also at U.W. His research spans scientific databases, data-intensive scalable computing for science, and visual analytics.
