Advanced Array Analytics: Time Travel and Iterations

By Eugene Wu, MIT CSAIL

One area of research focus for the Intel Science and Technology Center for Big Data is improving DBMS functionality for scientists.

Not only do scientific databases such as SciDB have to iteratively analyze Big Data quickly, but they must also efficiently store previous versions of data arrays so scientists can compare previous and current versions of their data.  At the recent ISTC for Big Data Retreat in California, Magda Balazinska of the University of Washington talked about techniques to address both issues.

She briefly presented efficient versioning techniques described in a paper to be presented at ICDE 2013, “Time Travel in a Scientific Array Database.”  The paper describes a new storage manager in SciDB that combines basic techniques such as backward-delta encoding, tiling, and bit-masking to vastly improve the performance of the versioning system compared to SciDB's current design. The authors also extended the versioning system to support approximate results and non-consecutive backward deltas, called “skip links,” that let queries jump over intermediate versions.
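The core idea of backward-delta encoding can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual storage manager: the newest version is stored in full (since recent versions are queried most often), and each older version is stored as a sparse map of the cells that differ from the next-newer version. A skip link is simply a delta that jumps directly to a much older version, shortening the chain of patches a query must apply.

```python
# Backward-delta versioning sketch (illustrative only, using plain Python
# lists as 1-D arrays; the real system works on tiled, bit-masked chunks).

def make_delta(newer, older):
    """Delta from a newer version back to an older one:
    a sparse map {cell index: old value} of the cells that changed."""
    return {i: old for i, (new, old) in enumerate(zip(newer, older))
            if new != old}

def apply_delta(newer, delta):
    """Reconstruct the older version by patching the newer one."""
    arr = list(newer)
    for i, v in delta.items():
        arr[i] = v
    return arr

# Three versions of a small array; only the head (v3) is stored in full.
v1 = [1, 1, 1, 1]
v2 = [1, 5, 1, 1]
v3 = [1, 5, 9, 1]

d3_to_2 = make_delta(v3, v2)   # backward delta from head to v2
d2_to_1 = make_delta(v2, v1)   # backward delta from v2 to v1
d3_to_1 = make_delta(v3, v1)   # a "skip link": head straight to v1

# Reading an old version chases backward deltas from the head...
assert apply_delta(apply_delta(v3, d3_to_2), d2_to_1) == v1
# ...or follows a skip link in a single hop.
assert apply_delta(v3, d3_to_1) == v1
```

The trade-off skip links expose: they cost extra storage (each delta spans more changed cells) but cap the number of patches needed to materialize a distant version.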

She also talked about their group’s direction towards efficiently running iterative computations on arrays, which is a common task in scientific array analyses. As an example, she illustrated an astronomy algorithm called “sigma-clipping” followed by image co-addition.

In astronomy, some sources are too faint to be detected in a single image but can be detected by stacking multiple images. The per-pixel summation over all images is called “co-add.” Astronomers run an iterative noise-reduction algorithm (sigma-clipping) before performing the co-add.  The algorithm repeatedly compares each pixel location across the stacked images and removes extreme outliers, so that each pixel in the final image is derived from the most “stable” pixels in the stack.  Other examples of iterative array computations include tracking simulated object trajectories and a Friends-of-Friends clustering algorithm.
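The two steps can be sketched compactly. The code below is an illustrative pure-Python version (real pipelines run this over large image stacks inside the array DBMS); the clipping threshold `k` and the toy pixel values are assumptions for the example.

```python
# Sigma-clipping followed by co-addition, per pixel location.
from statistics import mean, pstdev

def sigma_clip(values, k=1.5, max_iters=10):
    """Iteratively drop values more than k standard deviations
    from the mean, until the set stabilizes."""
    vals = list(values)
    for _ in range(max_iters):
        m, s = mean(vals), pstdev(vals)
        kept = [v for v in vals if abs(v - m) <= k * s]
        if not kept or len(kept) == len(vals):
            break
        vals = kept
    return vals

def coadd(stack, k=1.5):
    """For each pixel location, sigma-clip the values across the
    image stack, then sum the surviving values."""
    n_pixels = len(stack[0])
    return [sum(sigma_clip([img[i] for img in stack], k))
            for i in range(n_pixels)]

# Five 2-pixel "images"; pixel 0 of the last image is a cosmic-ray outlier.
stack = [[10, 3], [11, 3], [10, 3], [9, 3], [100, 3]]
print(coadd(stack))   # the 100 is clipped before summation
```

Note why this is iterative: removing an outlier changes the mean and standard deviation, which can expose further outliers on the next pass, so the clipping must repeat until a fixed point.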

Professor Balazinska described possible implementations of this iterative array processing. The naive way is to use a driver program and express the body of the iterative loop as a series of queries (in SciDB's AQL), which can be very slow.  Optimization opportunities include incremental, asynchronous, and prioritized evaluation.  She also briefly mentioned work in progress on special fault tolerance techniques for iterative algorithms.
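The naive driver-program pattern looks like the following. This is a hedged sketch: the smoothing step is a hypothetical stand-in for a real loop body, and in practice each call to `step` would be an AQL query shipped to the DBMS rather than local Python.

```python
# Naive driver-program loop: re-run the whole query each iteration
# until the array reaches a fixed point.

def step(arr):
    """One iteration body: average each cell with its neighbors
    (edges replicated) -- a stand-in for a per-iteration AQL query."""
    n = len(arr)
    return [(arr[max(i - 1, 0)] + arr[i] + arr[min(i + 1, n - 1)]) / 3.0
            for i in range(n)]

def run(arr, eps=1e-3, max_iters=100):
    """Drive the loop from the client, testing convergence after
    every full recomputation."""
    for it in range(max_iters):
        nxt = step(arr)
        if max(abs(a - b) for a, b in zip(arr, nxt)) < eps:
            return nxt, it + 1
        arr = nxt
    return arr, max_iters

result, iters = run([0.0, 3.0, 0.0])
print(result, iters)
```

The inefficiency is visible in `run`: every iteration recomputes every cell, even cells whose neighborhoods did not change. An incremental strategy would recompute only affected cells; asynchronous and prioritized variants relax the lockstep iterations and order work by expected impact.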


