Standardizing Query Access to Array Databases

By Rebecca Yale Taft, MIT CSAIL

Array-based DBMSs are becoming more important for the scientific community because of the complexity and size of the databases that scientists usually work with. But to date, array-based systems have lacked a standardized query language like SQL.

At the recent ISTC for Big Data Research Retreat in California, Dave Maier of Portland State University reported on progress in his work helping ArrayQL and R become standard front-end interfaces for SciDB and other array-based databases.

ArrayQL is an array query language designed to work with the latest array-based DBMSs including SciDB, SciQL, Rasdaman, and SLAC. Its goal is to combine familiar concepts from array algebra with SQL-style querying, and to standardize the result as a language for array-based databases. At the moment, one of the main distinctions between ArrayQL and SQL is ArrayQL’s support for array dimensions and simple array algebra. Plans have been laid out to add simple matrix operations such as transpose, as well as more complex operations such as matrix multiplication.
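To make the idea of dimensions and matrix operations concrete, here is a minimal Python sketch. It is not ArrayQL syntax and not how any of the systems above store data; the dictionary-of-cells representation and the function names `transpose` and `matmul` are assumptions chosen for illustration. Each cell is addressed by its dimension coordinates, so a transpose is a relabeling of dimensions and a multiplication is a join on the shared dimension followed by a sum.

```python
from collections import defaultdict

# Hypothetical representation: a 2-D array is a dict that maps
# (i, j) dimension coordinates to a cell value. Missing cells are empty.

def transpose(a):
    """Swap the two dimensions of every cell."""
    return {(j, i): v for (i, j), v in a.items()}

def matmul(a, b):
    """Join cells of a and b on the shared dimension, then sum the products."""
    # Index b by its first dimension so matching cells are found quickly.
    b_rows = defaultdict(list)
    for (k, j), v in b.items():
        b_rows[k].append((j, v))

    out = defaultdict(float)
    for (i, k), va in a.items():
        for j, vb in b_rows.get(k, []):
            out[(i, j)] += va * vb
    return dict(out)

a = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
a_t = transpose(a)
a_squared = matmul(a, a)
```

In a relational system the same multiplication would need an explicit join and GROUP BY over coordinate columns; treating dimensions as first-class is what lets an array language express it as a single operation.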

Could Array-Based Analytics Help “Stressed Metacities”?

One exciting application of this new technology is in “Urban Informatics.” For example, Singapore has a number of projects under way, including the Future Cities Lab and LIVE Singapore!. These are research centers that are working on making “stressed metacities” more sustainable through petascale sensing and real-time decision-making. Array-based databases and languages could aid these projects by supporting a city-scale “nowcast” that monitors temporal and spatial data and performs interpolation and modeling. Intelligent signals could be generated, including travel time, energy use, safety, public health, air quality, and micro-weather forecasts. One of the biggest contributions could be a “nowcast” capable of sensing both air quality and people’s locations, and determining a program to reduce human exposure to air pollution.
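As a rough illustration of the interpolation step, the Python sketch below estimates air quality between sensors and combines it with head counts to get an exposure figure. The choice of inverse-distance weighting, the function names `idw` and `exposure`, and all numbers are assumptions made for this example; the projects mentioned above may use entirely different models.

```python
import math

def idw(sensors, x, y, power=2.0):
    """Inverse-distance-weighted estimate at (x, y).

    sensors is a non-empty list of (sx, sy, reading) tuples.
    """
    num = 0.0
    den = 0.0
    for sx, sy, reading in sensors:
        d = math.hypot(x - sx, y - sy)
        if d == 0.0:
            # The query point sits on a sensor: use its reading directly.
            return reading
        w = 1.0 / d ** power
        num += w * reading
        den += w
    return num / den

def exposure(cells, sensors):
    """Sum of people * estimated pollutant level over (x, y, people) cells."""
    return sum(people * idw(sensors, x, y) for x, y, people in cells)

# Made-up sensor readings and a single populated grid cell.
sensors = [(0.0, 0.0, 10.0), (2.0, 0.0, 30.0)]
midpoint_estimate = idw(sensors, 1.0, 0.0)
total_exposure = exposure([(1.0, 0.0, 5)], sensors)
```

Both the sensor readings and the head counts are naturally indexed by space and time, which is why an array data model fits this kind of workload.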

Combining the Power of R and SciDB

Dave Maier also spoke about his work on making R an optional front-end to SciDB. Because R is one of the most popular statistical environments currently in use by the academic community, it’s natural to let people continue to use R but accelerate the back-end computations with SciDB. In practice, some operations run faster in R and some run faster in SciDB, depending on the operation and the size of the array. The team’s work has been to determine when it makes sense to execute a query against the data in SciDB and when it makes sense to let R handle it, while keeping the details transparent to the user. Other questions they are investigating include how to minimize data movement using a cost-based optimizer, how to find an optimal staging of each operation given the cost of data movement, and how to minimize the size of intermediate results.
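The staging question can be sketched as a small dynamic program. The Python example below is a simplified illustration, not the team’s optimizer: the function `plan`, the engine labels, the per-operation costs, and the single flat `move_cost` are all hypothetical. For a linear pipeline it picks, for each operation, whether to run in R or in SciDB so that execution cost plus data-movement cost is minimal.

```python
def plan(ops, move_cost, start="scidb"):
    """Choose an engine per operation to minimize total cost.

    ops: list of (name, {"r": cost, "scidb": cost}) in pipeline order.
    move_cost: assumed flat cost of shipping data between engines.
    start: engine where the data lives before the first operation.
    Returns (total_cost, [(name, engine), ...]).
    """
    # best[engine] = cheapest (cost, placements) that leaves the data in engine
    best = {start: (0.0, [])}
    for name, costs in ops:
        new_best = {}
        for engine, run_cost in costs.items():
            candidates = []
            for prev, (total, placed) in best.items():
                hop = 0.0 if prev == engine else move_cost
                candidates.append(
                    (total + hop + run_cost, placed + [(name, engine)])
                )
            new_best[engine] = min(candidates, key=lambda c: c[0])
        best = new_best
    return min(best.values(), key=lambda c: c[0])

# Made-up costs: filtering is cheap in SciDB, the later steps are cheap in R.
ops = [
    ("filter", {"r": 5, "scidb": 1}),
    ("regress", {"r": 2, "scidb": 9}),
    ("plot", {"r": 1, "scidb": 8}),
]
cheap_move = plan(ops, move_cost=3)
costly_move = plan(ops, move_cost=100)
```

With a low movement cost the plan filters in SciDB and then moves to R once; with a high movement cost it keeps everything in SciDB. A realistic optimizer would make the movement cost depend on the size of each intermediate result, which is why shrinking intermediates matters.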

This work was done in collaboration with Patrick Leyshock, a PhD candidate in Computer Science at Portland State University, who also presented these results at the Retreat’s poster session. For more information on ArrayQL, visit the web site at http://www.xldb.org/arrayql/.

 
