ArrayQL Draft Released for Comment

 

By David Maier, Ph.D., Portland State University

At the XLDB 2012 meeting this week, the first draft of a common language for array data, ArrayQL, is being released for public comment. The goal of ArrayQL is to provide a standard means to define, query and manipulate array-structured data in a declarative fashion. (October 12, 2012 update: Here’s the video of the ArrayQL announcement at XLDB 2012.)

Array-model DBMSs store data in vectors, matrices and high-dimensional structures, which enables them to handle complex analyses on large data sets efficiently. Many Big Data applications – such as gene sequencing, document analysis and astronomical surveys – involve large volumes of array-structured data, and they need the array-manipulation capabilities that ArrayQL provides to work with that data effectively at scale.

The ArrayQL effort began at XLDB 2011, with representatives from the three major array database providers:  Rasdaman, SciDB (which includes two Big Data ISTC members, Mike Stonebraker and me) and SciQL, along with participants from XLDB representing the user community.

ArrayQL currently comprises two parts: an array algebra, meant to provide a precise semantics of operations on arrays; and a user-level language, for defining and querying arrays. The user-level language is modeled on SQL, but with extensions to support array dimensions.

It has been an interesting process. We began by discussing the essential capabilities for manipulating arrays, to serve as a target for the expressive power of the query language. It wasn’t too difficult to identify a common core of operations across the systems. My role was to produce an array algebra that defines these operations precisely. There was more diversity on other issues, such as how array data should connect with relational data. Should arrays be considered as a special kind of relation (as in SciQL)? Or should relations be a particular form of array (as in SciDB)? Or perhaps arrays are just one more data type that can be stored within relations (as in Rasdaman)?
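To make the first of these viewpoints concrete, here is a small Python sketch of an array treated as a relation: each cell becomes a tuple whose leading attributes are the dimension indices. This is only an illustration. It is not ArrayQL syntax, it does not reflect how SciQL, SciDB or Rasdaman actually represent data, and the function names `array_to_relation` and `relation_to_array` are made up for this example.

```python
def array_to_relation(array):
    """Flatten a 2-D array (a list of rows) into (i, j, value) tuples."""
    return [(i, j, value)
            for i, row in enumerate(array)
            for j, value in enumerate(row)]


def relation_to_array(relation, shape, fill=None):
    """Rebuild a dense 2-D array from (i, j, value) tuples."""
    rows, cols = shape
    array = [[fill] * cols for _ in range(rows)]
    for i, j, value in relation:
        array[i][j] = value
    return array


grid = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0]]

relation = array_to_relation(grid)

# A relational-style selection on a dimension attribute acts like a slice:
# here it picks out the cells of column 1.
column_1 = [(i, j, v) for (i, j, v) in relation if j == 1]
```

Under this view a slice is just a selection on the dimension attributes, while the opposite view would treat a table as a one-dimensional array of records. The round trip through `relation_to_array` recovers the original grid, which is the sense in which the two representations carry the same information.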

We couldn’t do it all this time around. Our emphasis was on the initial processing that a scientist or analyst would want to do with large, disk-resident arrays in order to obtain a manageable dataset for detailed manipulation and visualization. Later versions will address constraints and updates.

We want array-database providers to support ArrayQL, so that developers of array-intensive applications can easily port their applications across array engines and take advantage of each engine’s performance and scalability characteristics. We also expect array-database providers to distinguish their offerings through tools and additional interfaces (for example, direct connections to the array engine from Matlab or R).

The ArrayQL draft is available here. We invite your comments.
