In-Database Analytics for Large Array Data

UT ICL_Jack Dongarra_Piotr Luszczek_Thomas Herault_August 2017

By Jack Dongarra,* Piotr Luszczek and Thomas Herault of the University of Tennessee, Knoxville 

Performing analytics inside a database gets progressively important. In the context of SciDB, the data model involves arrays either fully populated (dense) or with empty entries (sparse).

Very quickly, then, the amount of data stored in the database grows beyond the capabilities of traditional data stores. Worse yet, the natural language of array processing includes linear algebra operators that are commonly implemented outside of a database context with High Performance Computing (HPC) and scalability in mind. The target hardware platforms feature tens or even hundreds of thousands of processor cores connected with a scalable interconnect whose price often is a substantial fraction of the system cost. Data storage is limited to parallel file systems on dedicated nodes–an I/O subsystem that is separate from the compute nodes but still connected with the same interconnect. The fault-tolerance semantics are limited to a simple checkpoint-restart scheme with only weak persistence guarantees.

This is clearly in contrast to the usual capabilities offered by most database systems that strive towards ACID and CAPS. The split between the analytics (compute) and reliable storage worlds is clear and breaching the two commonly includes data copies to fit the needs. But with data growth, the copies become a significant overhead with increasingly high associated costs of data extraction, redistribution, and copy-back operations.

We will show how in-database analytics addresses this problem directly with the use of modern database systems and established numerical algorithms.

Problem Statement

We address a common dimension reduction operation that is commonly performed through the invariant subspace transformations to arrive at SVD: Singular Value Decomposition–a reliable numerical technique that deals with two-dimensional input and produces significant data components combined with the singular values that may be regarded as estimates of importance of these components. Among numerous applications outside of the big data processing field, SVD is the main computational operation in PCA (Principal Component Analysis) and some recommender systems. Before other methods are applied, SVD gives an immediate indication of numerical properties of the matrix that models the relationship between the problem components, which may reveal the potential for further analysis.

Implementing an efficient and accurate algorithm that computes SVD must involve issues of numerical linear algebra and High Performance Computing.  Numerical linear algebra guides the choice of algorithmic components that will guarantee accurate computation in the presence of not only round-off errors but also bad conditioning of the original input data that might be contaminated by measurement errors and be singular in working floating-point precision. HPC assures an optimal solution to the modern hardware resources characterized by deep cache hierarchies, heterogeneous computing units and RAM memory. Both of these aspects of the algorithm easily exceed the expertise of developer teams for database systems. As a consequence, SVD code would often come from external sources such as open source software libraries like LAPACK[1] or ScaLAPACK[2]. However, this poses the very problem we try to address here: bridging the data gap between the database storage model and the matrix layout required for the linear algebra libraries.

Theoretical Background

The early assumption that we make here is that we will only deal with sparse problems. In database terms, the stored arrays have most of their entries empty or set to either NULL or zero. The techniques used for such sparse input matrices are much different than the ones for dense arrays because we either do not want (for efficiency) or simply cannot afford to store most of the entries. The methods of choice for large sparse systems involve iterative methods. They work by exploring the Krylov subspace:

Kn(A, x) ≡ {Ax, A2x, A3x, …, Anx}

Our input matrix A may represent, for example, gene expression correlations, most of which might be below a threshold and be considered unimportant. Clearly then, A is sparse with most entries empty. We can select an n-element vector at random and start multiplying it by the successive powers of AA1A2A3, …, An; the result of the multiplication will be getting closer to an eigen-vector of A. Using this observation leads to the so-called power method. There is more technical detail to consider such as the fact that successive powers of A will quickly loose numerical precision. But the take-away is that a lot can be achieved with fairly simple operations that the SciDB system already has: array multiplies and vector dot-products.

SciDB Primitives and Their Use in SVD

The computational primitives relevant for sparse SVD can be exposed by SciDB in a number of ways. User Defined Operators (UDOs) are written in C++ and are inserted into a SciDB running instance as a dynamic library (DLO). Another method is to use AFL or AQL queries. The former is Array Functional Language and the latter is Array Query Language. The queries are submitted through the HTTP protocol implemented on the SciDB side by the Shim plugin. These interfaces give access to native implementations of the dense matrix-matrix multiply call gemm()[3] and dense SVD call gesvd()[4]. For sparse arrays, SciDB implements matrix-matrix-multiply as a spgemm()call[5], which takes advantage of sparsity and performs the calculations on non-empty entries. Dot-product operations may be implemented in terms of more database-like calls such as apply() and aggregate(sum()). Physically, SciDB multiplies corresponding elements in the input vectors and then uses the aggregate() to sum across them. Access to all these operations goes beyond the standard technique in sparse matrix methods called reverse communication whereby only the sparse matrix-vector multiplication is off-loaded to the user and the iterative algorithm performs all the rest of the operations internally.

In our approach, all computation and all storage are off-loaded to SciDB and the algorithm itself only extracts small data items necessary to assess the progress of the iterative process. In this scenario, the user is the SciDB instance requesting a subset of singular values of the input data but neither that data nor the intermediate objects ever make it out of the database, which is beneficial for both storage and performance.

*also of Oak Ridge National Laboratory and the University of Manchester

[1] E. Anderson, Z. Bai, C. Bischof, Suzan L. Blackford, James W. Demmel, Jack J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and Danny C. Sorensen. LAPACK User’s Guide. Society for Industrial and Applied Mathematics, Philadelphia, Third edition, 1999.
[2] J. Choi, Jack J. Dongarra, Susan Ostrouchov, Antoine Petitet, David W.Walker, and R. Clint Whaley. The design and implementation of the ScaLAPACK LU, QR,and Cholesky factorization routines. Scientific Programming, 5:173–184, 1996.
[3] Antoine Petitet. Algorithmic Redistribution Methods for Block Cyclic Decompositions. Computer Science Department, University of Tennessee, Knoxville, Tennessee, December 1996. PhD dissertation.
[4] L. Suzan Blackford, J. Choi, Andy Cleary, Eduardo F. D’Azevedo, James W. Demmel, Inderjit S. Dhillon, Jack J. Dongarra, Sven Hammarling, Greg Henry, Antoine Petitet, Ken Stanley, David W. Walker, and R. Clint Whaley. ScaLAPACK Users’ Guide. Society for Industrial and Applied Mathematics, Philadelphia, 1997.
[5] John R. Gilbert, Cleve Moler, and Robert Schreiber. Sparse matrices in MATLAB: Design and implementation. SIAM Journal on Matrix Analysis and Applications, 13(1):333–356, 1992.
This entry was posted in Big Data Architecture, DBMS, High-Performance Computing, ISTC for Big Data Blog, Math and Algorithms and tagged , , , , , . Bookmark the permalink.

Leave A Reply

Your email address will not be published. Required fields are marked *

− 3 = three