By Rebecca Yale Taft, MIT CSAIL

One area of research focus for the Intel Science and Technology Center for Big Data is investigating the tight integration of array DBMSs with ScaLAPACK, a high-performance linear algebra software package, to create a faster, more efficient way to analyze very large, complex data sets.

At the ISTC’s recent Research Retreat, ISTC Principal Investigator Jack Dongarra and Thomas Herault from the University of Tennessee discussed their latest work in this area. The goal of their research is to help SciDB, a popular array-based DBMS, perform common linear algebra operations competitively while implementing fault tolerance and taking advantage of coprocessors.

**Capitalizing on Coprocessors**

Dongarra and Herault started by reviewing the existing linear algebra packages: BLAS (Basic Linear Algebra Subprograms), which provides basic dense operations such as vector and matrix multiplication; LAPACK (Linear Algebra PACKage), which builds on BLAS to handle more complex sequential problems such as linear least squares, eigenvalue problems, and singular value decomposition; and ScaLAPACK (Scalable Linear Algebra PACKage), a version of LAPACK designed for parallel distributed-memory machines.
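To make the division of labor concrete, here is a minimal pure-Python sketch of what a Level-3 BLAS matrix-multiply routine (GEMM) computes. The function name and list-of-lists representation are illustrative only; real BLAS implementations are highly tuned Fortran and C kernels operating on flat arrays.

```python
def gemm(alpha, A, B, beta, C):
    """Compute C <- alpha*(A @ B) + beta*C, the Level-3 BLAS GEMM operation.

    A is m x k, B is k x n, C is m x n, all as lists of lists of floats.
    """
    m, k, n = len(A), len(B), len(B[0])
    for i in range(m):
        for j in range(n):
            # Dot product of row i of A with column j of B.
            acc = sum(A[i][p] * B[p][j] for p in range(k))
            C[i][j] = alpha * acc + beta * C[i][j]
    return C

A = [[1.0, 2.0],
     [3.0, 4.0]]
B = [[5.0, 6.0],
     [7.0, 8.0]]
C = [[0.0, 0.0],
     [0.0, 0.0]]
gemm(1.0, A, B, 0.0, C)
# C is now [[19.0, 22.0], [43.0, 50.0]]
```

LAPACK routines are built by calling operations like this on blocks of a matrix, and ScaLAPACK distributes those blocks across the nodes of a cluster.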

Next they discussed some of the newer packages under development, including MAGMA (Matrix Algebra on GPU and Multi-core Architectures), PLASMA (Parallel Linear Algebra for Scalable Multicore Architectures), and DPLASMA (Distributed PLASMA). These packages all build on BLAS and LAPACK but can take advantage of extremely fast coprocessors such as Intel’s Xeon Phi. DPLASMA is the latest effort; it uses a new distributed Directed Acyclic Graph (DAG) engine to improve performance.
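The key idea behind a DAG engine is to express a computation as tasks with dependency edges and release each task the moment its inputs are ready. The following sequential sketch (the `run_dag` function and its data layout are inventions for illustration, not DPLASMA's API) shows only the ordering logic; a real engine would dispatch ready tasks concurrently across cores and nodes.

```python
from collections import deque

def run_dag(tasks, deps):
    """Execute tasks in an order that respects dependency edges.

    tasks: dict mapping task name -> callable.
    deps:  dict mapping task name -> set of prerequisite task names.
    Returns the execution order. A task runs only after all of its
    prerequisites have run (Kahn-style topological scheduling).
    """
    indegree = {t: len(deps.get(t, set())) for t in tasks}
    dependents = {t: set() for t in tasks}
    for t, prereqs in deps.items():
        for p in prereqs:
            dependents[p].add(t)
    ready = deque(t for t, d in indegree.items() if d == 0)
    order = []
    while ready:
        t = ready.popleft()
        tasks[t]()          # in a real engine, dispatched to a free core/node
        order.append(t)
        for nxt in dependents[t]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:  # all inputs ready: release the task
                ready.append(nxt)
    return order

# A tiny diamond-shaped DAG: b and c both depend on a; d depends on both.
tasks = {n: (lambda n=n: print("run", n)) for n in "abcd"}
deps = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
run_dag(tasks, deps)
```

Because `b` and `c` have no edge between them, an engine with two free workers could run them simultaneously, which is exactly the parallelism a DAG representation exposes.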

**Improving Fault Tolerance with Checkpoint on Failure**

The latest work that Dongarra and Herault spoke about was the implementation of fault tolerance in these linear algebra libraries. A fault-tolerant version of ScaLAPACK is important for SciDB because otherwise a computation would have to restart from scratch after a failure, causing unacceptable delays.

One idea is Algorithm-Based Fault Tolerance (ABFT), which embeds checksums or CRCs into factorization algorithms such as QR decomposition. But when MPI is used on a distributed system, the library’s state may be left undefined if one of the parallel processes fails. To avoid restarting the algorithm from scratch, some existing fault-tolerant systems use periodic checkpointing, but this technique adds overhead of almost 25% because it requires regular writes to disk. A new technique called Checkpoint on Failure reduces this overhead by writing only the checkpoints that are actually needed: each process saves its state only when a failure occurs. This cuts the overhead of fault tolerance to about 10%, and means that traditional ScaLAPACK may fairly be labeled “last-century.”
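The checksum idea at the heart of ABFT can be sketched in a few lines: augment the matrix with a row of column sums, and if one data row is later lost, reconstruct it from the checksum row. This is a toy illustration (the function names are made up, and real ABFT factorization codes maintain the checksum relationship through every update of the algorithm, not just at rest):

```python
def add_checksum_row(A):
    """Return A with an extra row holding column sums (the ABFT checksum)."""
    cols = len(A[0])
    return A + [[sum(row[j] for row in A) for j in range(cols)]]

def recover_row(Ac, lost):
    """Rebuild a lost data row: lost_row = checksum_row - sum(other rows)."""
    cols = len(Ac[0])
    others = [r for i, r in enumerate(Ac[:-1]) if i != lost]
    return [Ac[-1][j] - sum(r[j] for r in others) for j in range(cols)]

A = [[1.0, 2.0],
     [3.0, 4.0]]
Ac = add_checksum_row(A)   # checksum row is [4.0, 6.0]
recover_row(Ac, 0)         # returns [1.0, 2.0], the "lost" first row
```

Because the checksum is carried along by the linear operations of the factorization itself, a surviving process can recover a failed peer's data without any disk checkpoint, which is what makes the Checkpoint-on-Failure overhead so low.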

Further plans include coprocessor-based computing, in which threads alternate between executing and scheduling tasks, as well as data pipelining to avoid reshuffling data between operations.