By Jack Dongarra, Piotr Luszczek and Thomas Herault of the University of Tennessee Innovative Computing Laboratory

It is trite to say that traditional RDBMS optimize the data movement by bringing the query close to the data and not the other way around. It is only natural to apply this simple technique to Big Data stores, which have to perform analytics processing that includes graph traversal and statistical inference — which is at the core of machine learning algorithms. In the end, these computational methods rely on linear algebra operations that involve vectors, matrices and tensors.

The combination of the data’s inherent structure and the statistical methods at hand often call for sparse representation both in memory and on disk. This explains the resurgence of interest in graph databases and projects such as TileDB that use sparsity of the data as the primary design requirement. This poses an interesting conundrum with respect to the software stack modularization, as the computational methods for sparse matrices call for a different balance of crucial hardware characteristics when compared to dense matrix algorithms.

The dense matrix algorithms require high floating-point execution rates as they involve direct computational approaches that favor highly accurate results — as accurate as the input data. The sparse matrices, on the other hand, are much less focused on floating-point performance and instead require high memory bandwidth and the ability to alleviate the latencies inherent in the memory hierarchy. To put it simply, dense matrix operations are computationally intensive, while the sparse matrix methods are bandwidth-constrained.

It is no longer possible to take advantage of the surface-to-volume effect of dense computations (the surface being the matrix and the volume the amount of operations required to produce a result), but rather it is essential to optimize data movement between hardware and software components.

These two opposing ends of the requirements spectrum have drastic implications for the data analytics component of a database focused on sparse matrices. It is no longer possible to take advantage of the surface-to-volume effect of dense computations (the surface being the matrix and the volume the amount of operations required to produce a result), but rather it is essential to optimize data movement between hardware and software components. The matrix data can no longer be extracted from the database, an O(*n*^{2}) operation, and handed over to the analytics module for processing, an O(*n*^{3}) operation. Instead, the matrix data of size *k*, with *k* being an average number of non-zero elements per row (*k* << *n*), are involved in O(*tkn*) computations, with t being a small (*t* <<*n*) number of iterations, and depend on the data and the numerical method used for the analysis. Another complication is the fact that the total amount of data exceeds the available RAM: if it fits, then it could be stored and handled by traditional databases and algorithms.

The solution to the problems outlined above is a tight integration of the database engine and the analytics module so that they exchange the data at low overheads and small increments, which helps in pipelining the transfers from disk, through the DB system (to handle, for example, data marshaling, layout translation, or decompression) and into the compute devices (be it the CPU or a hardware accelerator such as Xeon Phi). At the right granularity, a pipeline is a simple, yet very effective tool for hiding latency and enabling parallel execution by the modules involved. And latency is a great concern as the data may need to travel from the disk a few times before the numerical method reaches the desired accuracy.

There are other benefits of storage-analytics integration; for example, the ability of the analytics module to produce hints to the storage module about the overall access pattern or, more directly, to prefetch specific parts of the data, which would allow the storage manager to issue disk seeks early and offset the associated delay so that the requested data starts streaming from the disk(s) as soon as the actual read call is issued. As the primary storage moves away from mechanical HDDs toward SSDs and NV-RAM cards, an overhead of a disk seek is increasingly becoming a relic of the past, but a prefetch instruction is a deeply entrenched staple of optimization while moving data through the multi-level cache hierarchies, coherency domains, NUMA islands, and PCI segments.

As the primary storage moves away from mechanical HDDs toward SSDs and NV-RAM cards, an overhead of a disk seek is increasingly becoming a relic of the past, but a prefetch instruction is a deeply entrenched staple of optimization while moving data through the multi-level cache hierarchies, coherency domains, NUMA islands, and PCI segments.

It seems then that sharing is caring, not only when it comes to a preschool playground, but also for performance of large-scale analytics on Big Data.