By Jack Dongarra, University of Tennessee Knoxville and Innovative Computing Laboratory

The GenBase benchmark was developed as a collaboration among the Intel Parallel Computing Lab, the Broad Institute and Novartis, and the MIT Database Group. Among the many challenging tests in the benchmark is the computation of the Singular Value Decomposition (SVD) of a large matrix (tens of thousands of rows and/or columns). The reason for this particular computation is that we can now obtain gene correlation matrices from the lab, but we do not know how much information a correlation matrix of size N actually contains, so we turn to the SVD, which, among other things, answers this question from the linear algebra standpoint.
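
As a purely illustrative sketch of how the SVD answers this information-content question, the snippet below builds a small synthetic data set driven by five hidden factors and recovers that number from the singular values of its correlation matrix. All sizes and names here are made up for the example; this is not GenBase code.

```python
# Sketch: estimating how much information a correlation matrix carries
# by counting its significant singular values (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "gene expression" data: 500 samples of 50 genes that are
# exact linear combinations of only 5 underlying factors.
factors = rng.standard_normal((500, 5))
mixing = rng.standard_normal((5, 50))
data = factors @ mixing

corr = np.corrcoef(data, rowvar=False)          # 50x50 correlation matrix
s = np.linalg.svd(corr, compute_uv=False)       # singular values, descending

# Effective rank: singular values above a small relative tolerance.
# With noiseless rank-5 data, exactly 5 stand out.
rank = int((s > 1e-8 * s[0]).sum())
print("effective rank:", rank)                  # → 5
```

The 50-by-50 matrix has 50 rows and columns, but the SVD reveals that only five directions carry information; the same diagnostic applies unchanged to the benchmark's much larger matrices.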

The problem is that the SVD is a computationally intensive process. To be exact, we need (8/3)N^{3} operations to compute it. N gets large quickly, especially when working with the human genome, which has some 23,000 genes. Even for N=100,000 there are over 2.5*10^{15} floating-point operations to perform, and doubling the matrix size, which is not uncommon, multiplies the work by eight (the cubic complexity is at fault here). This, of course, translates into long computing times.
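
The arithmetic is easy to check; the snippet below simply evaluates the (8/3)N^{3} formula from the text:

```python
# Flop count for the SVD of an N x N matrix, per the (8/3)N^3 estimate.
def svd_flops(n):
    return (8.0 / 3.0) * n**3

f_100k = svd_flops(100_000)
print(f"N=100,000: {f_100k:.2e} flops")          # ≈ 2.67e+15, over 2.5*10^15
print(f"doubling N: {svd_flops(200_000) / f_100k:.0f}x the work")  # 8x
```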

How long? A modern Intel processor core delivers about 50 Gflop/s (50*10^{9} floating-point operations per second). This is only true for the newest Intel architectures (code-named Haswell) that feature the AVX2 instruction set. A quick calculation of the time-to-solution reveals that it would take over 14 hours to complete the SVD of a single matrix at this approximate rate. A recent high-end server-grade Intel Xeon processor (the E7 v2 series) has 15 cores, so we can scale the computation time down to about one hour.
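
Combining the flop count with the 50 Gflop/s per-core rate reproduces these time-to-solution figures (assuming, optimistically, ideal scaling across cores):

```python
# Back-of-the-envelope time-to-solution for the N=100,000 SVD.
flops = (8 / 3) * 100_000**3      # ≈ 2.67e15 operations
per_core = 50e9                   # 50 Gflop/s on one AVX2-capable core

hours_one_core = flops / per_core / 3600
hours_15_cores = hours_one_core / 15   # ideal scaling to 15 cores

print(f"one core: {hours_one_core:.1f} h")    # ≈ 14.8 h
print(f"15 cores: {hours_15_cores:.2f} h")    # ≈ 0.99 h
```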

Unfortunately, modern linear algebra libraries, commercial or open source, do not compute the SVD at a rate of 50 Gflop/s per core. The number is closer to 20 Gflop/s, and it does not scale with the number of cores. This limitation stems from the algorithm chosen for the implementation: the computation's main bottleneck is memory bandwidth, a resource that grows ever scarcer as computational capacity increases with each advance dictated by Moore's Law. Newer algorithms exist that suffer much less from the bandwidth problem, but they increase the computational burden even further without speeding up the calculation sufficiently beyond the bandwidth limit.

At this point, the linear algebra research at the University of Tennessee, Knoxville comes into play. The following reasoning led us to a substantial improvement in the state of the art, in terms of both algorithmic development and implementation effort [1]. The classic algorithms suffer from the bandwidth problem because of their memory-bound operations. Since these operations are necessary to obtain the SVD, we limit their effect by judiciously applying ideas from the more modern algorithms and adding new algorithms that remove the bandwidth problem and use the CPU cores with much greater efficiency.

At the implementation level, we rely on matrix-matrix operations, which reuse data in cache and ease the burden on the memory bus. Matrix-vector operations, which are necessary to obtain the final SVD form, are much more bandwidth-demanding. We still use them, but on a reduced portion of the matrix rather than on the original one. The reduced portion fits in cache, which increases the execution rate because cache memories offer much higher bandwidth and are private to each core of a multicore processor. There is a penalty for using a different algorithm: an increase in the number of operations. Fortunately, the increased execution rate and the scalability with the number of cores more than make up for it.
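
A rough flops-per-byte comparison illustrates why matrix-matrix operations ease the bandwidth burden while matrix-vector operations do not. The block size and the traffic estimate below are illustrative modeling assumptions, not measurements:

```python
# Arithmetic intensity (flops per byte of memory traffic) for the two
# kinds of kernels, in double precision (8 bytes per element).
n = 10_000        # matrix dimension (illustrative)
b = 256           # cache-block dimension (illustrative)

# Matrix-vector multiply: 2n^2 flops, and every one of the n^2 matrix
# elements must be streamed from memory exactly once.
matvec_intensity = (2 * n**2) / (8 * n**2)        # = 0.25 flop/byte

# Blocked matrix-matrix multiply: 2b^3 flops per b x b block update,
# touching about 3b^2 elements that stay resident in cache.
matmul_intensity = (2 * b**3) / (8 * 3 * b**2)    # = b/12 ≈ 21 flop/byte

print(f"matrix-vector: {matvec_intensity:.2f} flop/byte")
print(f"matrix-matrix (block {b}): {matmul_intensity:.1f} flop/byte")
```

The matrix-vector kernel performs only a quarter of a flop per byte moved, so its speed is pinned to the memory bus, while the blocked matrix-matrix kernel does dozens of flops per byte and can keep the cores busy; shrinking the matrix-vector work to a cache-resident portion moves it onto the fast side of this divide.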

## References

[1] Azzam Haidar, Piotr Luszczek, Jakub Kurzak, and Jack Dongarra. An Improved Parallel Singular Value Algorithm and Its Implementation for Multicore Hardware. In Proceedings of SC13, Denver, CO, USA, November 17–21, 2013.