Numerical software of any kind depends heavily on libraries for linear algebra. Typically, these libraries follow the API laid out by the Basic Linear Algebra Subprograms (BLAS) and the Linear Algebra Package (LAPACK). As the subroutines in these libraries usually determine the overall performance of the numerical software, a lot of effort has been spent on their optimization.
This article compares the performance of some commonly used math libraries by applying an LU factorization to dense, double-precision complex matrices (ZGETRF).
For best performance, ATLAS is often chosen, as it automatically tunes the BLAS routines (and some LAPACK routines as well) to work best with the workstation's hardware (see the Wikipedia article for details). The downside is that an ATLAS build is tied to the hardware it was compiled on and might run sub-optimally on another machine.
Intel provides the Math Kernel Library (MKL), which also includes BLAS and LAPACK routines. Intel claims the MKL to be the "fastest and most used math library for Intel and compatible processors" [link], although that statement has drawn some criticism, especially with regard to non-Intel processors [link].
Another highly optimized BLAS library is OpenBLAS, which is based on the hand-optimized assembly routines by Kazushige Goto (GotoBLAS). Like the other libraries, OpenBLAS provides the most common LAPACK routines.
There are other math libraries by different vendors out there, but I guess these are the most commonly used ones, and they are the only ones I have access to, so the tests described here only include those three.
In my daily work I have to deal with fully populated, double-precision complex matrices. Typically, they have to be LU-factorized during the solution process. Since this operation has a complexity of O(n³) and therefore dominates the overall simulation time, it is the perfect candidate for this test.
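To make the O(n³) scaling a bit more concrete, here is a small Python sketch. It assumes the usual leading-order operation count for a complex LU factorization, about 8n³/3 real floating-point operations (a complex multiply-add counted as 8 real flops). The function names and the sample sizes are my own choice for illustration and are not part of the test program.

```python
def zgetrf_flops(n):
    """Leading-order real flop count for the LU factorization of an
    n x n complex matrix (assumption: 8/3 * n^3, lower-order terms dropped)."""
    return 8.0 * n**3 / 3.0


def gflops(n, seconds):
    """Convert a measured wall-clock time into an approximate GFLOP/s rate."""
    return zgetrf_flops(n) / seconds / 1e9


if __name__ == "__main__":
    for n in (1000, 2000, 4000):
        print(f"n = {n:5d}: about {zgetrf_flops(n):.2e} real flops")
    # Doubling the matrix size multiplies the work by 2^3 = 8.
    print("work ratio n=2000 vs n=1000:",
          zgetrf_flops(2000) / zgetrf_flops(1000))
```

Dividing this count by a measured time gives a rate that is easier to compare across matrix sizes than raw seconds.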
So the testing procedure is as follows:
- Set up an n x n matrix
- Fill it with random numbers
- Perform the factorization using the LAPACK routine
The resulting source code is surprisingly small and takes the size n of the matrix as a command-line argument.
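The three steps can be sketched in a few lines of Python. Note that this is not the program used for the timings: instead of calling the LAPACK routine it uses a plain, unoptimized LU factorization with partial pivoting, so that it runs without any library. The names `random_complex_matrix` and `lu_factor`, the fixed seed and the default size of 100 are hypothetical choices for this illustration.

```python
import random
import sys
import time


def random_complex_matrix(n, seed=0):
    """Steps 1 and 2: set up an n x n matrix and fill it with random complex numbers."""
    rng = random.Random(seed)
    return [[complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
            for _ in range(n)]


def lu_factor(a):
    """Step 3 (stand-in for ZGETRF): in-place LU with partial pivoting.

    On return the strict lower triangle of `a` holds L (unit diagonal
    implied) and the upper triangle holds U. The returned list piv
    records that row k was swapped with row piv[k] (0-based indices).
    """
    n = len(a)
    piv = []
    for k in range(n):
        # Pick the row with the largest entry in column k as pivot row.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        piv.append(p)
        if a[p][k] == 0:
            raise ZeroDivisionError("matrix is singular")
        if p != k:
            a[k], a[p] = a[p], a[k]
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]          # multiplier, stored in place of L
            factor = a[i][k]
            for j in range(k + 1, n):   # update the trailing submatrix
                a[i][j] -= factor * a[k][j]
    return piv


if __name__ == "__main__":
    # Matrix size from the command line, with a small default.
    has_arg = len(sys.argv) > 1 and sys.argv[1].isdigit()
    n = int(sys.argv[1]) if has_arg else 100
    a = random_complex_matrix(n)
    start = time.perf_counter()
    lu_factor(a)
    print(f"n = {n}: {time.perf_counter() - start:.3f} s")
```

The three nested loops are where the O(n³) cost comes from. An optimized library performs the same arithmetic in cache-friendly blocks, which is exactly where the libraries compared here differ.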
The compilers used here are the Intel Fortran compiler (2013, update 4) and gfortran (4.7.3/Ubuntu). OpenBLAS is version 0.2.8, the Intel MKL version 11.0.
Well, here are the timings on two different machines, one with an Intel i7 980, the other with an AMD Opteron 6140.
On the Intel machine you can see that ATLAS outperforms the rest by approximately 25%. OpenBLAS is slightly faster than the Intel MKL.
On the AMD machine, though, OpenBLAS takes the lead and is more than 10% faster than the other two. The performance of the Intel MKL and ATLAS is comparable.
From these results, I'd recommend trying ATLAS if performance really matters, but using OpenBLAS for portable programs. OpenBLAS might even be the faster choice, as it was on the AMD machine in this test.
Both the Intel MKL and OpenBLAS support threaded execution, so the tests above are extended to run in parallel using OpenMP.
Both libraries scale quite well, though the MKL shows poor efficiency for small matrices.
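How well a library scales can be expressed as speedup and parallel efficiency. The following Python sketch shows the arithmetic. The timings in it are made-up placeholder values, not measurements from this test, and the function names are hypothetical.

```python
def speedup(t_serial, t_parallel):
    """How many times faster the threaded run is than the single-threaded one."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, threads):
    """Speedup per thread; 1.0 would be perfect scaling."""
    return speedup(t_serial, t_parallel) / threads


if __name__ == "__main__":
    # Placeholder wall-clock times in seconds, keyed by thread count.
    # These are NOT results from the article, just example input.
    timings = {1: 12.0, 2: 6.3, 4: 3.4}
    t1 = timings[1]
    for threads, t in sorted(timings.items()):
        print(f"{threads} thread(s): speedup {speedup(t1, t):.2f}, "
              f"efficiency {efficiency(t1, t, threads):.2f}")
```

With OpenMP the number of threads is normally controlled through the standard `OMP_NUM_THREADS` environment variable. For small matrices the work per thread is small, so the threading overhead weighs more heavily and the efficiency drops.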