When developing numerical routines in Fortran, you are often told to use static arrays instead of derived data types. The reasoning is simple: data in an array is stored contiguously in memory, which allows cache-friendly processing of consecutive operations.
This is true for each individual array, but typically you need quite a few values from different arrays during a calculation, and those arrays may reside anywhere in memory! Wouldn't it be better to group all required bits into one (contiguous) structure?
In other words, is it really faster to use static arrays?
As a quick benchmark, I devised a very simple example: Consider a set of quadrangles that are specified by their vertices. For these we need to calculate the normal vectors (unit length) and the area. Each quadrangle needs memory to store:
- Four vertices (each comprising coordinates in 3D)
- A normal vector (usually stored as a unit vector)
- The area of the quadrangle
Additionally, we assume that we need some extra information about each quadrangle for our numerical code, such as identifiers, boundary conditions and/or organizational overhead. While the first part is essential to every step of the computations, the latter is only required occasionally.
The following figure depicts the data requirements with each line representing one quadrangle:
When using static arrays for each attribute of the quadrangles, we end up with eleven individual arrays, each contiguous in memory:
In this scheme, related data shares the same index. However, the individual arrays might be separated in memory. In Fortran, the corresponding declarations become
```fortran
! STATIC ARRAYS
real(cp),    allocatable :: q_coord1(:,:)
real(cp),    allocatable :: q_coord2(:,:)
real(cp),    allocatable :: q_coord3(:,:)
real(cp),    allocatable :: q_coord4(:,:)
real(cp),    allocatable :: q_n(:,:)
real(cp),    allocatable :: q_area(:)
!--- Payload
integer,     allocatable :: q_node_numbers(:,:)
integer,     allocatable :: q_id(:)
character,   allocatable :: q_type(:)
logical,     allocatable :: q_is_triangle(:)
complex(cp), allocatable :: q_some_large_array(:,:)
```
A more intuitive way is to use a derived data type that groups the data of one quadrangle, and to store the quadrangles as an array of this type:
The derived type and array translate to
```fortran
! STRUCTURE
type :: structQuad
   real(cp) :: coord1(3)
   real(cp) :: coord2(3)
   real(cp) :: coord3(3)
   real(cp) :: coord4(3)
   real(cp) :: n(3)
   real(cp) :: area
   !--- Payload
   integer     :: node_numbers(4)
   integer     :: id
   character   :: type
   logical     :: is_triangle
   complex(cp) :: some_large_array(100)
end type
type(structQuad), allocatable :: q(:)
```
Instead of eleven different static arrays with the q_ prefix that have to be allocated and de-allocated individually, we only have to handle the single array q.
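To illustrate the bookkeeping difference, here is a sketch of the two allocation styles (NOQUAD denotes the number of quadrangles, as in the loops below):

```fortran
! Static arrays: every attribute is allocated individually
allocate(q_coord1(3,NOQUAD), q_coord2(3,NOQUAD), q_coord3(3,NOQUAD), &
         q_coord4(3,NOQUAD), q_n(3,NOQUAD), q_area(NOQUAD))
! ... plus the five payload arrays ...

! Derived type: a single allocation covers all attributes
allocate(q(NOQUAD))
```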
Typically, we calculate all normal vectors in a loop over the quadrangles:
```fortran
! STATIC ARRAYS
do c = 1, NOQUAD
   ! Normal vector
   a = q_coord2(:,c) - q_coord1(:,c)
   b = q_coord3(:,c) - q_coord1(:,c)
   q_n(1,c) = a(2) * b(3) - a(3) * b(2)
   q_n(2,c) = a(3) * b(1) - a(1) * b(3)
   q_n(3,c) = a(1) * b(2) - a(2) * b(1)
   ! Normalize
   q_area(c) = sqrt(dot_product(q_n(:,c), q_n(:,c)))
   q_n(:,c)  = q_n(:,c) / q_area(c)
end do ! c
```
```fortran
! STRUCTURE
do c = 1, NOQUAD
   ! Normal vector
   a = q(c)%coord2 - q(c)%coord1
   b = q(c)%coord3 - q(c)%coord1
   q(c)%n(1) = a(2) * b(3) - a(3) * b(2)
   q(c)%n(2) = a(3) * b(1) - a(1) * b(3)
   q(c)%n(3) = a(1) * b(2) - a(2) * b(1)
   ! Normalize
   q(c)%area = sqrt(dot_product(q(c)%n, q(c)%n))
   q(c)%n    = q(c)%n / q(c)%area
end do ! c
```
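Both loop bodies implement the same arithmetic: with the difference vectors $\mathbf{a} = \mathbf{v}_2 - \mathbf{v}_1$ and $\mathbf{b} = \mathbf{v}_3 - \mathbf{v}_1$, the cross product gives a vector normal to the quadrangle, and for a rectangle its length equals the area:

```latex
\mathbf{n} = \mathbf{a} \times \mathbf{b}, \qquad
A = \lVert \mathbf{n} \rVert, \qquad
\hat{\mathbf{n}} = \mathbf{n} / A
```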
So let's analyze what is happening here in terms of memory...
In the first loop (static arrays), we load the elements of q_coord1 through q_coord3 belonging to the current counter into memory (q_coord4 is not required since we consider rectangles only). Then, q_n is determined with the help of two temporary vectors, and finally q_area is found as the length of the normal vector, which is normalized in the last step.
Memory is not accessed element-wise! Instead, whole chunks of the array are loaded into the CPU cache (cache lines). As a consequence, there is a good chance that the array elements required in the next iteration are already in the cache.
In the second loop, we just load q(c) into memory and operate on that structure. Again, a chunk of q is cached, but this time each element is rather large due to the payload. Hence, far fewer elements fit into the cache than with the static arrays, and we have to go back to main memory much sooner and more frequently. Since cache accesses are much faster than accesses to main memory, we expect the second loop to be slower than the first one.
To make matters worse, additional de-referencing is required for the structures. For instance, if we want to access the current normal vector q_n(:,c), the compiler first fetches the starting address of q_n and then calculates the address of the c-th element. For the structure, it has to find the starting address of q, calculate the position of q(c), and then find the address of the member q(c)%n. This additional step takes additional time.
To mitigate this problem, one could set a pointer to q(c) at the beginning of each iteration and then work on that pointer. This way, the address of q(c) is computed only once per iteration instead of once per member access. Especially in large loops with lots of operations, this can lead to an enormous speed-up.
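A minimal sketch of this pointer variant (assuming the structQuad type and the array q from above; note that q additionally needs the TARGET attribute so that a pointer may be associated with its elements):

```fortran
type(structQuad), pointer :: p

do c = 1, NOQUAD
   p => q(c)               ! resolve the address of q(c) once per iteration
   a = p%coord2 - p%coord1
   b = p%coord3 - p%coord1
   p%n(1) = a(2) * b(3) - a(3) * b(2)
   p%n(2) = a(3) * b(1) - a(1) * b(3)
   p%n(3) = a(1) * b(2) - a(2) * b(1)
   p%area = sqrt(dot_product(p%n, p%n))
   p%n    = p%n / p%area
end do ! c
```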
Separation into Hot and Cold Parts
One solution to the bad cache utilization is to split the structure into two sub-structures, one containing all data required in the main loop, and one containing the payload. This separation into hot and cold parts could look like this:
```fortran
! STRUCTURE: HOT AND COLD
type :: structQuadHot
   real(cp) :: coord1(3)
   real(cp) :: coord2(3)
   real(cp) :: coord3(3)
   real(cp) :: coord4(3)
   real(cp) :: n(3)
   real(cp) :: area
end type
!--- Payload
type :: structQuadCold
   integer     :: node_numbers(4)
   integer     :: id
   character   :: type
   logical     :: is_triangle
   complex(cp) :: some_large_array(100)
end type
type(structQuadHot),  allocatable :: qHot(:)
type(structQuadCold), allocatable :: qCold(:)
```
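The calculation loop then only touches the hot array; a sketch of the adapted loop (same arithmetic as before, assuming qHot has been allocated and filled):

```fortran
! STRUCTURE: HOT (sketch)
do c = 1, NOQUAD
   ! Normal vector
   a = qHot(c)%coord2 - qHot(c)%coord1
   b = qHot(c)%coord3 - qHot(c)%coord1
   qHot(c)%n(1) = a(2) * b(3) - a(3) * b(2)
   qHot(c)%n(2) = a(3) * b(1) - a(1) * b(3)
   qHot(c)%n(3) = a(1) * b(2) - a(2) * b(1)
   ! Normalize
   qHot(c)%area = sqrt(dot_product(qHot(c)%n, qHot(c)%n))
   qHot(c)%n    = qHot(c)%n / qHot(c)%area
end do ! c
```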
Well, let's see how those schemes perform in our setup!
Our little program (source code) will perform the following tasks:
- Setup the quadrangles: divide the rectangle (0,0,0)-(1,1,0) into 1e6 sub-rectangles
- Setup the (dummy-) payload
- Calculate the unit normal vectors and the quadrangle area
For accurate timings, the OpenMP function omp_get_wtime() is used. Additionally, the different loops are parallelized using !$omp parallel do directives.
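A sketch of how such a loop might be timed and parallelized (the variable names t0 and t1 are assumptions for illustration, not taken from the benchmark source):

```fortran
t0 = omp_get_wtime()
!$omp parallel do private(a, b)
do c = 1, NOQUAD
   ! ... loop body as above ...
end do
!$omp end parallel do
t1 = omp_get_wtime()
print *, 'Calculations', t1 - t0
```

The temporaries a and b must be declared private, since each thread needs its own copies.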
To yield comparable results, the program was compiled using the Intel Fortran compiler (2013, update 4) and gfortran (4.7.3/Ubuntu).
In the following tables, four schemes have been investigated:
- static arrays
- the complete structure
- two structures for the hot and cold parts
- the hot/cold scheme using pointers
First, the program is compiled without optimization using
```
ifort -O0 -openmp -o structTest-intel structTest.F90
```
and the same using
```
gfortran -O0 -o structTest-gcc -fopenmp structTest.F90
```
Here come the results:
```
$ OMP_NUM_THREADS=1 ./structTest-intel
STATIC ARRAYS:     Setup 0.994   Calculations 0.123
STRUCTURE:         Setup 0.855   Calculations 0.137
STRUCTURE HOT:     Setup 0.811   Calculations 0.075
STRUCTURE HOT,PTR: Setup 0.732   Calculations 0.061

$ OMP_NUM_THREADS=1 ./structTest-gcc
STATIC ARRAYS:     Setup 0.607   Calculations 0.073
STRUCTURE:         Setup 0.654   Calculations 0.110
STRUCTURE HOT:     Setup 0.603   Calculations 0.049
STRUCTURE HOT,PTR: Setup 0.594   Calculations 0.037
```
Well, this is kind of surprising! Okay, using the complete structure as one derived type is really slow... But the hot/cold scheme outperforms the static arrays by 50%, and the combination with pointers is almost twice as fast as static arrays. And this holds for both compilers.
Now let's do it with four threads:
```
$ OMP_NUM_THREADS=4 ./structTest-intel
STATIC ARRAYS:     Setup 0.335   Calculations 0.032
STRUCTURE:         Setup 0.264   Calculations 0.039
STRUCTURE HOT:     Setup 0.251   Calculations 0.019
STRUCTURE HOT,PTR: Setup 0.236   Calculations 0.017

$ OMP_NUM_THREADS=4 ./structTest-gcc
STATIC ARRAYS:     Setup 0.211   Calculations 0.019
STRUCTURE:         Setup 0.228   Calculations 0.033
STRUCTURE HOT:     Setup 0.214   Calculations 0.015
STRUCTURE HOT,PTR: Setup 0.215   Calculations 0.015
```
The same pattern as before, but while the calculations scale with above 80% efficiency, the efficiency drops to about 60% as soon as pointers come into play.
Still, in total numbers, using the hot/cold scheme with pointers is faster than using static arrays.
With Optimization -O2
Let's take the next step and enable optimization: the program is now compiled with -O2 for both compilers.
```
$ OMP_NUM_THREADS=1 ./structTest-intel
STATIC ARRAYS:     Setup 0.531   Calculations 0.028
STRUCTURE:         Setup 0.519   Calculations 0.056
STRUCTURE HOT:     Setup 0.503   Calculations 0.023
STRUCTURE HOT,PTR: Setup 0.501   Calculations 0.024

$ OMP_NUM_THREADS=1 ./structTest-gcc
STATIC ARRAYS:     Setup 0.501   Calculations 0.029
STRUCTURE:         Setup 0.549   Calculations 0.062
STRUCTURE HOT:     Setup 0.521   Calculations 0.022
STRUCTURE HOT,PTR: Setup 0.526   Calculations 0.022
```
The results look pretty similar to the ones without optimization, although the differences are not that pronounced any more.
With four threads we get:
```
$ OMP_NUM_THREADS=4 ./structTest-intel
STATIC ARRAYS:     Setup 0.205   Calculations 0.010
STRUCTURE:         Setup 0.220   Calculations 0.021
STRUCTURE HOT:     Setup 0.203   Calculations 0.014
STRUCTURE HOT,PTR: Setup 0.200   Calculations 0.016

$ OMP_NUM_THREADS=4 ./structTest-gcc
STATIC ARRAYS:     Setup 0.201   Calculations 0.009
STRUCTURE:         Setup 0.223   Calculations 0.026
STRUCTURE HOT:     Setup 0.203   Calculations 0.015
STRUCTURE HOT,PTR: Setup 0.202   Calculations 0.015
```
From the results you can see two things:
- For parallel execution, static arrays are much faster.
- The pointers to the structures hardly have an impact any more; it seems the compiler does something similar on its own.
Let's investigate the impact of the optimization a little further. For this, we choose the following compile options:
```
ifort -fast -O4 -openmp -o structTest-intel structTest.F90
```
and the same using
```
gfortran -O4 -funroll-loops -ffast-math -o structTest-gcc -fopenmp structTest.F90
```
Here are the single-thread timings:
```
$ OMP_NUM_THREADS=1 ./structTest-intel
STATIC ARRAYS:     Setup 0.511   Calculations 0.026
STRUCTURE:         Setup 0.520   Calculations 0.057
STRUCTURE HOT:     Setup 0.470   Calculations 0.023
STRUCTURE HOT,PTR: Setup 0.475   Calculations 0.020

$ OMP_NUM_THREADS=1 ./structTest-gcc
STATIC ARRAYS:     Setup 0.505   Calculations 0.028
STRUCTURE:         Setup 0.533   Calculations 0.060
STRUCTURE HOT:     Setup 0.502   Calculations 0.024
STRUCTURE HOT,PTR: Setup 0.496   Calculations 0.022
```
And four threads give:
```
$ OMP_NUM_THREADS=4 ./structTest-intel
STATIC ARRAYS:     Setup 0.209   Calculations 0.009
STRUCTURE:         Setup 0.220   Calculations 0.021
STRUCTURE HOT:     Setup 0.202   Calculations 0.014
STRUCTURE HOT,PTR: Setup 0.199   Calculations 0.014

$ OMP_NUM_THREADS=4 ./structTest-gcc
STATIC ARRAYS:     Setup 0.206   Calculations 0.009
STRUCTURE:         Setup 0.239   Calculations 0.023
STRUCTURE HOT:     Setup 0.203   Calculations 0.014
STRUCTURE HOT,PTR: Setup 0.212   Calculations 0.015
```
The results of this investigation are two-fold: when writing single-threaded programs, the hot/cold scheme with pointers outperforms the static arrays by almost 30%. When going multi-threaded with optimization enabled, however, static arrays are clearly faster (up to 40%).
So, if you are running your computations in shared memory using OpenMP, you are probably better off using static arrays. It would be interesting to take a look at distributed-memory parallelization using MPI, to see whether the single-thread advantage of the structure schemes outweighs the more complex communication.