When developing numerical routines in Fortran you often get told to use static arrays instead of derived data types. The reasoning is quite simple: Data in arrays is stored contiguous in memory, allowing a cache-optimized processing of consecutive operations.

This is true for individual arrays, but typically you require quite a few values from different arrays during your calculations, and those arrays might be anywhere in the memory! Wouldn't it be better to group all required bits into one (contiguous) structure?

In other words, is it really faster to use static arrays?

## Setup

As a quick benchmark, I devised a very simple example: Consider a set of quadrangles that are specified by their vertices. For these we need to calculate the normal vectors (unit length) and the area. Each quadrangle needs memory to store:

- Four vertices (each comprising coordinates in 3D)
- A normal vector (usually as a normal vector)
- The area of the quadrangle

Additionally, we assume that we need some additional information about each triangle for our numerical code, like identifiers, boundary conditions and/or organizational overhead. While the first part is essential to every step in the computations, the latter is just required occasionally.

The following figure depicts the data requirements with each line representing one quadrangle:

When using static arrays arrays for each attribute of the quadrangles we end up with a seven individual arrays contiguous in memory:

In this scheme, related data shares the same index. However, the individual arrays might be separated in memory. In Fortran, the corresponding declarations become

```
! STATIC ARRAYS
real(cp),allocatable :: q_coord1(:,:)
real(cp),allocatable :: q_coord2(:,:)
real(cp),allocatable :: q_coord3(:,:)
real(cp),allocatable :: q_coord4(:,:)
real(cp),allocatable :: q_n(:,:)
real(cp),allocatable :: q_area(:)
!--- Payload
integer,allocatable :: q_node_numbers(:,:)
integer,allocatable :: q_id(:)
character,allocatable :: q_type(:)
logical,allocatable :: q_is_triangle(:)
complex(cp),allocatable :: q_some_large_array(:,:)
```

A more intuitive way is using derived data types to group the data for each quadrangle and store the data as an array of this type:

The derived type and array translate to

```
! STRUCTURE
type :: structQuad
real(cp) :: coord1(3)
real(cp) :: coord2(3)
real(cp) :: coord3(3)
real(cp) :: coord4(3)
real(cp) :: n(3)
real(cp) :: area
!--- Payload
integer :: node_numbers(4)
integer :: id
character :: type
logical :: is_triangle
complex(cp) :: some_large_array(100)
end type
type(structQuad),allocatable :: q(:)
```

Instead of eleven different static arrays `q_`

that have to be allocated and de-allocated individually, we need to handle just *one* array `q`

!

## Memory Issues

Typically, we calculate all normal vectors in a loop over the quadrangles:

```
! STATIC ARRAYS
do c=1,NOQUAD
! Normal vector
a = q_coord2(:,c) - q_coord1(:,c)
b = q_coord3(:,c) - q_coord1(:,c)
q_n(1,c) = a(2) * b(3) - a(3) * b(2)
q_n(2,c) = a(3) * b(1) - a(1) * b(3)
q_n(3,c) = a(1) * b(2) - a(2) * b(1)
! Normalize
q_area(c) = sqrt(dot_product(q_n(:,c), q_n(:,c)))
q_n(:,c) = q_n(:,c) / q_area(c)
end do ! c
```

and

```
! STRUCTURE
do c=1,NOQUAD
! Normal vector
a = q(c)%coord2 - q(c)%coord1
b = q(c)%coord3 - q(c)%coord1
q(c)%n(1) = a(2) * b(3) - a(3) * b(2)
q(c)%n(2) = a(3) * b(1) - a(1) * b(3)
q(c)%n(3) = a(1) * b(2) - a(2) * b(1)
! Normalize
q(c)%area = sqrt(dot_product(q(c)%n, q(c)%n))
q(c)%n = q(c)%n / q(c)%area
end do ! c
```

So let's analyze what is happening here in terms of memory...

In the first loop (static arrays) we load the elements of `q_coord1`

, `q_coord2`

, and `q_coord3`

related to the current counter into memory (`q_coord4`

is not required as we consider rectangles only). Then, `q_n`

is determined with the help of two temporary vectors, and finally `q_area`

is found as the length of the normal vector, which is normalized in the last step.
Access to the array elements is not possible element-wise! Instead, chunks of the array is loaded into the CPU registers (*memory lines*). As a consequence, there is a good chance that the array elements required in the next step are still in the cache.

In the second loop we just load `q(c)`

into memory and operate on that structure. Again, a chunk of `q`

is cached, but this time each element is rather huge due to the payload. Thus, the number of elements in the cache is much smaller compared to the static arrays, and we have to perform memory operations a lot sooner and more frequently. As cache operations are much faster than operations on memory, we expect the second loop to be slower than the first one.

To make matters worse, there is additional de-referencing required on the structures. For instance, if we want to access the current normal vector, `q_n(:,c)`

first gathers the starting address of `q_n`

and then calculates the address of the c-th element. For the structure, we have to find the starting address of `q`

, calculate the position of `q(c)`

, and then find the address of the member `q(c)%n`

. This additional step requires additional time.

To mitigate this problem, one could use a pointer to `q(c)`

at the beginning of each iteration and then work on that pointer. This way, the first de-referencing is omitted. Especially in large loops with lots of operations this can lead to an enormous speed-up.

## Separation into Hot and Cold Parts

One solution to the bad cache utilization is to split the structure into two sub-structures, one containing all data required in the main loop, and one containing the payload. This separation into *hot* and *cold* parts could look like this:

```
! STRUCTURE: HOT AND COLD
type :: structQuadHot
real(cp) :: coord1(3)
real(cp) :: coord2(3)
real(cp) :: coord3(3)
real(cp) :: coord4(3)
real(cp) :: n(3)
real(cp) :: area
end type
!--- Payload
type :: structQuadCold
integer :: node_numbers(4)
integer :: id
character :: type
logical :: is_triangle
complex(cp) :: some_large_array(100)
end type
type(structQuadHot),allocatable :: qHot(:)
type(structQuadCold),allocatable :: qCold(:)
```

## Performance

Well, let's see how those schemes perform in our setup!

Our little program (source code) will perform the following tasks:

- Setup the quadrangles: divide the rectangle (0,0,0)-(1,1,0) into 1e6 sub-rectangles
- Setup the (dummy-) payload
- Calculate the unit normal vectors and the quadrangle area

For accurate timings the OpenMP function `omp_get_wtime()`

is used. Additionally, the different loops are parallelized using `$omp parallel do`

directives.

To yield comparable results, the program was compiled using the Intel Fortran compiler (2013, update 4) and gfortran (4.7.3/Ubuntu).

In the following tables, the following schemes have been investigated:

- static arrays
- the complete structure
- two structures for the hot and cold parts
- the hot/cold scheme using pointers

**Without Optimization**

The program is compiled with the following options using `ifort`

:

```
ifort -O0 -openmp -o structTest-intel structTest.F90
```

and the same using `gfortran`

:

```
gfortran -O0 -o structTest-gcc -fopenmp structTest.F90
```

Here come the results:

```
$ OMP_NUM_THREADS=1 ./structTest-intel
STATIC ARRAYS: Setup 0.994 Calculations 0.123
STRUCTURE: Setup 0.855 Calculations 0.137
STRUCTURE HOT : Setup 0.811 Calculations 0.075
STRUCTURE HOT,PTR: Setup 0.732 Calculations 0.061
$ OMP_NUM_THREADS=1 ./structTest-gcc
STATIC ARRAYS: Setup 0.607 Calculations 0.073
STRUCTURE: Setup 0.654 Calculations 0.110
STRUCTURE HOT : Setup 0.603 Calculations 0.049
STRUCTURE HOT,PTR: Setup 0.594 Calculations 0.037
```

Well, this is kind of surprising! Okay, using the complete structure as one derived type is really slow... But the hot/cold scheme outperforms the static arrays by 50%, and the combination with pointers is almost *twice* as fast as static arrays. And this holds for both compilers.

Now let's do it with four threads:

```
$ OMP_NUM_THREADS=4 ./structTest-intel
STATIC ARRAYS: Setup 0.335 Calculations 0.032
STRUCTURE: Setup 0.264 Calculations 0.039
STRUCTURE HOT : Setup 0.251 Calculations 0.019
STRUCTURE HOT,PTR: Setup 0.236 Calculations 0.017
$ OMP_NUM_THREADS=4 ./structTest-gcc
STATIC ARRAYS: Setup 0.211 Calculations 0.019
STRUCTURE: Setup 0.228 Calculations 0.033
STRUCTURE HOT : Setup 0.214 Calculations 0.015
STRUCTURE HOT,PTR: Setup 0.215 Calculations 0.015
```

The same pattern as before, but while the calculations scale with above 80% efficiency, as soon as pointers come into play efficiency drops to 60%.

Still, in total numbers, using the hot/cold scheme with pointers is faster than using static arrays.

**With Optimization -O2**

Let's do another step and enable optimization... Now, the program is compiled with `-O2`

for both compilers.

```
$ OMP_NUM_THREADS=1 ./structTest-intel
STATIC ARRAYS: Setup 0.531 Calculations 0.028
STRUCTURE: Setup 0.519 Calculations 0.056
STRUCTURE HOT : Setup 0.503 Calculations 0.023
STRUCTURE HOT,PTR: Setup 0.501 Calculations 0.024
$ OMP_NUM_THREADS=1 ./structTest-gcc
STATIC ARRAYS: Setup 0.501 Calculations 0.029
STRUCTURE: Setup 0.549 Calculations 0.062
STRUCTURE HOT : Setup 0.521 Calculations 0.022
STRUCTURE HOT,PTR: Setup 0.526 Calculations 0.022
```

The results look pretty similar to the ones without optimization, although the differences are not that pronounced any more.

With four threads we get:

```
$ OMP_NUM_THREADS=4 ./structTest-intel
STATIC ARRAYS: Setup 0.205 Calculations 0.010
STRUCTURE: Setup 0.220 Calculations 0.021
STRUCTURE HOT : Setup 0.203 Calculations 0.014
STRUCTURE HOT,PTR: Setup 0.200 Calculations 0.016
$ OMP_NUM_THREADS=4 ./structTest-gcc
STATIC ARRAYS: Setup 0.201 Calculations 0.009
STRUCTURE: Setup 0.223 Calculations 0.026
STRUCTURE HOT : Setup 0.203 Calculations 0.015
STRUCTURE HOT,PTR: Setup 0.202 Calculations 0.015
```

From the results you can see two things:

- For parallel execution, static arrays are much faster.
- The pointers to the structures have little impact any more - it seems the compiler does something similar on his own.

Let investigate the impact of the optimization a little further. For this we choose the following compile options:

```
ifort -fast -O4 -openmp -o structTest-intel structTest.F90
```

and the same using `gfortran`

:

```
gfortran -O4 -funroll-loops -ffast-math -o structTest-gcc -fopenmp structTest.F90
```

Here are the single-thread timings:

```
$ OMP_NUM_THREADS=1 ./structTest-intel
STATIC ARRAYS: Setup 0.511 Calculations 0.026
STRUCTURE: Setup 0.520 Calculations 0.057
STRUCTURE HOT : Setup 0.470 Calculations 0.023
STRUCTURE HOT,PTR: Setup 0.475 Calculations 0.020
$ OMP_NUM_THREADS=1 ./structTest-gcc
STATIC ARRAYS: Setup 0.505 Calculations 0.028
STRUCTURE: Setup 0.533 Calculations 0.060
STRUCTURE HOT : Setup 0.502 Calculations 0.024
STRUCTURE HOT,PTR: Setup 0.496 Calculations 0.022
```

And four threads give:

```
$ OMP_NUM_THREADS=4 ./structTest-intel
STATIC ARRAYS: Setup 0.209 Calculations 0.009
STRUCTURE: Setup 0.220 Calculations 0.021
STRUCTURE HOT : Setup 0.202 Calculations 0.014
STRUCTURE HOT,PTR: Setup 0.199 Calculations 0.014
$ OMP_NUM_THREADS=4 ./structTest-gcc
STATIC ARRAYS: Setup 0.206 Calculations 0.009
STRUCTURE: Setup 0.239 Calculations 0.023
STRUCTURE HOT : Setup 0.203 Calculations 0.014
STRUCTURE HOT,PTR: Setup 0.212 Calculations 0.015
```

## Conclusions

The results of this investigation are two-fold: When writing single-thread programs, the hot/cold scheme with pointers outperforms the static arrays by almost 30%.

When going multi-thread and using optimization, however, static arrays are clearly faster (up 40%).

So, if you are running your computations in shared memory using OpenMP, you are probably better off using static arrays. It would be interesting to take a look at distributed-memory parallelization using MPI to see how the single-thread performance outweighs the more complex communications.