This recommendation is part of the open catalog of best practice rules for performance that is automatically detected and reported by Codee.
Issue
Performance of the loop nest can benefit from loop interchange, but scalar reduction prevents loop interchange.
Relevance
Inefficient memory access pattern and low locality of reference are among the main reasons for low performance on modern computer systems. Matrices are stored in a row-major order in C and column-major order in Fortran. Iterating over them column-wise (in C) and row-wise (in Fortran) is inefficient, because it uses the memory subsystem suboptimally.
Nested loops that iterate over matrices inefficiently can be optimized by applying loop interchange. Using loop interchange, the inefficient matrix access pattern is replaced with a more efficient one. Often, loop interchange enables vectorization of the innermost loop which additionally improves performance.
In order to perform the loop interchange, the loops need to be perfectly nested, i.e. all the statements need to be inside the innermost loop. However, due to the initialization of a reduction variablе, loop interchange is not directly applicable.
Actions
Promote scalar reduction to a vector. Then, perform loop fission to isolate the computation of the scalar reduction in a perfect loop nesting and enable loop interchange. This typically leads to creating three loops: the first loop initializes the scalar reduction; the second loop computes the scalar reduction; and the third loop computes the final result of the original loop. See below a code example on how it is done.
Code example
Have a look at the following code:
for (int i = 0; i < n; i++) {
double s = 0.0;
for (int j = 0; j < n; j++) {
s += a[j][i];
}
b[i] = 0.1 * s;
}
With regards to the innermost loop, the memory access pattern of the matrix a
is strided, and this loop can profit from loop interchange, but reduction variable initialization on line 2 and reduction variable usage on line 6 prevent it.
To make the loop vectorizable, we need to perform scalar to vector promotion on variable s
. What this means is that we replace the scalar variable s
with an array:
for (int i = 0; i < n; i++) {
s[i] = 0.0;
for (int j = 0; j < n; j++) {
s[i] += a[j][i];
}
b[i] = 0.1 * s[i];
}
After doing this, we can use loop fission to move the statements on lines 2 and 6 to separate loops. The result looks like this:
for (int i = 0; i < n; i++) {
s[i] = 0.0;
}
for (int i = 0; i < n; i++) {
for (int j = 0; j < n; j++) {
s[i] += a[j][i];
}
}
for (int i = 0; i < n; i++) {
b[i] = 0.1 * s[i];
}
The first and the third loop (lines 1 and 9) are not performance critical, since they are not nested. On the other hand, the second loop (line 4) is performance critical, since it contains the loop nest. Fortunately, the loop nest is perfectly nested, so that loop interchange is applicable. The final result looks like this (note the order of the nested loops is now JI instead of the original order IJ):
for (int i = 0; i < n; i++) {
s[i] = 0.0;
}
for (int j = 0; j < n; j++) {
for (int i = 0; i < n; i++) {
s[i] += a[j][i];
}
}
for (int i = 0; i < n; i++) {
b[i] = 0.1 * s[i];
}
Related resources
- PWR039: Consider loop interchange to improve the locality of reference and enable vectorization
- PWR043: Consider loop interchange by replacing the scalar reduction value
- PWR010: Avoid column-major array access in C/C++

Building performance into the code from day one with Codee