We use Codee to identify opportunities to increase memory efficiency through Loop Interchange, taking advantage of sequential memory access patterns and favoring efficient vectorization.
Why is Loop Interchange important?
Loop interchange is a performance optimization technique that improves a loop nest's memory access pattern and can enable vectorization. When it is legal and applied correctly, it can yield a substantial speedup: roughly 1.7x to 2.9x in the measurements reported in this post.
Loop interchange applies to loop nests, that is, two or more nested loops. It swaps two loops of the nest so that the inner loop becomes the outer one and the outer loop becomes the inner one.
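As a minimal sketch of the idea, the two functions below scale a row-major matrix in place. The function names and the flat `rows x cols` layout are assumptions made for this illustration and do not come from Codee. Both functions do the same work; only the order of the two loops differs.

```c
#include <assert.h>
#include <stddef.h>

/* Before interchange: the inner loop walks down a column, so consecutive
   iterations touch elements that are `cols` positions apart in memory. */
void scale_strided(double *a, size_t rows, size_t cols, double factor) {
    assert(a != NULL);
    for (size_t j = 0; j < cols; ++j)
        for (size_t i = 0; i < rows; ++i)
            a[i * cols + j] *= factor;
}

/* After interchange: the inner loop walks along a row, so consecutive
   iterations touch adjacent elements (stride 1). */
void scale_sequential(double *a, size_t rows, size_t cols, double factor) {
    assert(a != NULL);
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            a[i * cols + j] *= factor;
}
```

Every element is updated independently of the others, so swapping the loops cannot change the result here. In general, interchange is only legal when it preserves the data dependences of the loop nest.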
Automation of Loop Interchange
Loop interchange is a sophisticated programming technique that current tools do not fully automate. Codee adds built-in support for detecting loop interchange opportunities in both perfectly nested and non-perfectly nested loops (see PWR039, PWR042 and PWR043 in the open catalog of performance optimization best practices). Consider the implementation of the matrix-matrix multiplication shown below:
28 void matmul(int n, double *A, double *B, double *C) {
30   for (int i = 0; i < n; ++i)
31   {
32     for (int j = 0; j < n; ++j)
33     {
34       double c = 0.0;
35       for (int k = 0; k < n; ++k)
36       {
37         c += A[i * n + k] * B[k * n + j];
38       }
39       C[i * n + j] += c;
40     }
41   }
42 }
When run on this code, Codee reports PWR043, “consider loop interchange by replacing the scalar reduction value”. Rewriting the source code according to this best practice gives the following loop nest:
void matmul(int n, double *A, double *B, double *C) {
  for (int i = 0; i < n; ++i)
  {
    for (int k = 0; k < n; ++k)
    {
      for (int j = 0; j < n; ++j)
      {
        C[i * n + j] += A[i * n + k] * B[k * n + j];
      }
    }
  }
}
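Note that after loop interchange, the ordering of the nested loops changes from IJK to IKJ. In the original IJK order, the innermost k loop reads B[k * n + j] with a stride of n elements. In the IKJ order, the innermost j loop accesses C[i * n + j] and B[k * n + j] at consecutive addresses, while A[i * n + k] stays constant. These sequential memory accesses in the innermost loop also enable its vectorization in this source code snippet.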
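To make the comparison concrete, here is a self-contained sketch that places both loop orders side by side so they can be checked against each other. The names `matmul_ijk` and `matmul_ikj` are hypothetical labels chosen for this example (the post calls both versions `matmul`), and the `const` qualifiers are an addition.

```c
#include <assert.h>
#include <stddef.h>

/* Original IJK order: scalar reduction into c, B read with stride n. */
void matmul_ijk(int n, const double *A, const double *B, double *C) {
    assert(A != NULL && B != NULL && C != NULL);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            double c = 0.0;
            for (int k = 0; k < n; ++k) {
                c += A[i * n + k] * B[k * n + j];
            }
            C[i * n + j] += c;
        }
    }
}

/* Interchanged IKJ order: the scalar c is replaced by direct updates of C,
   and the innermost loop walks C and B with stride 1. */
void matmul_ikj(int n, const double *A, const double *B, double *C) {
    assert(A != NULL && B != NULL && C != NULL);
    for (int i = 0; i < n; ++i) {
        for (int k = 0; k < n; ++k) {
            for (int j = 0; j < n; ++j) {
                C[i * n + j] += A[i * n + k] * B[k * n + j];
            }
        }
    }
}
```

Both versions add the same products in the same k order for each element of C. The difference is that the IJK version forms the whole sum first and adds it to C once, so when C holds nonzero values on entry the two results can differ in the last bits due to floating-point rounding.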
Performance Evaluation
In the example matrix-matrix multiplication code, loop interchange makes the code about 1.7x to 2x faster on Arm processors and about 2.3x to 2.9x faster on x86 processors.
Environment (Linux Arm) | Before Codee (seconds) | After Codee (seconds) | Time reduction (speedup) |
---|---|---|---|
CLANG 14 -O3 -ffast-math | 353.74 | 181.34 | 48.73% (1.95x) |
GCC 11 -O3 -ffast-math | 329.59 | 188.97 | 42.66% (1.74x) |
armclang -O3 -ffast-math | 357.39 | 181.73 | 49.15% (1.97x) |
Codee brings up to 2x faster code on Arm environments through loop interchange and vectorization
Environment (Linux x86_64) | Before Codee (seconds) | After Codee (seconds) | Time reduction (speedup) |
---|---|---|---|
CLANG 14 -O3 -ffast-math | 9.485 | 3.333 | 64.86% (2.85x) |
GCC 11 -O3 -ffast-math | 8.757 | 3.457 | 60.52% (2.53x) |
ICC -O3 -ffast-math | 7.342 | 3.261 | 55.58% (2.25x) |
Codee brings up to 2.85x faster code on x86 environments through loop interchange and vectorization
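The last column of both tables combines two figures: the percentage of run time saved and, in parentheses, the ratio of the two run times. A minimal sketch of how each follows from the measured times is shown below; the function names are assumptions made for this example.

```c
#include <assert.h>

/* How many times faster the optimized version runs: before / after. */
double speedup_factor(double before_s, double after_s) {
    assert(before_s > 0.0 && after_s > 0.0);
    return before_s / after_s;
}

/* Percentage of the original run time that was saved. */
double time_reduction_percent(double before_s, double after_s) {
    assert(before_s > 0.0);
    return 100.0 * (before_s - after_s) / before_s;
}
```

For the CLANG 14 row of the x86_64 table (9.485 s before, 3.333 s after), these give a factor of about 2.85 and a time reduction of about 64.86%.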
From the technology perspective, Codee automates loop interchange and enables the efficient vectorization of the innermost loop. Note that in these experiments the GNU, LLVM and Intel compilers do not apply loop interchange to this code, even at the -O3 -ffast-math optimization level.
Innermost loop (matmul_naive:35) | GCC 11 -O3 -ffast-math | CLANG 14 -O3 -ffast-math | armclang -O3 -ffast-math | ICC -O3 -ffast-math | Codee |
---|---|---|---|---|---|
Loop interchange | | | | | YES |
Loop vectorization | YES | NO (cost model) | NO (cost model) | YES | YES |
Loop peeling | YES | ||||
Loop interleaving | NO (cost model) | NO (cost model) | |||
Loop turned into non-loop | YES |
What to expect from Codee in the future?
We are continuously improving the capabilities of Codee in all aspects of software performance. In particular, we are working on automating more code variants related to Loop Interchange, including both detection and source code rewriting. Stay tuned by subscribing to the Codee newsletter!