Vectorization is a powerful technique for achieving peak computational performance. However, not all code is easily vectorizable by all compilers. In this post we talk about vectorizing complex loops that compilers do not vectorize automatically. The idea is to split the loop into two loops: one for the vectorizable part and one for the non-vectorizable part. If the vectorizable part is computation-heavy, the performance improvements that come with vectorization should “hide” the fact that we iterate over the same dataset twice.
Vectorization
The idea behind vectorization is to speed up the computation by working on several pieces of data instead of one. For instance, modern x86-64 CPUs with Advanced Vector Extensions can process 4 doubles in one instruction. Using vector instructions can significantly increase the speed of your code.
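For illustration, here is a minimal sketch (the function name and the use of restrict are our own additions, not part of the original post) of a loop that compilers typically vectorize automatically when compiled with optimizations and AVX enabled:

void scale(double *restrict out, const double *restrict in, int n, double k) {
    /* No conditionals, sequential access, no loop-carried dependence:
       with AVX the compiler can multiply 4 doubles per instruction. */
    for (int i = 0; i < n; i++) {
        out[i] = in[i] * k;
    }
}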
Loop vectorization works best if there are no conditional statements (e.g. if, switch) in the loop body. Unfortunately, not all loops are free of conditionals, and often they cannot be avoided. Depending on the type of condition, this can either prevent vectorization or make it inefficient.
There are many reasons why compilers don’t vectorize loops. In this post we present two types of non-vectorizable loops due to conditionals, but generally, any type of loop that has a lot of computations in it and the compiler doesn’t vectorize automatically can profit from the information we are about to present.
Non-vectorizable loops due to conditionals
Here we give an example of two loops that can be vectorized with creative use of vectorization intrinsics; however, most compilers don’t vectorize them automatically. The loops in question are a loop with a conditional break statement in its body and a search loop. Here is the source code of a loop with a conditional break in the loop body:
double break_loop_naive(double in[], int n, double flag) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        if (in[i] == flag) {
            break;
        }
        sum += calculate(in[i]);
    }
    return sum;
}
The loop iterates over the input array in and calls calculate on each element of the input. The returned value is then added to the accumulator sum. If, at some point, the value of the currently processed element is equal to flag, the computation stops.

We deliberately omitted the source code of the function calculate(), because, as you will see later, the performance of the transformation depends on its arithmetic intensity. Here we define arithmetic intensity as the ratio between the number of arithmetic operations and the number of bytes fetched from memory (you can read more about arithmetic intensity in our previous post about the roofline model).
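For reference, the high-arithmetic-intensity case measured later in this post computes cos(sqrt(fabs(in))). A possible definition of calculate() along those lines (our own sketch; the exact code in the repository may differ) is:

#include <math.h>

/* Sketch of an arithmetically intensive calculate(): a single input value
   goes through several expensive math operations (fabs, sqrt, cos). */
static inline double calculate(double x) {
    return cos(sqrt(fabs(x)));
}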
Here is another example of a non-vectorizable loop:
double search_loop_naive(double in[], int n, int* out_max_index) {
    int index = -1;
    double max = 0.0;
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        double r = calculate(in[i]);
        if (r > max) {
            max = r;
            index = i;
        }
        sum += r;
    }
    *out_max_index = index;
    return sum;
}
The loop is similar to the previous one. It also iterates over the input array in and calls calculate on each element of the input. The returned value is added to the accumulator sum. Additionally, it finds the element of the array for which calculate(in[i]) is largest.
The obstacle to vectorization of the two loops is the conditional statement in the loop body. In the first loop, the loop trip count is not known in advance, since the loop can be interrupted at any time by the break. The second loop should in principle be vectorizable, since it follows a scalar reduction pattern with the MAX operator, but CLANG and GCC don’t know how to vectorize this pattern.
Loop fission
If the calculate function is arithmetically intensive, it would profit greatly from vectorization, but the conditional statements in the loop bodies prevent it. The idea is to perform loop fission, i.e. to split the loop into two loops: the first loop performs the non-vectorizable part and the second loop performs the vectorizable part of the computation.
The transformation itself is not difficult. Here is the code of the loop with the break statement before and after loop fission:
double break_loop_naive(double in[], int n, double flag) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        if (in[i] == flag) {
            break;
        }
        sum += calculate(in[i]);
    }
    return sum;
}

double break_loop_split(double in[], int n, double flag) {
    double sum = 0.0;
    int m;
    for (m = 0; m < n; m++) {
        if (in[m] == flag) {
            break;
        }
    }
    for (int i = 0; i < m; i++) {
        sum += calculate(in[i]);
    }
    return sum;
}
The loop fission is straightforward. The first loop finds the number of iterations for the second loop by iterating through the input array until it finds the flag value. The second loop performs the calculation. Ideally, the compiler should be able to vectorize the second loop, since it is trivially vectorizable.
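If a particular compiler still refuses to vectorize the second loop, one portable way to nudge it is an OpenMP SIMD reduction hint. The sketch below is our own and assumes calculate() is inlinable (or marked with #pragma omp declare simd); compile with -fopenmp-simd on GCC or CLANG for the pragma to take effect:

double break_loop_split_simd(double in[], int n, double flag) {
    double sum = 0.0;
    int m;
    /* Non-vectorizable part: find the trip count. */
    for (m = 0; m < n; m++) {
        if (in[m] == flag) {
            break;
        }
    }
    /* Vectorizable part: explicitly request a SIMD reduction. */
    #pragma omp simd reduction(+:sum)
    for (int i = 0; i < m; i++) {
        sum += calculate(in[i]);
    }
    return sum;
}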
Here is the code of the search loop before and after the loop fission:
double search_loop_naive(double in[], int n, int* out_max_index) {
    int index = -1;
    double max = 0.0;
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        double r = calculate(in[i]);
        if (r > max) {
            max = r;
            index = i;
        }
        sum += r;
    }
    *out_max_index = index;
    return sum;
}

double search_loop_split(double in[], int n, int* out_max_index) {
    int index = -1;
    double max = 0.0;
    double sum = 0.0;
    double* tmp = malloc(sizeof(double) * n);
    for (int i = 0; i < n; i++) {
        double r = calculate(in[i]);
        tmp[i] = r;
        sum += r;
    }
    for (int i = 0; i < n; i++) {
        if (tmp[i] > max) {
            max = tmp[i];
            index = i;
        }
    }
    free(tmp);
    *out_max_index = index;
    return sum;
}
In this case, the loop fission looks more complicated. The first loop stores the result of the calculation in a temporary array tmp, which the second loop then uses to find the element with the largest value of calculate(in[i]). In this case we expect the compiler to vectorize the first loop.
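As a quick sanity check, here is a small driver (our own sketch; the input initialization and problem size are arbitrary) that runs both versions and verifies that they find the same maximum element:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const int n = 1000000;                      /* arbitrary problem size */
    double *in = malloc(sizeof(double) * n);
    if (!in) return 1;
    for (int i = 0; i < n; i++) {
        in[i] = (double)(i % 1000) / 1000.0;    /* arbitrary input data */
    }

    int idx_naive, idx_split;
    double sum_naive = search_loop_naive(in, n, &idx_naive);
    double sum_split = search_loop_split(in, n, &idx_split);

    printf("naive: sum=%f, max at index %d\n", sum_naive, idx_naive);
    printf("split: sum=%f, max at index %d\n", sum_split, idx_split);

    free(in);
    return (idx_naive == idx_split) ? 0 : 1;
}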
A few words about loop fission
Loop fission isn’t a solution to every vectorization problem, and you need to pay attention to a few details in order to get it right:
- Loop fission typically puts more pressure on the memory subsystem: splitting the loop into two often means that the same data is read from memory more than once. Because of this, when the code is distributed across multiple CPU cores, the obtained speedup factor might be smaller than for the unfissioned loop; however, the fissioned loop will probably still execute faster (a blocked variant that mitigates the extra memory traffic is sketched after this list).
- The memory access pattern is important for loop fission: loops with an inefficient memory access pattern are bad candidates for the same reason as in the previous bullet, since an inefficient access pattern puts even more pressure on the memory subsystem. You can expect the most improvement from loops with a sequential access pattern; other access patterns will probably not yield performance improvements.
- Loop fission makes sense for non-vectorizable loops with high arithmetic intensity. High arithmetic intensity means that the code performs relatively expensive mathematical operations (divisions, modulo, square roots, trigonometric functions, logarithms) on relatively few input values.
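To illustrate the first point, the extra memory traffic of the fissioned search loop can be reduced by blocking: instead of storing all n intermediate results, the loop is fissioned within fixed-size chunks so the temporary buffer stays cache-resident. The following is our own sketch, not code from the original measurements; the block size is arbitrary and the same calculate() as before is assumed:

#define BLOCK 4096   /* arbitrary block size; small enough to stay in cache */

double search_loop_blocked(double in[], int n, int *out_max_index) {
    int index = -1;
    double max = 0.0;
    double sum = 0.0;
    double tmp[BLOCK];   /* small scratch buffer, reused for every block */
    for (int start = 0; start < n; start += BLOCK) {
        int end = (start + BLOCK < n) ? start + BLOCK : n;
        /* Vectorizable part of the block: no conditionals. */
        for (int i = start; i < end; i++) {
            double r = calculate(in[i]);
            tmp[i - start] = r;
            sum += r;
        }
        /* Non-vectorizable part of the block: scan for the maximum. */
        for (int i = start; i < end; i++) {
            if (tmp[i - start] > max) {
                max = tmp[i - start];
                index = i;
            }
        }
    }
    *out_max_index = index;
    return sum;
}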

Measurements
We performed measurements using three different compilers (GCC 9.3, CLANG 10 and ICC 19.1) [1]. We compared the baseline with the fissioned loop on an input of 100 million doubles. The source code is available in our repository.
Measurements for the loop with a break in its body
Here are the results when the calculate function computes cos(sqrt(fabs(in))) (high arithmetic intensity) for the loop with a conditional break in the loop body:
Compiler | Baseline | With loop fission | Speedup
---|---|---|---
GCC 9.3 | 3.17 s | 0.34 s | 9.3x
CLANG 10 | 3.17 s | 2.71 s | 1.2x
ICC 19.1 | 1.24 s | 0.3 s | 4.1x
GCC and ICC profit a lot from loop fission: the speedup is 9x with GCC and 4x with ICC. With CLANG, the improvement is marginal, but the reason is that CLANG’s math library implementation of cos is inefficient compared to Intel’s (SVML) and GCC’s (libmvec) [2]. After the loop fission, all three compilers vectorized the arithmetically intensive loop, but CLANG didn’t do it efficiently.
What happens when the arithmetic intensity is low? We measured the performance when the calculate function computes fabs(in). This is a simple mathematical operation. Here are the results:
Compiler | Baseline | With loop fission | Speedup
---|---|---|---
GCC 9.3 | 0.11 s | 0.12 s | 0.9x
CLANG 10 | 0.11 s | 0.10 s | 1.1x
ICC 19.1 | 0.11 s | 0.09 s | 1.2x
When the arithmetic intensity of the vectorized part is low, the loop fission doesn’t necessarily pay off.
Measurements for the search loop
Here are the results when the calculate function computes cos(sqrt(fabs(in))) (high arithmetic intensity) for the search-and-add loop:
Compiler | Baseline | With loop fission | Speedup
---|---|---|---
GCC 9.3 | 3.26 s | 0.58 s | 5.6x
CLANG 10 | 3.24 s | 2.93 s | 1.1x
ICC 19.1 | 0.32 s | 0.59 s | 0.5x
In this case, GCC gets a large speedup of over 5x, and there is a very modest speed increase with CLANG. With ICC, the fissioned loop is almost two times slower: ICC managed to vectorize the baseline loop, which results in very good baseline performance. This is a clear example of an advanced compiler vectorizing what a typical compiler doesn’t.
When the arithmetic intensity is low, the numbers look different. When the calculate function computes fabs(in), here are the results:
Compiler | Baseline | With loop fission | Speedup
---|---|---|---
GCC 9.3 | 0.11 s | 0.34 s | 0.3x
CLANG 10 | 0.08 s | 0.32 s | 0.3x
ICC 19.1 | 0.04 s | 0.28 s | 0.1x
In all cases, the version with loop fission is slower.
Summary
Loop fission is a good approach to vectorizing loops that the compiler finds difficult to vectorize automatically. It is applicable to loops with high arithmetic intensity and a sequential memory access pattern. If those two conditions are met, you can expect great performance improvements from the fission.

[1] All the tests were executed on an AMD Ryzen 7 4800H CPU with 16 cores and 16 GB of RAM on Ubuntu 20.04. We disabled processor frequency scaling in order to decrease runtime variance.
[2] This has been fixed in CLANG 12 with the -fveclib=libmvec switch, which allows using GCC’s vector math library with CLANG.