In the previous post we talked about loop unswitching: a technique that the compilers use to make loops faster.** Loop unswitching consists of detecting conditional statements whose condition never changes during the execution of the loop (loop invariant conditions) and then moving the condition outside of the loop**.

Loop unswitching opens the door for other compiler optimizations, notability vectorization. However, sometimes the compilers opt-out of loop unswitching for the reasons we talked about in the previous post. If this is the case with your hot code, you can take the matters into your own hands and manually unswitch your hot loop.

In this post we talk about** manual loop unswitching, and later we investigate the performance of loop unswitching on an example inspired by the code of one of our customers**.

**Manual loop unswitching**

As you may have seen from the previous examples, loop unswitching is a fragile compiler optimization that often may not take place. However, there are ways to force it on the loops on the hot path.

Here is an example we used in the previous post to demonstrate loop unswitching:

```
int calculate_sum(int* a, int n, bool only_positives) {
int sum = 0;
for (int i = 0; i < n; i++) {
if (only_positives) {
if (a[i] > 0) {
sum += a[i];
}
} else {
sum += a[i];
}
}
return sum;
}
```

The variable `only_positives`

is a function parameter that never changes its value during the execution of the loop at line 4.

The compiler might or might not perform the loop unswitching automatically. However, you can manually perform the loop unswitching for high performance.

The simplest way to do it is to extract the loop invariant condition outside of the loop and create two loops, one per branch:

```
int calculate_sum(int* a, int n, bool only_positives) {
int sum = 0;
if (only_positives) {
for (int i = 0; i < n; i++) {
if (a[i] > 0) {
sum += a[i];
}
}
} else {
for (int i = 0; i < n; i++) {
sum += a[i];
}
}
return sum;
}
```

In the above code, we created two versions of the loop, one where the `only_positives`

is true and the other where it is false. We removed the unreachable code from the inside of the loop. The loop body is simpler, smaller and generally easier for the compiler to optimize.

**✉ Subscribe! Code performance tips for your inbox.**

However, this kind of manual loop unswitching results in duplicate loops and code that is more difficult to maintain. Luckly, there are ways to do it without code duplication.

If you are using C++, it is really simple to do it manually using templates. The original example rewritten with templates looks like this:

```
template <bool only_positives>
int calculate_sum(int* a, int n) {
int sum = 0;
for (int i = 0; i < n; i++) {
if (only_positives) {
if (a[i] > 0) {
sum += a[i];
}
} else {
sum += a[i];
}
}
return sum;
}
```

The only difference from the original code is that `only_positives`

is now a template parameter. The compiler basically creates two copies of the same function, one in with a constant value of true for `only_positives`

, the other with false. In order to invoke the function, the template parameter must be specified (eg. `calculate_sum<true>(a, n)`

) thus matching the proper copy of the function. The values of template parameters are known at compile time; if the value of `only_positives`

is true, the compiler will perform dead code elimination and remove the `else`

branch (lines 9-11). Otherwise, it will perform dead code elimination on the `if`

branch (lines 5-8).

If your project is using C, the same behavior can be made using macros. You can create a function `calculate_sum_only_positives_true`

for the case where the parameter is true and `calculate_sum_only_postives_false`

for the case when the parameter is false. Here is an example on how to do it:

```
#define CALCULATE_SUM(only_positives) \
int calculate_sum_only_positives_##only_positives(int* a, int n) { \
int sum = 0; \
for (int i = 0; i < n; i++) { \
if (only_positives) { \
if (a[i] > 0) { \
sum += a[i]; \
} \
} else { \
sum += a[i]; \
} \
} \
\
return sum; \
}
CALCULATE_SUM(true)
CALCULATE_SUM(false)
```

The above code snippet does exactly the same as C++ version. It creates two functions with names `calculate_sum_only_positives_true`

and `calculate_sum_only_postives_false`

. Dead code elimination makes sure that unused branches are cut off.

Alternatively, you can do loop unswitching inside the function:

```
int calculate_sum(int* a, int n, bool only_positives) {
int sum = 0;
#define BODY(only_positives_m) \
for (int i = 0; i < n; i++) { \
if (only_positives_m) { \
if (a[i] > 0) { \
sum += a[i]; \
} \
} else { \
sum += a[i]; \
} \
}
if (only_positives) {
BODY(true);
} else {
BODY(false);
}
#undef BODY
return sum;
}
```

We define a macro named `BODY`

that accepts argument `only_positives_m`

. On line 14 we check the value of function parameter `only_positives`

. If it is true, we instantiate the macro by providing the value true for `only_positives_m`

, otherwise, we instantiate the macro by providing false for `only_positives_m`

. The compiler will automatically eliminate branches with dead code.

Automatic loop unswitching done by the compilers closely resembles the loop unswitching we did here.

**How much does loop unswitching improve performance?**

We wanted to test the effect of loop unswitching on performance of our code. For this, we created an example inspired by the codebase provided to us by one of our customers. The source code is available in our repository. The code consists of two functions `array_calculation`

and `single_calculation`

. Function `array_calculation`

calls `single_calculation`

in every iteration of the loop. Here is the source code of `single_operation`

:

```
enum single_operation_e {
ADD_SUB,
MUL_DIV,
SIN_COS,
SQRT_ABS,
};
enum single_operation_e single_operation;
void single_calculation(int in1, int in2, float* out1, float* out2) {
switch (single_operation) {
case ADD_SUB:
*out1 = in1 + in2;
if (out2) {
*out2 = in1 - in2;
}
break;
case MUL_DIV:
*out1 = in1 * in2;
if (out2) {
*out2 = in1 / (in2 + (in2 == 0));
}
break;
case SIN_COS:
*out1 = sinf(in1);
if (out2) {
*out2 = cosf(in1);
}
break;
case SQRT_ABS:
*out1 = abs(in1);
if (out2) {
*out2 = sqrt(abs(in2));
}
break;
}
}
```

The function `single_calculation`

modifies its behavior based on two variables: the `single_operation`

(line 8) global variable, which has an enum type that determines what kind of operations we want to do, and the `out2`

(line 10) function parameter, to which the function will write an additional computed when its value is not `NULL`

.

The source code of the function `array_calculation`

looks like this:

```
enum array_operation_e {
ADD,
SUBTRACT,
REPLACE,
};
enum array_operation_e array_operation;
void array_calculation(float* a, float* b, int n, bool set_b) {
float r1, r2;
for (int i = 0; i < n; i++) {
single_calculation(i, i + 1, &r1, &r2);
if (array_operation == ADD) {
a[i] += r1;
if (set_b) {
b[i] += r2;
}
} else if (array_operation == SUBTRACT) {
a[i] -= r1;
if (set_b) {
b[i] -= r2;
}
} else {
a[i] = r1;
if (set_b) {
b[i] = r2;
}
}
}
}
```

This function also behaves differently based on two variables: `array_operation`

which is an enum variable defined in the global memory (line 7) and `set_b`

which is a boolean function parameter. Note that if `set_b`

is false, we are not using the value `r2`

returned by `single_calculation`

(lines 15, 20 and 25).

It is possible to manually do loop unswitching by creating a different function for each possible combination of parametrization values. For example, we can create one function `single_calculation_MUL_DIV_true`

which has a value `MUL_DIV`

for `single_operation`

and a variable `out2`

is not null, and another function `single_calculation_ADD_SUB_false`

where `single_operation`

is `ADD_SUB`

and variable `out2`

is null. We remove all the dead code manually as well as all invariant conditions.

Sometimes, this can be the way to go. But often, this approach leads to unmaintainable code. In C, we can use macros to create a generalized function body that we later specialize with macro parameters. In the case of the `single_calculation`

function, two parametrization variables `single_operation`

and `out2`

would yield eight different versions of the function `single_calculation`

. Here is how to do it with C macros:

```
#define SINGLE_CALCULATION(operation, set_out2) \
void single_calculation_##operation##_##set_out2( \
int in1, int in2, float* out1, float* out2) { \
switch (operation) { \
case ADD_SUB: \
*out1 = in1 + in2; \
if (set_out2) { \
*out2 = in1 - in2; \
} \
break; \
case MUL_DIV: \
*out1 = in1 * in2; \
if (set_out2) { \
*out2 = in1 / (in2 + (in2 == 0)); \
} \
break; \
case SIN_COS: \
*out1 = sinf(in1); \
if (set_out2) { \
*out2 = cosf(in1); \
} \
break; \
case SQRT_ABS: \
*out1 = abs(in1); \
if (set_out2) { \
*out2 = sqrt(abs(in2)); \
} \
break; \
} \
}
SINGLE_CALCULATION(ADD_SUB, true)
SINGLE_CALCULATION(ADD_SUB, false)
SINGLE_CALCULATION(MUL_DIV, true)
SINGLE_CALCULATION(MUL_DIV, false)
SINGLE_CALCULATION(SIN_COS, true)
SINGLE_CALCULATION(SIN_COS, false)
SINGLE_CALCULATION(SQRT_ABS, true)
SINGLE_CALCULATION(SQRT_ABS, false)
```

The macro `SINGLE_CALCULATION`

(lines 1-30) specifies a common body for all the later generated functions. At lines 32-39 we generate eight versions of the `single_calculation`

function, each version having a different name. For example, the function with the name `simple_calculation_ADD_SUB_true`

means that `single_operation`

has a value `ADD_SUB`

and `set_out2`

has a value `true`

. Dead code elimination makes sure that unused `switch`

branches (lines 11-28) and `set_out2`

condition check (line 7) are eliminated.

**Building performance into the code from day one with Codee Request a demo➝**

Function `array_calculation`

passes parameters to `single_calculation`

and therefore, we need to also make it parametrized in compile time. This is achieved similarly as with `single_calculation`

:

```
#define ARRAY_CALCULATION(array_operation, single_operation, set_b) \
void array_calculation_##array_operation##_##single_operation##_##set_b( \
float* a, float* b, int n) { \
float r1, r2; \
for (int i = 0; i < n; i++) { \
single_calculation_##single_operation##_##set_b(i, i + 1, &r1, \
&r2); \
if (array_operation == ADD) { \
a[i] += r1; \
if (set_b) { \
b[i] += r2; \
} \
} else if (array_operation == SUBTRACT) { \
a[i] -= r1; \
if (set_b) { \
b[i] -= r2; \
} \
} else { \
a[i] = r1; \
if (set_b) { \
b[i] = r2; \
} \
} \
} \
}
ARRAY_CALCULATION(ADD, ADD_SUB, true)
ARRAY_CALCULATION(ADD, ADD_SUB, false)
ARRAY_CALCULATION(ADD, MUL_DIV, true)
ARRAY_CALCULATION(ADD, MUL_DIV, false)
ARRAY_CALCULATION(ADD, SIN_COS, true)
ARRAY_CALCULATION(ADD, SIN_COS, false)
ARRAY_CALCULATION(ADD, SQRT_ABS, true)
ARRAY_CALCULATION(ADD, SQRT_ABS, false)
ARRAY_CALCULATION(SUBTRACT, ADD_SUB, true)
ARRAY_CALCULATION(SUBTRACT, ADD_SUB, false)
ARRAY_CALCULATION(SUBTRACT, MUL_DIV, true)
ARRAY_CALCULATION(SUBTRACT, MUL_DIV, false)
ARRAY_CALCULATION(SUBTRACT, SIN_COS, true)
ARRAY_CALCULATION(SUBTRACT, SIN_COS, false)
ARRAY_CALCULATION(SUBTRACT, SQRT_ABS, true)
ARRAY_CALCULATION(SUBTRACT, SQRT_ABS, false)
ARRAY_CALCULATION(REPLACE, ADD_SUB, true)
ARRAY_CALCULATION(REPLACE, ADD_SUB, false)
ARRAY_CALCULATION(REPLACE, MUL_DIV, true)
ARRAY_CALCULATION(REPLACE, MUL_DIV, false)
ARRAY_CALCULATION(REPLACE, SIN_COS, true)
ARRAY_CALCULATION(REPLACE, SIN_COS, false)
ARRAY_CALCULATION(REPLACE, SQRT_ABS, true)
ARRAY_CALCULATION(REPLACE, SQRT_ABS, false)
```

The common body of all `array_calculation`

functions is defined in macro `ARRAY_CALCULATION`

(lines 1-25). The function behavior is modified using three variables: `array_calculation`

, `single_operation`

and `set_b`

. Two of the variables are passed later to `single_calculation`

: `single_operation`

and `set_b`

.

The compiler needs to generate twenty four versions of `array_calculation`

(lines 27-50). Dead code elimination makes sure unused branches are removed. Note on line 6, that, for example, function `array_calculation_REPLACE_MUL_DIV_true`

calls the function `single_calculation_MUL_DIV_true`

. The value for variables single_operation which is MUL_DIV and for the set_out2 which is true are not passed to the function as function arguments; it is the name of the function (single_calculation_MUL_DIV_true) that marks what are the values of the two parameters inside.

**Experiment results**

Before talking about experiment results, a note:** in our experiment, all the code was inside one compilation unit **(a compilation unit is a single source file with all the headers inlined). This made it **easier for the compiler to do the optimizations, especially when it comes to figuring out if a condition is loop invariant or not.** In a real-world code base, the compiler will struggle to detect loop invariants involving several compilation units unless you enable Link-Time Optimizations (LTO), as explained in our previous post.

**✉ Subscribe! Don’t miss our next blog post.**

**The performance of loop unswitching depends**, as you will see later, **on the compiler**. Some compilers do this optimization more aggressively than others. We tested with three compilers: GCC 9.3.0, CLANG 10 and ICC 19.1. The experiments were executed on AMD Ryzen 7 4800H CPU with 16 cores and 16 GB of RAM memory on Ubuntu 20.04. We disabled processor frequency scaling in order to decrease runtime variance

The function `array_calculation`

receives two arrays as arguments, array `a`

and array `b`

. The function iterates over both arrays simultaneously, and for each pair `a[i]`

and `b[i]`

it calls the function `simple_calculation`

. Function `simple_calculation`

returns two values. Depending on the value of parameter `array_operation`

we either add to, subtract from or replace the original values `a[i]`

and `b[i]`

. If the parameter `set_b`

is true, we set both values for `a`

and `b`

, otherwise we skip changing the value for `b`

.

The function `simple_calculation`

returns two outputs using pointers `out1`

and `out2`

. If `out2`

is `NULL`

, we skip the calculation needed for `out2`

. The parameter `simple_operation`

defines what kind of operation is going to be performed on the input. We have four values for this parameter: ADD_SUB (add and subtract), MUL_DIV (multiply and divide), SIN_COS (sine and cosine) and SQRT_ABS (square root and abs).

As far as computational intensity is concerned, we expect that the runtime will be the longest when `simple_operation`

is `SIN_COS`

, followed by `MUL_DIV`

, `SQRT_ABS`

and `ADD_SUB`

. If parameter `set_b`

is true, we expect that the runtime is longer than when its value is false. As far as `array_calculation`

is concerned, we don’t expect a significant difference in runtimes since the computation is very light in all three cases (addition, subtraction and move).

We compared the runtimes of manual unswitching and automatic compiler unswitching. The full results for each possible parameter configuration are available in this link. Here we give a summary of the results.

**The fastest speedup was 21.2 times, the slowest speedup was close to 1 (no speedup)**. There was no speedup in 8.3 % of the measured configurations and the average speedup was 4.8. We didn’t see any case where there was a performance regression (ie. slowdown) related manual loop unswitching. Here is the speedups by percentile for all three compilers together:

Percentile | Speedup |
---|---|

0.1 | 1.09x |

0.25 | 1.80x |

0.5 | 3.51x |

0.75 | 6.9x |

0.9 | 10.35x |

Out of all measured samples, 90% had a speed up greater than 1.09, 75% had a speed up greater than 1.8, 50% had a speed up greater than 3.51, 25% had a speed up greater than 6.9 and 10% had a speed up greater than 10.35.

**Loop unswitching alone does provide some performance increase. However, the most significant speedups achieved here are a consequence of the reduced complexity of the loops thanks to the unswitching**. This opens the door to many other compiler optimizations that the compiler was unable to perform for the original, more complex, loop. **One notable such optimization is vectorization, as it is known that vectorized code can be several times faster than its non-vectorized counterpart.**

**Efficacy of the loop unswitching per compiler**

Percentiles per compiler give a picture of the compiler efficacy in doing loop unswitching automatically:

Percentile | GCC speedup | CLANG speedup | INTEL speedup |
---|---|---|---|

0.1 | 1.0x | 1.1x | 3.14x |

0.25 | 1.15x | 1.72x | 3.90x |

0.5 | 1.89x | 2.47x | 6.67x |

0.75 | 7.25x | 4.37x | 8.54x |

0.9 | 10.16x | 4.73x | 13.01x |

GCC seems to be the best compiler when it comes to doing automatic loop unswitching. In 50% of the measured configurations the speed up was less than 1.89. Intel’s compiler seems to be the worst, in 90% of the cases the speed up was greater than 3.14. CLANG seems to generate the most efficient code with the default settings. In our measurements, only in 10% of the measured configurations the speedup was larger than 4.73.

**Conclusion**

**Loop unswitching works by moving out loop invariant conditionals outside of the loop and generating a version of the loop for each possible condition value**. We showed techniques to manually perform loop unswitching to make sure the compiler doesn’t skip them and we measured the performance benefits of manual loop unswitching, which experimentation proved to be substantial.

Even in a simple testing environment such as ours, it is clear that the compilers use loop unswitching very sparingly. The reason is understandable: the compilers need to balance between speed and binary size**. Most of the time the compiler doesn’t know which loops are hot. If it would perform loop unswitching on all the loops, the binary size would grow too much. Therefore, most compilers will tend to avoid this optimization. However, this optimization is too important to miss out on your hot path and it is most likely worth performing manually.**

**Codee ****can help you with this**, for instance through its PWR022: Move invariant conditional out of the loop to facilitate vectorization recommendation. This will **hint you which loops have invariant conditions which can be extracted out of the loop.** By carefully inspecting your hotspot for these you can get significant performance gains!

**Building performance into the code from day one with Codee**

## Leave a Reply