Most applications are fine when compiled with the -ffast-math
compiler flag, which makes floating-point computations faster at the expense of some precision. In the scientific domain, however, floating-point precision matters, so such codebases are usually compiled without this option.
Still, the question of the performance of such code remains open. Sometimes scientists are willing to sacrifice a bit of precision if that boosts performance. In this post we will talk about how you can use portable compiler pragmas to force vectorization on only some of the loops in your program. By forcing vectorization you trade precision for speed, but the trade is controlled: the precision loss is localized, you can measure it, and you can verify that your program’s numerical output is still correct (although a bit less precise than before).
Before you start
For the compiler to generate efficient code, it needs to be configured properly using compiler flags. Before applying the recommendations we are about to make, make sure you are already using the compilation flags that enable optimizations which improve performance without affecting precision. We covered one such flag in the previous post, and it is a prerequisite for this one, so please make sure you read it before working through this post.
Example
Imagine there is a hot loop in your code that looks like this:
double rolling_average(double a[], int n) {
    double result = 0.0;
    for (int i = 0; i < n; i++) {
        result += sqrt(a[i]) / n;
    }
    return result;
}
One of the optimizations the compiler can do is vectorization. Modern CPUs can process several pieces of data at a time. For example, modern X86-64 processors with Advanced Vector Extensions (most X86-64 processors manufactured in the last five years) can process four doubles with a single instruction. When translating the above code, the compiler might generate the following vectorized version (in pseudo-code):
double rolling_average(double a[], int n) {
    vector4<double> r = { 0.0, 0.0, 0.0, 0.0 };
    for (int i = 0; i < n; i += 4) {
        vector4<double> a_vec = load4<double>(a, i);
        vector4<double> sqrt_vec = sqrt4<double>(a_vec);
        vector4<double> div_vec = sqrt_vec / n;
        r += div_vec;
    }
    double result = r[0] + r[1] + r[2] + r[3];
    return result;
}
The above code represents the vectorized pseudo-assembly the compiler might produce. Here, vector4<double> represents a SIMD register that can hold four double values. We load four doubles from memory (load4), perform four square roots on the loaded values (sqrt4), divide them by the scalar value n, and finally accumulate them into the accumulator register r. Each of these operations is done with one instruction, instead of the four instructions required without vectorization.
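For readers who want to see what this looks like with real instructions, here is a rough hand-written equivalent using AVX intrinsics. This is only an illustrative sketch, not the compiler’s actual output; it assumes an AVX-capable CPU and that n is a multiple of four, and the function name rolling_average_avx is ours.

#include <immintrin.h>

/* Hand-written AVX sketch of the vectorized loop above. */
double rolling_average_avx(double a[], int n) {
    __m256d r = _mm256_setzero_pd();            /* four partial sums        */
    __m256d n_vec = _mm256_set1_pd((double)n);  /* broadcast n to all lanes */
    for (int i = 0; i < n; i += 4) {
        __m256d a_vec = _mm256_loadu_pd(&a[i]);    /* load four doubles  */
        __m256d sqrt_vec = _mm256_sqrt_pd(a_vec);  /* four square roots  */
        r = _mm256_add_pd(r, _mm256_div_pd(sqrt_vec, n_vec));
    }
    double lanes[4];
    _mm256_storeu_pd(lanes, r);  /* horizontal sum of the four lanes */
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}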

Although vectorization usually produces significant performance improvements, there is a caveat. Because of the limited precision of floating-point numbers, the associativity law doesn’t hold exactly. Since the vectorized code performs computations in a different order compared to its non-vectorized counterpart, there will be a slight difference in the result.
Non-vectorized computation order:
a[0] + a[1] + a[2] + a[3] + a[4] + a[5] + a[6] + a[7] + ...
Vectorized computation order:
(a[0] + a[4] + a[8] + ...) + (a[1] + a[5] + a[9] + ...) + ... +
(a[3] + a[7] + a[11] + ...)
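To make the difference concrete, here is a tiny stand-alone snippet (the values are picked purely for illustration) showing that regrouping the same additions can change the result of a double-precision sum:

#include <stdio.h>

int main(void) {
    /* Floating-point addition is not associative: regrouping the same
       three values changes the rounded result. */
    double x = 1e16, y = -1e16, z = 1.0;
    printf("(x + y) + z = %.17g\n", (x + y) + z);  /* typically prints 1 */
    printf("x + (y + z) = %.17g\n", x + (y + z));  /* typically prints 0 */
    return 0;
}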
By default, the compiler won’t perform any optimization, including vectorization, that would break strict IEEE 754 compliance. In this case, the compiler would need to group the additions differently to make the loop vectorizable, but under strict IEEE 754 semantics the results wouldn’t be exactly equal anymore.
If you take a look at the message CLANG reports for one such case, you will see that it suggests using either -ffast-math or a compiler directive as a way of acknowledging that you are OK with a small precision loss and that it can proceed with the vectorization:
precision-omp-simd.c:51:16: remark: loop not vectorized:
cannot prove it is safe to reorder floating-point operations;
allow reordering by specifying '#pragma clang loop
vectorize(enable)' before the loop or by providing the compiler
option '-ffast-math'.
[-Rpass-analysis=loop-vectorize]
result += sqrt(a[i]) / n;
Forcing vectorization
Using -ffast-math will typically cause a small loss of precision throughout your program. Enabling the option globally may therefore be unacceptable, but you can often tolerate the precision loss in certain parts of the code. For those parts, you can instruct the compiler to vectorize with OpenMP SIMD directives (compiler-specific directives are also available in GCC, Clang and ICC, among others).
Compilers will typically vectorize loops automatically, without directives, when doing so doesn’t involve any precision loss. A notable case where compilers don’t vectorize automatically is a loop pattern called scalar reduction: a loop that iterates over an array of values and reduces them to a single value (e.g. calculating the sum of all elements). Compilers won’t vectorize such loops unless we specify compiler flags that break strict IEEE 754 compliance (e.g. -ffast-math). Alternatively, you can force vectorization on important loops with OpenMP compiler directives. In our case, we force vectorization on our loop by applying one such directive, like this:
double rolling_average(double a[], int n) {
    double result = 0.0;
    #pragma omp simd reduction(+:result)
    for (int i = 0; i < n; i++) {
        result += sqrt(a[i]) / n;
    }
    return result;
}
The OpenMP directive #pragma omp simd reduction(+:result), placed right before the loop, tells the compiler to vectorize it regardless of the precision loss that comes with vectorization. The directive consists of the following parts:
- omp simd: vectorize the loop using SIMD vectorization.
- reduction(+:result): each SIMD lane has its own private copy of the variable result. Elements of the array a[0], a[4], a[8] belong to lane 0, elements a[1], a[5], a[9] to lane 1, etc. The computations are performed on each lane separately, and when they are done, the partial results of all lanes are reduced into the single variable result using the + operator (see the scalar sketch after this list).
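To make the lane assignment concrete, here is a scalar sketch of what the reduction clause means conceptually. It is only an illustration, not code the compiler actually emits, and it assumes n is a multiple of four for brevity. Note also that with GCC and Clang the OpenMP SIMD pragma only takes effect when compiling with -fopenmp or -fopenmp-simd.

#include <math.h>

/* Conceptual model of "#pragma omp simd reduction(+:result)":
   four private partial sums (one per SIMD lane) combined with +
   once the loop finishes. Assumes n is a multiple of 4. */
double rolling_average_lanes(double a[], int n) {
    double lane[4] = { 0.0, 0.0, 0.0, 0.0 };
    for (int i = 0; i < n; i += 4) {
        lane[0] += sqrt(a[i + 0]) / n;  /* a[0], a[4], a[8], ...  */
        lane[1] += sqrt(a[i + 1]) / n;  /* a[1], a[5], a[9], ...  */
        lane[2] += sqrt(a[i + 2]) / n;  /* a[2], a[6], a[10], ... */
        lane[3] += sqrt(a[i + 3]) / n;  /* a[3], a[7], a[11], ... */
    }
    return lane[0] + lane[1] + lane[2] + lane[3];  /* final + reduction */
}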
The proposed solution is portable across compilers, including GCC, CLANG, Intel’s compiler and MSVC. Alternatively, you can use compiler-specific pragmas for code vectorization (for instance, #pragma clang loop vectorize(enable) in the case of CLANG), as shown below.
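For completeness, here is the same loop with the Clang-specific pragma mentioned in the compiler remark above (this variant is ignored by other compilers):

#include <math.h>

double rolling_average(double a[], int n) {
    double result = 0.0;
    /* Clang-specific: allow the loop to be vectorized even though it
       reorders floating-point additions. */
    #pragma clang loop vectorize(enable)
    for (int i = 0; i < n; i++) {
        result += sqrt(a[i]) / n;
    }
    return result;
}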
We measured the performance improvement with the newly introduced pragma (source code in our repository). With CLANG 10 our example took 493 ms to execute; when we added the OpenMP SIMD pragma, it took 126 ms. There was a slight loss of precision: the original result was 30894.898582933871, the new value is 30894.898582928719. It is up to the developer to determine whether the precision loss is acceptable or not.
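If you want to check the trade-off in your own code, a small harness like the following makes the difference between the two versions easy to inspect. This is a hypothetical sketch: rolling_average_simd stands for the pragma version above, compiled in the same program, and the test data is arbitrary.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical harness: run the scalar and the SIMD-pragma version on
   the same data and print both results plus their difference. */
double rolling_average(double a[], int n);       /* scalar version           */
double rolling_average_simd(double a[], int n);  /* #pragma omp simd version */

int main(void) {
    int n = 10 * 1000 * 1000;
    double *a = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) {
        a[i] = (double)rand() / RAND_MAX;  /* arbitrary test data */
    }
    double scalar = rolling_average(a, n);
    double simd = rolling_average_simd(a, n);
    printf("scalar = %.17g\nsimd   = %.17g\ndiff   = %.3g\n",
           scalar, simd, scalar - simd);
    free(a);
    return 0;
}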
Summary
Precision is important, but sometimes we are willing to sacrifice a bit of it for a nice improvement in speed. OpenMP SIMD directives, which most compilers support, allow us to force vectorization on the hot parts of the code without introducing imprecision in the rest of the codebase. This solution is portable; alternatively, you can use compiler-specific pragmas to achieve the same effect [1].

References
Run your floating-point calculations with both precision and speed
[1] Compiler-specific pragmas for vectorization are a solution for those codebases that cannot use OpenMP SIMD directives, or for cases where more control over vectorization is needed.