Best practices for performance


Modern CPU cores have vector units that can process several pieces of data with a single instruction. For example, an x86-64 CPU with AVX has instructions that operate on four double values at once: a vector load instruction loads four double values into a vector register, a vector add instruction adds two registers that each hold four double values, and so on.

Using vector instructions to speed up computation is called vectorization. When the compiler does this on its own, emitting vector instructions instead of scalar ones for code that meets certain conditions, it is called autovectorization.

For the compiler to be able to vectorize a loop automatically, certain conditions must be fulfilled:

  • The loop must be countable: the number of iterations must be known before the loop is entered (it does not have to be a compile-time constant).
  • The loop should not have loop-carried dependencies: if an iteration needs a value computed by an earlier iteration, the iterations cannot be executed together in one vector instruction.
  • The loop should not contain complex conditional statements based on data: conditional statements based on data mean that the CPU does not execute the same instructions for every element. Although the CPU can emulate branching by executing both sides of the branch and then using masking to select the correct result, in practice vectorization doesn’t pay off in the presence of complex conditional statements.
  • The data used by the loop should be accessed sequentially: sequential memory access makes the best use of the available memory bandwidth. Accessing memory in any other way puts additional pressure on the memory subsystem, and vectorization often doesn’t pay off.
  • Arrays and vectors used in loops should not alias each other: pointer aliasing is one of the reasons why compilers cannot vectorize a loop automatically, since in the presence of pointer aliasing the compiler cannot guarantee the correctness of the results.
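
To make the conditions above concrete, here is a small C++ sketch contrasting one loop that compilers can typically autovectorize with three loops that each break one condition. All function names are hypothetical and chosen only for illustration. `__restrict` is a compiler extension (accepted by GCC, Clang and MSVC), not standard C++, and whether a given loop is actually vectorized depends on the compiler, its version and the optimization flags.

```cpp
#include <cassert>
#include <cstddef>

// Vectorization-friendly: countable loop, independent iterations,
// sequential access. __restrict (a compiler extension) promises that
// out, a and b do not overlap, so the compiler need not assume aliasing.
void scale_add(double* __restrict out, const double* __restrict a,
               const double* __restrict b, double k, std::size_t n) {
    assert(n == 0 || (out != nullptr && a != nullptr && b != nullptr));
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = k * a[i] + b[i];
    }
}

// Loop-carried dependency: iteration i reads the value that
// iteration i - 1 has just written, so iterations cannot run side by side.
void prefix_sum(double* a, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i) {
        a[i] += a[i - 1];
    }
}

// Not countable: the loop exits as soon as it sees a negative value,
// so the trip count depends on the data and is unknown at loop entry.
std::size_t find_first_negative(const double* a, std::size_t n) {
    std::size_t i = 0;
    while (i < n && a[i] >= 0.0) {
        ++i;
    }
    return i;
}

// Non-sequential access: the elements of a are read in the order given
// by idx, so neighbouring iterations may touch distant memory locations.
void gather(double* __restrict out, const double* __restrict a,
            const std::size_t* __restrict idx, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[idx[i]];
    }
}
```

Calling `scale_add` with overlapping arrays would break the promise made by `__restrict` and is undefined behaviour, so the qualifier should only be used when the caller can guarantee that the arrays are distinct.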

Compilers sometimes fail to vectorize a loop automatically. In that case, compiler pragmas can be used either to force vectorization or to provide the additional information the compiler needs. The OpenMP standard defines portable vectorization pragmas, and there are compiler-specific vectorization pragmas as well.
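
As an illustration of the portable route, the sketch below uses the OpenMP `simd` pragma on a dot product. The function name `dot` is hypothetical. A floating-point sum is a reduction: vectorizing it changes the order of the additions, which compilers usually avoid by default, and the `reduction(+:sum)` clause explicitly allows it. GCC and Clang honour the pragma when built with `-fopenmp` or `-fopenmp-simd`; without those flags the pragma is ignored and the loop stays a plain scalar loop.

```cpp
#include <cassert>
#include <cstddef>

// Dot product of two arrays of length n.
double dot(const double* a, const double* b, std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr));
    double sum = 0.0;
    // Ask the compiler to vectorize the loop and to treat sum as a
    // reduction variable (partial sums per vector lane, combined at the end).
    #pragma omp simd reduction(+:sum)
    for (std::size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Because the additions may be reordered, the result can differ in the last bits from the strictly sequential sum. Compiler-specific alternatives include `#pragma GCC ivdep` for GCC and `#pragma clang loop vectorize(enable)` for Clang.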

Apart from relying on autovectorization, the developer can write code that uses vector instructions directly: in pure assembly, with C/C++ compiler intrinsics, or with a vectorization framework such as EVE.
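
A minimal sketch of the intrinsics approach is shown here: it adds two arrays four double values at a time using AVX load, add and store intrinsics. The function name `add_arrays` is hypothetical. The AVX path is compiled only when AVX is enabled (for example with `-mavx` on GCC or Clang); otherwise the function falls back to the scalar loop, which also handles the leftover elements when the length is not a multiple of four.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Computes c[i] = a[i] + b[i] for i in [0, n).
void add_arrays(const double* a, const double* b, double* c, std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr && c != nullptr));
    std::size_t i = 0;
#if defined(__AVX__)
    // Vector loop: each iteration processes 4 doubles.
    for (; i + 4 <= n; i += 4) {
        __m256d va = _mm256_loadu_pd(a + i);  // load 4 doubles (unaligned)
        __m256d vb = _mm256_loadu_pd(b + i);  // load 4 doubles (unaligned)
        __m256d vc = _mm256_add_pd(va, vb);   // 4 additions in one instruction
        _mm256_storeu_pd(c + i, vc);          // store 4 results
    }
#endif
    // Scalar loop: remaining elements, or everything when AVX is disabled.
    for (; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

The unaligned load and store variants are used so that the arrays need no particular alignment. Intrinsics tie the code to one instruction set, which is the main reason to consider a framework that hides such details behind a portable interface.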