Loop fission is a code transformation technique that splits a loop into multiple loops over the same index range, each one taking only a part of the original loop body. The goal is to isolate statements that can take advantage of compiler and hardware optimizations, separating them from the statements that prevent those optimizations. It is worth exploring the application of loop fission in code with the following features:
- Loop-carried dependencies between statements of the loop, e.g. when the data needed in the current iteration depends on data computed in a previous iteration.
- Conditional statements in the loop body that depend on input data. The execution path of the code cannot be anticipated at compile time, so the compiler may be unable to apply some performance optimizations.
- Non-sequential memory accesses in the loop. Such accesses do not exhibit good locality of reference, so the code does not take full advantage of the hardware cache memory and compiler optimizations may be disabled.
Loop fission can lead to more efficient code through better use of memory, vectorization and offloading to accelerators.
Note
Loop fission introduces overheads (e.g. loop control increment and branching), so in general it is necessary to run and benchmark the code to determine whether loop fission brings a performance gain.
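As a minimal sketch of the transformation itself, the C code below shows one loop and a fissioned equivalent. The function names `fused` and `fissioned` and the arrays `a`, `b`, `c` are hypothetical and chosen only for illustration. The sketch assumes the arrays do not overlap in memory, so the two statements are independent and splitting them does not change the result.

```c
#include <stddef.h>

/* Original loop: two unrelated statements share one loop body.
 * Hypothetical example; assumes a, b and c do not overlap. */
void fused(size_t n, const double *a, double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        b[i] = 2.0 * a[i];       /* independent per-element work */
        if (a[i] > 0.0)          /* branch that depends on input data */
            c[i] = c[i] + a[i];
    }
}

/* After fission: two loops over the same index range,
 * each taking one part of the original loop body. */
void fissioned(size_t n, const double *a, double *b, double *c)
{
    /* Branch-free loop, now isolated from the conditional statement. */
    for (size_t i = 0; i < n; i++)
        b[i] = 2.0 * a[i];

    /* The data-dependent conditional lives in its own loop. */
    for (size_t i = 0; i < n; i++) {
        if (a[i] > 0.0)
            c[i] = c[i] + a[i];
    }
}
```

Both functions produce the same `b` and `c`; whether the fissioned version is faster depends on the compiler and hardware, so it should be benchmarked.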
Better memory efficiency through loop fission
Making efficient use of memory is essential to writing performant code for modern hardware. Loop fission produces smaller loops, each touching fewer data streams, which can improve locality of reference. It can also help in multithreaded code running on multi-core processors, as each thread may benefit from the same single-core optimizations.
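The following C sketch illustrates one way this can look, assuming a loop that mixes sequential accesses with an indirect, non-sequential read. The names `gather_fused`, `gather_split`, `table` and `idx` are hypothetical, and the arrays are assumed not to overlap. After fission, the first loop only streams through `x` and `y`, and the scattered reads of `table` are confined to the second loop.

```c
#include <stddef.h>

/* Original loop: sequential accesses and an indirect read are mixed.
 * Hypothetical example; assumes the arrays do not overlap and that
 * every idx[i] is a valid index into table. */
void gather_fused(size_t n, const double *x, double *y,
                  const double *table, const size_t *idx, double *z)
{
    for (size_t i = 0; i < n; i++) {
        y[i] = y[i] + 3.0 * x[i];   /* sequential accesses */
        z[i] = table[idx[i]];       /* non-sequential read */
    }
}

/* After fission: each loop has a smaller, more regular working set. */
void gather_split(size_t n, const double *x, double *y,
                  const double *table, const size_t *idx, double *z)
{
    /* Only sequential accesses to x and y. */
    for (size_t i = 0; i < n; i++)
        y[i] = y[i] + 3.0 * x[i];

    /* The non-sequential reads are isolated here. */
    for (size_t i = 0; i < n; i++)
        z[i] = table[idx[i]];
}
```

Whether the split version actually uses the cache better depends on the array sizes and the access pattern in `idx`, so this is something to measure rather than assume.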
Enabling vectorization through loop fission
Making efficient use of vectorization is essential to writing performant code for modern hardware. Loop fission makes it possible to split a non-vectorizable loop into two or more loops. The goal of the fission is to isolate the statements that prevent vectorization in a dedicated loop. This enables vectorization of the remaining loops, which can lead to speed improvements.
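A minimal C sketch of this idea follows, assuming a loop whose first statement carries a dependence from one iteration to the next. The function names `running_fused` and `running_fissioned` are hypothetical, and `restrict` states the assumption that the arrays do not overlap. After fission, the recurrence stays in a sequential loop, while the second loop has independent iterations and becomes a candidate for vectorization.

```c
#include <stddef.h>

/* Original loop: the recurrence on sum prevents vectorizing the body.
 * Hypothetical example; the caller is assumed to have set sum[0]. */
void running_fused(size_t n, const double *restrict a,
                   double *restrict sum, double *restrict out)
{
    for (size_t i = 1; i < n; i++) {
        sum[i] = sum[i - 1] + a[i];   /* loop-carried dependence */
        out[i] = a[i] * a[i] + 1.0;   /* independent of other iterations */
    }
}

/* After fission: the dependence is isolated in a dedicated loop. */
void running_fissioned(size_t n, const double *restrict a,
                       double *restrict sum, double *restrict out)
{
    /* Sequential loop: each iteration needs the previous result. */
    for (size_t i = 1; i < n; i++)
        sum[i] = sum[i - 1] + a[i];

    /* Independent iterations: a candidate for vectorization. */
    for (size_t i = 1; i < n; i++)
        out[i] = a[i] * a[i] + 1.0;
}
```

Some compilers may perform this split on their own at high optimization levels; a vectorization report from the compiler is a reasonable way to check what happened to each loop.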
Loop fission and offloading to accelerators
Typically, the term offloading refers to moving the execution of a program, or a part of it, from the CPU to hardware accelerators such as GPUs. In contrast to CPUs, which offer limited parallelism, GPUs and other accelerators are massively parallel architectures that can speed up certain types of problems. The CPU and the accelerator have distinct memories, and the major limiting factor for offloading performance is data movement, that is, copying data from the memory of one to the memory of the other.
Writing performant code for accelerators is a complex and time-consuming undertaking. Efficient GPU code must make good use of memory through locality of reference, of vectorization to enable fast execution in the GPU cores, and of parallelism to exploit the computational power of the thousands of GPU cores typically available in the hardware. Loop fission is therefore a useful code transformation in the scope of GPU programming, since it can isolate the parts of a loop that are suitable for offloading.
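As an illustration only, the C sketch below assumes OpenMP target directives as the offloading mechanism; the text above does not prescribe a particular programming model. The function name `recurrence_and_square` and the arrays `x`, `y`, `acc` are hypothetical. It shows the result of fissioning a loop that contained both a recurrence and an independent computation: the recurrence stays on the CPU, and only the loop with independent iterations is offloaded, with `map` clauses stating the data movement explicitly.

```c
#include <stddef.h>

/* Result of loop fission, assuming OpenMP offloading (an assumption of
 * this sketch). The caller is assumed to have set acc[0], and the
 * arrays are assumed not to overlap. */
void recurrence_and_square(size_t n, const double *x, double *y, double *acc)
{
    /* Loop-carried dependence: kept on the CPU as a sequential loop. */
    for (size_t i = 1; i < n; i++)
        acc[i] = 0.5 * acc[i - 1] + x[i];

    /* Independent iterations: offloaded to the accelerator.
     * x is copied to the device, y is copied back to the host. */
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(from: y[0:n])
    for (size_t i = 0; i < n; i++)
        y[i] = x[i] * x[i];
}
```

If the code is compiled without OpenMP support, the directive is ignored and the second loop simply runs on the CPU. Because the copies of `x` and `y` are part of the cost, the offloaded version needs benchmarking against the CPU-only one.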