An OpenACC offload version of the function can be generated by the compiler.
In OpenACC, when a loop is offloaded to the GPU, the compiler generates GPU instructions for it, which are distinct from the CPU instructions. Offloaded sections are thus translated into small GPU programs embedded in the main CPU program. The runtime is in charge of executing those programs on the GPU and of moving data between CPU and GPU memory. If an offloaded loop calls functions, GPU versions of those functions must also be created. For the compiler to create a GPU version of a function, the function must be marked with the OpenACC routine directive. When this is not done, the CPU version will be called instead, with a performance loss caused by moving the computation from the GPU to the CPU to execute the function and then back to the GPU once it returns.
Annotate the function with #pragma acc routine.
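As an illustration of the recommendation above, here is a minimal sketch in C. The names `square` and `square_all` are hypothetical and not taken from any real code base. The `seq` clause and the `copyin`/`copyout` data clauses are assumptions made for this example: the function body has no parallelism of its own, and the arrays are assumed not to be on the device already.

```c
#include <stddef.h>

/* Hypothetical helper called from inside the offloaded loop.
 * The routine directive asks the compiler to also build a device version.
 * The seq clause (an assumption for this sketch) states that the function
 * body runs sequentially within each calling thread. */
#pragma acc routine seq
double square(double x) {
    return x * x;
}

/* Hypothetical caller: offloads the loop and calls square() on the device.
 * The data clauses (also an assumption) copy the input array to the GPU
 * and the output array back to the CPU. */
void square_all(const double *in, double *out, int n) {
    #pragma acc parallel loop copyin(in[0:n]) copyout(out[0:n])
    for (int i = 0; i < n; ++i) {
        out[i] = square(in[i]);
    }
}
```

When the code is built without OpenACC support, the pragmas are ignored and it runs on the CPU with the same results, which makes it easy to check correctness before enabling offloading.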