Issue
OpenMP teams should be used to distribute work offloaded to the GPU.
Relevance
GPUs are notably different from CPUs, being composed of a large number of processing units rather than a small number of cores. Moreover, these processing units are organized following a hierarchy within the GPU, which requires some specific setup in order to better exploit their capabilities.
The OpenMP parallel construct specifies a parallel region of the code that will be executed by a team of threads. It is normally accompanied by a worksharing construct so that each thread of the team takes care of part of the work (e.g., the for construct assigns a subset of the loop iterations to each thread). This attains a single level of parallelism, since all the work is distributed across a single team of threads. This works well for multi-core CPUs, but GPUs are composed of a large number of processing units organized into groups that can share memory and synchronize, and this must be taken into account in order to get the best performance out of them.
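For reference, this single level of parallelism can be illustrated with a minimal sketch on the CPU (the function name, the array x, its size n and the scaling factor are assumptions for illustration):

#include <stddef.h>

// One team of threads shares the loop iterations: a single level of
// parallelism, well suited to multi-core CPUs.
void scale(double *x, size_t n, double factor) {
  #pragma omp parallel for default(none) shared(x, n, factor)
  for (size_t i = 0; i < n; i++) {
    x[i] *= factor;
  }
}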
The OpenMP teams distribute construct can be used to introduce an additional level of parallelism by creating multiple teams of threads and distributing loop iterations across them. Each team forms a contention group, meaning that its threads can only synchronize with other threads of the same team. This allows the work to be distributed in a way that better fits the hierarchical organization of the processing units of GPUs. Additionally, using teams enhances performance portability, providing more predictable performance no matter which compiler and hardware combination is used.
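The number of teams and the maximum number of threads per team can also be bounded explicitly through the num_teams and thread_limit clauses. The following is a minimal sketch, again assuming an array x of size n scaled in place (the limits shown are illustrative, and the runtime may create fewer teams or threads than requested):

#pragma omp target teams distribute parallel for \
        num_teams(64) thread_limit(128) \
        map(to: n) map(tofrom: x[0:n])
for (size_t i = 0; i < n; i++) {
  x[i] *= 2.0; // each team processes a chunk of the iterations
}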
Actions
Use the target teams distribute parallel for combined construct to offload work to the GPU using two levels of parallelism.
Code example
The following code offloads a matrix multiplication computation through the target construct, and then creates a parallel region and distributes the work through the for construct (note that the matrices are statically sized arrays):
#pragma omp target map(to: A[0:m][0:p], B[0:p][0:n], m, n, p) map(tofrom: C[0:m][0:n])
{
  // A single team of threads executes the parallel region on the device
  #pragma omp parallel default(none) shared(A, B, C, m, n, p)
  {
    #pragma omp for schedule(auto)
    for (size_t i = 0; i < m; i++) {
      for (size_t j = 0; j < n; j++) {
        for (size_t k = 0; k < p; k++) {
          C[i][j] += A[i][k] * B[k][j];
        }
      }
    }
  } // end parallel
} // end target
When offloading to the GPU, it is recommended to use an additional level of parallelism. This can be achieved by using the teams and distribute constructs, in this case in combination with parallel for:
#pragma omp target teams distribute parallel for map(to: A[0:m][0:p], B[0:p][0:n], m, n, p) \
        shared(A, B, m, n, p) map(tofrom: C[0:m][0:n]) schedule(auto)
for (size_t i = 0; i < m; i++) {
  for (size_t j = 0; j < n; j++) {
    for (size_t k = 0; k < p; k++) {
      C[i][j] += A[i][k] * B[k][j];
    }
  }
}
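Note that offloading must be enabled when compiling, and the exact flags depend on the compiler, its version and the target device. As an illustrative example, with Clang targeting NVIDIA GPUs the code above might be compiled as follows (the file name is a placeholder):

clang -O2 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda matmul.c -o matmul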
Related resources
- PWR009 examples at GitHub
- OpenMP 4.5 Complete Specifications, November 2015 [last checked June 2020]
- Portability of OpenMP Offload Directives – Jeff Larkin, OpenMP Booth Talk SC17, November 2017 [last checked June 2020]
- OpenMP and NVIDIA – Jeff Larkin, NVIDIA Developer Technologies [last checked June 2020]