Last updated on September 15, 2022
In this blog post, we show how to get started with Codee using the Canny edge detection image processing algorithm. You will see how it supports the performance optimization roadmap by providing human-readable actionable items to enable the optimization of sequential code as well as the exploitation of the parallelism available in the target hardware platform.
A structured report displays the actionable items for the optimization of the sequential code, including faster scalar processing, simpler control flow and more efficient usage of the memory of the computer. The report also displays items for the optimization of parallel processing through vectorization, multithreading and offloading. You will see how to dig deeper into the details of each human-readable actionable item in order to understand where the performance issue is, what it is about and how to fix it.
Code structure
You can find the Canny code in our performance-demos public repository. It consists of a single canny.c file and a Makefile to build and run the code for a test image. Please note that the image is located in the parent folder in the repository and you need to ensure that it is located there for the make run target to work.
The algorithm itself essentially highlights the edges of a given image, which is illustrated in the following image:

How to build and run
As described before, there is a Makefile that you can use to build and run the code. Invoking make all, or simply make as in the session below, runs every target: it cleans any previous build, builds the binary and runs it on the test image:
$ make
rm -fr canny testvecs
cc canny.c -fopenmp -O3 -lm -o canny
unzip ../15360_8640.zip
Archive: ../15360_8640.zip
creating: testvecs/
creating: testvecs/input/
inflating: testvecs/input/15360_8640.pgm
./canny testvecs/input/15360_8640.pgm 0.5 0.7 0.9
Total time: 11.688
Using Codee
1. Ensure you have the latest version
Please run the following to ensure that you have the latest version of Codee matching the steps and outputs described here:
$ pwreport --version
Codee: 1.5.1 (rev 31fd1db053e6)
C/C++/Fortran source code syntactic parsing based on:
- Clang 13.0
- Flang 'fir-dev' branch (rev 0b95852b0d00)
- LLVM 13.0
Supported scientific libraries:
- libmath C11
- CBLAS (levels 1, 2 and 3) 3.10.0
- OpenMP 4.5
- OpenACC 2.7
Vectorization diagnosis of compilers:
- gcc 4.8.1 - 11.2
- clang 3.5 - 12.0.1
- icc 19.0 - 2021.1
Note that the output also shows the supported versions of third-party software tools, like compilers and parallel programming application program interfaces.
2. Get the performance optimization report of the whole code
You should always start by invoking the pwreport tool with the --screening flag:
$ pwreport --screening canny.c
SCREENING REPORT
Target  Lines of code Analyzed lines Analysis time # actions Effort Cost    Profiling
------- ------------- -------------- ------------- --------- ------ ------- ---------
canny.c 656           252            154 ms        91        264 h  8639€   n/a
------- ------------- -------------- ------------- --------- ------ ------- ---------
Total   656           252            154 ms        91        264 h  8639€   n/a
ACTIONS PER STAGE OF THE PERFORMANCE OPTIMIZATION ROADMAP
Target  Scalar Control Memory Vector Multi Offload
------- ------ ------- ------ ------ ----- -------
canny.c 22     45      8      15     n/a   n/a
------- ------ ------- ------ ------ ----- -------
Total   22     45      8      15     n/a   n/a
TOTAL NUMBER OF LOOPS FOR CODEE AUTO AND GUIDED MODES
                      --------------- # actions ----------------
Codee support # Loops Scalar Control Memory Vector Multi Offload
------------- ------- ------ ------- ------ ------ ----- -------
Auto          1       0      0       0      2      n/a   n/a
Guided        28      20     39      8      6      n/a   n/a
Develop       5       1      5       0      7      n/a   n/a
Roadmap       7       0      0       0      0      n/a   n/a
Target : analyzed directory or source code file
Lines of code : total lines of code found in the target (computed the same way as the sloccount tool)
Analyzed lines : relevant lines of code successfully analyzed
Analysis time : time required to analyze the target
# actions : total actionable items (opportunities, recommendations, defects and remarks) detected
Effort : estimated number of hours it would take to carry out all actions (scalar, control, memory, vector, multi and offload with 1, 2, 4, 8, 12 and 16 hours respectively)
Cost : estimated cost in euros to carry out all the actions, paying the average salary of 56,286€/year for a professional C/C++ developer working 1720 hours per year
Profiling : estimation of overall execution time required by this target
Codee support : Codee support level given for the loop:
- Auto: the loop can be automatically optimized by Codee auto mode (pwdirectives --auto)
- Guided: Codee assists the user in order to optimize the loop
- Develop: Codee lacks support for the loop, but the feature is under development
- Roadmap: Codee lacks support for the loop, and the feature is on the long term roadmap
SUGGESTIONS
Use --level 1|2 to get more details, e.g:
pwreport --level 2 --screening canny.c
You can specify multiple inputs which will be displayed as multiple rows (ie. targets) in the table, eg:
pwreport --screening some/other/dir canny.c
Use --actions to find out details about the detected actions:
pwreport --actions canny.c
You can automatically vectorize every vectorizable loop of one function with:
pwdirectives --auto --simd omp --in-place canny.c
Multithreading and offloading actions are filtered by default. Use --include-tags to enable them:
pwreport --screening --include-tags all canny.c
You can focus on a specific optimization type by filtering by its tag (scalar, control, memory, vector, multi, offload), eg.:
pwreport --actions --include-tags scalar canny.c
1 file successfully analyzed and 0 failures in 154 ms
You can see that the first table provides high-level metrics of the analysis of your code. It reports how many lines of code it has and how many of them were successfully analyzed, how long the analysis took, how many human-readable actionable items were detected and an estimate of the cost, in both hours and money, of addressing all of them. Finally, the Profiling column shows “n/a”, meaning that no profiling information is available to Codee.
The second table provides a breakdown of the human-readable actions into the six stages of the performance optimization roadmap. The first three stages correspond to sequential optimizations that take advantage of faster scalar processing (Sequential Scalar), simpler control flow (Sequential Control Flow) and more efficient memory usage (Sequential Memory). The last three stages correspond to performance optimizations that each exploit a type of parallelism: vectorization, multithreading and offloading.
The suggestions provided at the bottom are key for the usage of Codee since they will provide you with hints on what your next steps could be.
Note the “n/a” in the Multi and Offload columns of the second table. This is because Codee disables multithreading and offloading actions by default in order to improve the out-of-the-box experience. We can enable them explicitly, as indicated in the SUGGESTIONS section, by using the flag --include-tags all.
$ pwreport --screening --include-tags all canny.c
SCREENING REPORT
Target  Lines of code Analyzed lines Analysis time # actions Effort Cost     Profiling
------- ------------- -------------- ------------- --------- ------ -------- ---------
canny.c 656           252            154 ms        97        348 h  11388€   n/a
------- ------------- -------------- ------------- --------- ------ -------- ---------
Total   656           252            154 ms        97        348 h  11388€   n/a
ACTIONS PER STAGE OF THE PERFORMANCE OPTIMIZATION ROADMAP
Target  Scalar Control Memory Vector Multi Offload
------- ------ ------- ------ ------ ----- -------
canny.c 22     45      8      15     3     3
------- ------ ------- ------ ------ ----- -------
Total   22     45      8      15     3     3
TOTAL NUMBER OF LOOPS FOR CODEE AUTO AND GUIDED MODES
                      --------------- # actions ----------------
Codee support # Loops Scalar Control Memory Vector Multi Offload
------------- ------- ------ ------- ------ ------ ----- -------
Auto          1       0      0       0      2      0     0
Guided        28      20     39      8      6      3     3
Develop       5       1      5       0      7      0     0
Roadmap       7       0      0       0      0      0     0
Target : analyzed directory or source code file
Lines of code : total lines of code found in the target (computed the same way as the sloccount tool)
Analyzed lines : relevant lines of code successfully analyzed
Analysis time : time required to analyze the target
# actions : total actionable items (opportunities, recommendations, defects and remarks) detected
Effort : estimated number of hours it would take to carry out all actions (scalar, control, memory, vector, multi and offload with 1, 2, 4, 8, 12 and 16 hours respectively)
Cost : estimated cost in euros to carry out all the actions, paying the average salary of 56,286€/year for a professional C/C++ developer working 1720 hours per year
Profiling : estimation of overall execution time required by this target
Codee support : Codee support level given for the loop:
- Auto: the loop can be automatically optimized by Codee auto mode (pwdirectives --auto)
- Guided: Codee assists the user in order to optimize the loop
- Develop: Codee lacks support for the loop, but the feature is under development
- Roadmap: Codee lacks support for the loop, and the feature is on the long term roadmap
SUGGESTIONS
Use --level 1|2 to get more details, e.g:
pwreport --level 2 --screening --include-tags all canny.c
You can specify multiple inputs which will be displayed as multiple rows (ie. targets) in the table, eg:
pwreport --screening some/other/dir --include-tags all canny.c
Use --actions to find out details about the detected actions:
pwreport --actions --include-tags all canny.c
You can automatically vectorize every vectorizable loop of one function with:
pwdirectives --auto --simd omp --in-place canny.c
You can focus on a specific optimization type by filtering by its tag (scalar, control, memory, vector, multi, offload), eg.:
pwreport --actions --include-tags scalar canny.c
1 file successfully analyzed and 0 failures in 154 ms
Now we can finally see all of the actions reported, including multithreading and offloading related actions.
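The Effort and Cost columns of both screening reports can be reproduced from the legend printed under the tables. The following C sketch does that arithmetic. The function names effort_hours and cost_eur are hypothetical, and rounding to the nearest euro is an assumption that happens to match the figures shown.

```c
#include <assert.h>

/* Roadmap stages in the order used by the report:
 * scalar, control, memory, vector, multi, offload. */
enum { NUM_STAGES = 6 };

/* Hours per action for each stage, as stated in the report legend. */
static const int HOURS_PER_ACTION[NUM_STAGES] = {1, 2, 4, 8, 12, 16};

/* Salary figures quoted in the report legend. */
static const double SALARY_EUR_PER_YEAR = 56286.0;
static const double WORK_HOURS_PER_YEAR = 1720.0;

/* Total effort in hours: sum over stages of (actions x hours per action). */
int effort_hours(const int actions[NUM_STAGES])
{
    int total = 0;
    for (int i = 0; i < NUM_STAGES; i++) {
        assert(actions[i] >= 0);
        total += actions[i] * HOURS_PER_ACTION[i];
    }
    return total;
}

/* Cost in euros at the hourly rate implied by the legend.
 * Rounding to the nearest euro is an assumption, not documented behavior. */
long cost_eur(int hours)
{
    assert(hours >= 0);
    double cost = hours * SALARY_EUR_PER_YEAR / WORK_HOURS_PER_YEAR;
    return (long)(cost + 0.5);
}
```

With the counts from the first report (22, 45, 8 and 15 actions, with multithreading and offloading filtered out) this gives 264 h and 8639€. Adding the 3 multithreading and 3 offloading actions gives 348 h and 11388€, matching the second report.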

The natural next step is to follow the SUGGESTIONS section and use the --actions analysis. Note that Codee requires the user to perform the --actions analysis over a specific target function or loop, and not over the entire source code, because otherwise the output would be too large to digest. It is always a good idea to start with the --actions analysis of the hotspot functions, as we will see in the next step.
3. Get the performance optimization report of the hotspot
From profiling, we know that the hotspot for this code corresponds to the gaussian_smooth function. You can narrow the analysis to that function by appending :gaussian_smooth to the filename. The pwreport invocation is as follows:
$ pwreport --actions --include-tags all canny.c:gaussian_smooth
ACTIONS REPORT
FUNCTION BEGIN at canny.c:gaussian_smooth:439:1
439: void gaussian_smooth(unsigned char *image, int rows, int cols, float sigma,
LOOP BEGIN at canny.c:gaussian_smooth:474:4 (support: guided)
474: for(r=0;r<rows;r++){
LOOP BEGIN at canny.c:gaussian_smooth:475:7 (support: guided)
475: for(c=0;c<cols;c++){
LOOP BEGIN at canny.c:gaussian_smooth:478:10 (support: guided)
478: for(cc=(-center);cc<=center;cc++){
[RMK012] canny.c:478:10 the vectorization cost model states the loop is not a SIMD opportunity because conditional execution renders vectorization inefficient
LOOP END
[PWR002] canny.c:442:18 'cc' not declared in the innermost scope possible
[PWR002] canny.c:447:10 'dot' not declared in the innermost scope possible
[PWR002] canny.c:448:10 'sum' not declared in the innermost scope possible
LOOP END
[PWR002] canny.c:442:11 'c' not declared in the innermost scope possible
[PWR002] canny.c:442:18 'cc' not declared in the innermost scope possible
[PWR002] canny.c:447:10 'dot' not declared in the innermost scope possible
[PWR002] canny.c:448:10 'sum' not declared in the innermost scope possible
[OPP001] canny.c:474:4 is a multi-threading opportunity
[OPP003] canny.c:474:4 is an offload opportunity
LOOP END
LOOP BEGIN at canny.c:gaussian_smooth:492:4 (support: guided)
492: for(c=0;c<cols;c++){
LOOP BEGIN at canny.c:gaussian_smooth:493:7 (support: guided)
493: for(r=0;r<rows;r++){
LOOP BEGIN at canny.c:gaussian_smooth:496:10 (support: guided)
496: for(rr=(-center);rr<=center;rr++){
[PWR034] canny.c:496:10 avoid strided array access for variable 'tempim' to improve performance
[RMK010] canny.c:496:10 the vectorization cost model states the loop is not a SIMD opportunity due to strided memory accesses in the loop body
LOOP END
[PWR002] canny.c:442:14 'rr' not declared in the innermost scope possible
[PWR002] canny.c:447:10 'dot' not declared in the innermost scope possible
[PWR002] canny.c:448:10 'sum' not declared in the innermost scope possible
[PWR034] canny.c:493:7 avoid strided array access for variable 'smoothedim' to improve performance
LOOP END
[PWR002] canny.c:442:8 'r' not declared in the innermost scope possible
[PWR002] canny.c:442:14 'rr' not declared in the innermost scope possible
[PWR002] canny.c:447:10 'dot' not declared in the innermost scope possible
[PWR002] canny.c:448:10 'sum' not declared in the innermost scope possible
[OPP001] canny.c:492:4 is a multi-threading opportunity
[OPP003] canny.c:492:4 is an offload opportunity
LOOP END
FUNCTION END
METRICS SUMMARY
Total recommendations: 16
Total opportunities: 4
Total defects: 0
Total remarks: 2
SUGGESTIONS
Use --level 1|2 to get more details, e.g:
pwreport --level 2 --actions --include-tags all canny.c:gaussian_smooth
16 recommendations were found in your code, get more information with pwreport:
pwreport --actions --include-tags pwr canny.c:gaussian_smooth
4 opportunities for parallelization were found in your code, get more information with pwloops:
pwloops --include-tags all canny.c:gaussian_smooth
More details on the defects, recommendations and more in the Knowledge Base:
Knowledge base
1 file successfully analyzed and 0 failures in 50 ms
The hotspot analysis succeeds and produces a report with the following sections:
- ACTIONS REPORT: structured report with actionable insights per function and loop.
- CODE COVERAGE: summary of how much code could be analyzed.
- METRICS SUMMARY: aggregated summary of the actionable insights detected in the analysis.
- SUGGESTIONS: general Codee usage hints.
The CODE COVERAGE section (not reproduced in the listing above) shows that all the code was successfully analyzed, and the METRICS SUMMARY shows the different actionable insights detected. As hinted in the SUGGESTIONS section at the end, you can add --level to increase the level of detail of the ACTIONS REPORT.
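To make the PWR034 recommendation in the report more concrete, here is a minimal C sketch of a column-wise smoothing pass written two ways. It is not the code from canny.c: the function names, the float output and the simplified loop body are assumptions for illustration, modeled only on the loop headers and variable names that appear in the report. The first version mirrors the reported loop order (c outer, r inner), so consecutive inner iterations write elements that are cols apart. The second interchanges the two outer loops so that consecutive inner iterations touch neighboring elements.

```c
#include <assert.h>
#include <stddef.h>

/* Loop order as in the report: 'c' outer, 'r' inner.
 * Consecutive inner iterations write out[r*cols + c], 'cols' elements apart. */
void blur_columns_strided(const float *in, float *out, int rows, int cols,
                          const float *kernel, int center)
{
    assert(rows > 0 && cols > 0 && center >= 0);
    for (int c = 0; c < cols; c++) {
        for (int r = 0; r < rows; r++) {
            float dot = 0.0f, sum = 0.0f;
            for (int rr = -center; rr <= center; rr++) {
                if (r + rr >= 0 && r + rr < rows) {
                    dot += in[(size_t)(r + rr) * cols + c] * kernel[center + rr];
                    sum += kernel[center + rr];
                }
            }
            out[(size_t)r * cols + c] = dot / sum;
        }
    }
}

/* Same computation with the two outer loops interchanged: 'r' outer, 'c' inner.
 * Consecutive inner iterations now write adjacent elements of 'out'. */
void blur_columns_contiguous(const float *in, float *out, int rows, int cols,
                             const float *kernel, int center)
{
    assert(rows > 0 && cols > 0 && center >= 0);
    for (int r = 0; r < rows; r++) {
        for (int c = 0; c < cols; c++) {
            float dot = 0.0f, sum = 0.0f;
            for (int rr = -center; rr <= center; rr++) {
                if (r + rr >= 0 && r + rr < rows) {
                    dot += in[(size_t)(r + rr) * cols + c] * kernel[center + rr];
                    sum += kernel[center + rr];
                }
            }
            out[(size_t)r * cols + c] = dot / sum;
        }
    }
}
```

Both functions compute the same values for every output element; only the traversal order differs. The innermost loop over rr still steps through the input one row at a time, so this interchange only addresses the strided writes. Whether it pays off on a given machine is something to measure rather than assume.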
4. Dig deeper into the actionable insights for your hotspot
By invoking pwreport with the --level 2 flag, we get a very verbose report. In order to narrow the analysis, we will invoke pwreport for the target loop at line 474. The output is as follows:
$ pwreport --actions --level 2 --include-tags all canny.c:474
ACTIONS REPORT
. . .
[OPP001] canny.c:492:4 is a multi-threading opportunity
Compute patterns:
- 'forall' over the variable 'smoothedim'
SUGGESTION: use pwloops to get more details or pwdirectives to generate directives:
pwloops canny.c:gaussian_smooth:492:4 --include-tags all
pwdirectives --omp multi canny.c:gaussian_smooth:492:4 --in-place
More information on: https://www.codee.com/knowledge/opp001
. . .
You can see suggestions on how to use other Codee tools: use pwloops to get detailed information at the loop level, or pwdirectives to rewrite the source code and create a performance-optimized version automatically.

In particular, for the hotspot loop of Canny, Codee reports the OPP001 opportunity, which indicates that you can improve the performance of the loop by applying multithreading to it. It also points out that you can use pwdirectives to rewrite the source code of the loop automatically. Let’s see how to do it.
5. Optimize the performance of your hotspot
Let’s use pwdirectives to add multithreading. First, build and run canny to see how long the sequential version takes to execute. You can use the Makefile to do so:
$ make
rm -fr canny testvecs
cc canny.c -fopenmp -O3 -lm -o canny
unzip ../15360_8640.zip
Archive: ../15360_8640.zip
creating: testvecs/
creating: testvecs/input/
inflating: testvecs/input/15360_8640.pgm
./canny testvecs/input/15360_8640.pgm 0.5 0.7 0.9
Total time: 11.688
Now copy the command suggested by pwreport, targeting the loop at line 474. Note that --in-place modifies the file; you can use -o canny_omp.c instead to create a new file:
$ pwdirectives --omp multi canny.c:gaussian_smooth:474:4 --in-place
Results for file 'canny.c':
Successfully parallelized loop at 'canny.c:gaussian_smooth:474:4' [using multi-threading]:
[INFO] canny.c:474:4 Parallel forall: variable 'tempim'
[INFO] canny.c:474:4 Loop parallelized with multithreading using OpenMP directive 'for'
[INFO] canny.c:474:4 Parallel region defined by OpenMP directive 'parallel'
[INFO] canny.c:474:4 Make sure there is no aliasing among variables: kernel, tempim
Successfully updated canny.c
Minimum software stack requirements: OpenMP version 3.0 with multithreading capabilities
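The output above names the OpenMP directives 'parallel' and 'for', but the post does not show the rewritten source. As a rough idea of what a loop nest annotated this way can look like, here is a minimal C sketch of a row-wise smoothing pass. It is not the code that pwdirectives generates: the function name blur_rows_omp, the exact loop body and the boundary check are assumptions, modeled on the loop headers and variable names (image, tempim, kernel, dot, sum) that appear in the reports.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical row-wise smoothing pass parallelized over rows.
 * The caller must pass non-overlapping buffers: the tool output asks to
 * make sure there is no aliasing between 'kernel' and 'tempim'. */
void blur_rows_omp(const unsigned char *image, float *tempim, int rows, int cols,
                   const float *kernel, int center)
{
    assert(rows > 0 && cols > 0 && center >= 0);

    /* 'parallel' creates the team of threads. */
    #pragma omp parallel
    {
        /* 'for' distributes the iterations of the outermost loop among them.
         * Each row of 'tempim' is written by exactly one iteration. */
        #pragma omp for
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                /* Declared in the innermost scope possible (see PWR002),
                 * which also makes them private to each thread. */
                float dot = 0.0f;
                float sum = 0.0f;
                for (int cc = -center; cc <= center; cc++) {
                    /* Skip kernel taps that fall outside the image row. */
                    if (c + cc >= 0 && c + cc < cols) {
                        dot += (float)image[(size_t)r * cols + (c + cc)] * kernel[center + cc];
                        sum += kernel[center + cc];
                    }
                }
                tempim[(size_t)r * cols + c] = dot / sum;
            }
        }
    }
}
```

Compiled with -fopenmp, as in the Makefile output shown earlier, the rows are processed by several threads. Without that flag the pragmas are ignored and the function runs sequentially with the same result.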
Note that the hotspot function gaussian_smooth contains two groups of nested loops, and we have just optimized the first one. To optimize the second group as well, run pwreport --actions once more to see that another multithreading opportunity is available:
$ pwreport --actions --level 2 --include-tags all canny.c:gaussian_smooth
ACTIONS REPORT
FUNCTION BEGIN at canny.c:gaussian_smooth:439:1
439: void gaussian_smooth(unsigned char *image, int rows, int cols, float sigma,
. . .
LOOP BEGIN at canny.c:gaussian_smooth:496:4 (difficulty: low)
. . .
[OPP001] canny.c:496:4 is a multi-threading opportunity
Compute patterns:
- 'forall' over the variable 'smoothedim'
SUGGESTION: use pwloops to get more details or pwdirectives to generate directives:
pwloops canny.c:gaussian_smooth:496:4 --include-tags all
pwdirectives --omp multi canny.c:gaussian_smooth:496:4 --in-place
More information on: https://www.codee.com/knowledge/opp001
. . .
LOOP END
. . .
FUNCTION END
Follow the suggestion again to do the same for that second group of loops:
$ pwdirectives --omp multi canny.c:gaussian_smooth:496:4 --in-place
Results for file 'canny.c':
Successfully parallelized loop at 'canny.c:gaussian_smooth:496:4' [using multi-threading]:
[INFO] canny.c:496:4 Parallel forall: variable 'smoothedim'
[INFO] canny.c:496:4 Loop parallelized with multithreading using OpenMP directive 'for'
[INFO] canny.c:496:4 Parallel region defined by OpenMP directive 'parallel'
[INFO] canny.c:496:4 Make sure there is no aliasing among variables: tempim, kernel
Successfully updated canny.c
Minimum software stack requirements: OpenMP version 3.0 with multithreading capabilities
Build and run again to compare the performance:
$ make
rm -fr canny testvecs
cc canny.c -fopenmp -O3 -lm -o canny
unzip ../15360_8640.zip
Archive: ../15360_8640.zip
creating: testvecs/
creating: testvecs/input/
inflating: testvecs/input/15360_8640.pgm
./canny testvecs/input/15360_8640.pgm 0.5 0.7 0.9
Total time: 5.942
On a laptop equipped with an AMD Ryzen 7 4800HS CPU (8 cores, 16 threads), the execution went from 11.7 to just 5.9 seconds: almost a 2x speedup!
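For readers who want to redo the arithmetic on their own timings, here is a small C sketch. The helper names speedup and parallel_efficiency are hypothetical, and relating the speedup to the number of cores is an addition for illustration, not a figure reported by Codee.

```c
#include <assert.h>

/* Speedup: how many times faster the optimized run is than the baseline. */
double speedup(double baseline_seconds, double optimized_seconds)
{
    assert(baseline_seconds > 0.0 && optimized_seconds > 0.0);
    return baseline_seconds / optimized_seconds;
}

/* Parallel efficiency: fraction of the ideal speedup obtained
 * when 'workers' cores (or threads) are available. */
double parallel_efficiency(double measured_speedup, int workers)
{
    assert(workers > 0);
    return measured_speedup / workers;
}
```

With the timings above, speedup(11.688, 5.942) is about 1.97, and the efficiency relative to 8 cores is roughly 0.25. One plausible reason is that only the two loop nests in gaussian_smooth were parallelized, while the rest of the program still runs sequentially.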
