Vectorization is a powerful technique for achieving peak computational performance. However, not all code is easily vectorizable by all compilers. In this post we talk about vectorizing complex loops that compilers do not vectorize automatically. The idea is to split the loop into two loops: one for the vectorizable part and one for the non-vectorizable part. If the vectorizable part is computation-heavy, the performance improvements that come with vectorization should “hide” the fact that we iterate over the same dataset twice.
Vectorization
The idea behind vectorization is to speed up the computation by working on several pieces of data instead of one. For instance, modern x86-64 CPUs with Advanced Vector Extensions can process 4 doubles in one instruction. Using vector instructions can significantly increase the speed of your code.
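For illustration, here is a minimal sketch (the function name and the use of restrict are our own additions, not part of the original post) of a loop that compilers typically vectorize automatically when compiled with optimizations and AVX enabled:

void scale(double *restrict out, const double *restrict in, int n, double k) {
    /* No conditionals, sequential access, no loop-carried dependence:
       with AVX the compiler can multiply 4 doubles per instruction. */
    for (int i = 0; i < n; i++) {
        out[i] = in[i] * k;
    }
}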
Loop vectorization works best if there are no conditional statements (e.g. if, switch) in the loop body. Unfortunately, not all loops are free of conditionals, and often they cannot be avoided. Depending on the type of condition, this can either prevent vectorization or make it inefficient.
There are many reasons why compilers don’t vectorize loops. In this post we present two types of non-vectorizable loops due to conditionals, but generally, any type of loop that has a lot of computations in it and the compiler doesn’t vectorize automatically can profit from the information we are about to present.
Non-vectorizable loops due to conditionals
Here we give an example of two loops that can be vectorized with creative use of vectorization intrinsics; however, most compilers don’t vectorize them automatically. The loops in question are a loop with a conditional break statement in its body and a search loop. Here is the source code of a loop with a conditional break in the loop body:
double break_loop_naive(double in[], int n, double flag) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        if (in[i] == flag) {
            break;
        }
        sum += calculate(in[i]);
    }
    return sum;
}
The loop iterates over the input array in and calls calculate on each element of the input. The returned value is then added to the accumulator sum. If, at some point, the value of the currently processed element is equal to flag, the computation stops.

We deliberately omitted the source code of the function calculate(), because, as you will see later, the performance of the transformation depends on its arithmetic intensity. Here we define arithmetic intensity as the ratio between the number of arithmetic operations and the number of bytes fetched from memory (you can read more about arithmetic intensity in our previous post about the roofline model).
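For reference, the high-arithmetic-intensity case measured later in this post computes cos(sqrt(fabs(in))). A possible definition of calculate() along those lines (our own sketch; the exact code in the repository may differ) is:

#include <math.h>

/* Sketch of an arithmetically intensive calculate(): a single input value
   goes through several expensive math operations (fabs, sqrt, cos). */
static inline double calculate(double x) {
    return cos(sqrt(fabs(x)));
}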
Here is another example of a non-vectorizable loop:
double search_loop_naive(double in[], int n, int* out_max_index) {
    int index = -1;
    double max = 0.0;
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        double r = calculate(in[i]);
        if (r > max) {
            max = r;
            index = i;
        }
        sum += r;
    }
    *out_max_index = index;
    return sum;
}
The loop is similar to the previous one. It also iterates over the input array in and calls calculate on each element of the input. The returned value is added to the accumulator sum. Additionally, it finds the element of the array for which calculate(in[i]) is largest.
The obstacle to vectorization of the two loops is the conditional statement in the loop body. In the first loop, the loop trip count is not known in advance, since the loop can be interrupted at any time by the break. The second loop should in principle be vectorizable, since it follows a scalar reduction pattern with the MAX operator, but CLANG and GCC don’t know how to vectorize this pattern.
Loop fission
If the calculate function is arithmetically intensive, it would profit greatly from vectorization, but the conditional statements in the loop bodies prevent it. The idea is to perform loop fission, i.e. to split the loop into two loops: the first loop performs the non-vectorizable part and the second loop performs the vectorizable part of the computation.
The transformation itself is not difficult. Here is the code of the loop with the break statement before and after loop fission:
double break_loop_naive(double in[], int n, double flag) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        if (in[i] == flag) {
            break;
        }
        sum += calculate(in[i]);
    }
    return sum;
}

double break_loop_split(double in[], int n, double flag) {
    double sum = 0.0;
    int m;
    for (m = 0; m < n; m++) {
        if (in[m] == flag) {
            break;
        }
    }
    for (int i = 0; i < m; i++) {
        sum += calculate(in[i]);
    }
    return sum;
}
The loop fission is straightforward. The first loop finds the number of iterations for the second loop by iterating through the input array until it finds the flag value. The second loop performs the calculation. Ideally, the compiler should be able to vectorize the second loop, since it is trivially vectorizable.
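If a particular compiler still refuses to vectorize the second loop, one portable way to nudge it is an OpenMP SIMD reduction hint. The sketch below is our own and assumes calculate() is inlinable (or marked with #pragma omp declare simd); compile with -fopenmp-simd on GCC or CLANG for the pragma to take effect:

double break_loop_split_simd(double in[], int n, double flag) {
    double sum = 0.0;
    int m;
    /* Non-vectorizable part: find the trip count. */
    for (m = 0; m < n; m++) {
        if (in[m] == flag) {
            break;
        }
    }
    /* Vectorizable part: explicitly request a SIMD reduction. */
    #pragma omp simd reduction(+:sum)
    for (int i = 0; i < m; i++) {
        sum += calculate(in[i]);
    }
    return sum;
}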
Here is the code of the search loop before and after the loop fission:
double search_loop_naive(double in[], int n, int* out_max_index) {
    int index = -1;
    double max = 0.0;
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        double r = calculate(in[i]);
        if (r > max) {
            max = r;
            index = i;
        }
        sum += r;
    }
    *out_max_index = index;
    return sum;
}

double search_loop_split(double in[], int n, int* out_max_index) {
    int index = -1;
    double max = 0.0;
    double sum = 0.0;
    double* tmp = malloc(sizeof(double) * n);
    for (int i = 0; i < n; i++) {
        double r = calculate(in[i]);
        tmp[i] = r;
        sum += r;
    }
    for (int i = 0; i < n; i++) {
        if (tmp[i] > max) {
            max = tmp[i];
            index = i;
        }
    }
    free(tmp);
    *out_max_index = index;
    return sum;
}
In this case, the loop fission looks more complicated. The first loop stores the result of the calculation in a temporary array tmp, which the second loop then uses to find the element with the largest value of calculate(in[i]). In this case we expect the compiler to vectorize the first loop.
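As a quick sanity check, here is a small driver (our own sketch; the input initialization and problem size are arbitrary) that runs both versions and verifies that they find the same maximum element:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const int n = 1000000;                      /* arbitrary problem size */
    double *in = malloc(sizeof(double) * n);
    if (!in) return 1;
    for (int i = 0; i < n; i++) {
        in[i] = (double)(i % 1000) / 1000.0;    /* arbitrary input data */
    }

    int idx_naive, idx_split;
    double sum_naive = search_loop_naive(in, n, &idx_naive);
    double sum_split = search_loop_split(in, n, &idx_split);

    printf("naive: sum=%f, max at index %d\n", sum_naive, idx_naive);
    printf("split: sum=%f, max at index %d\n", sum_split, idx_split);

    free(in);
    return (idx_naive == idx_split) ? 0 : 1;
}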
A few words about loop fission
Loop fission isn’t a solution to every vectorization problem, and you need to pay attention to a few details in order to get it right:
- Loop fission typically puts more pressure on the memory subsystem: splitting the loop into two often means that the same data is read from memory more than once. Because of this, when the code is distributed across multiple CPU cores, the obtained speedup factor might be smaller than for the unfissioned loop; however, the fissioned loop will probably still execute faster (a blocked variant that mitigates the extra memory traffic is sketched after this list).
- The memory access pattern is important for loop fission: loops with an inefficient memory access pattern are bad candidates for the same reason as in the previous bullet, since an inefficient access pattern puts even more pressure on the memory subsystem. You can expect the most improvement from loops with a sequential access pattern; other access patterns will probably not yield performance improvements.
- Loop fission makes sense for non-vectorizable loops with high arithmetic intensity. High arithmetic intensity means that the code performs relatively expensive mathematical operations (divisions, modulo, square roots, trigonometric functions, logarithms) on relatively few input values.
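To illustrate the first point, the extra memory traffic of the fissioned search loop can be reduced by blocking: instead of storing all n intermediate results, the loop is fissioned within fixed-size chunks so the temporary buffer stays cache-resident. The following is our own sketch, not code from the original measurements; the block size is arbitrary and the same calculate() as before is assumed:

#define BLOCK 4096   /* arbitrary block size; small enough to stay in cache */

double search_loop_blocked(double in[], int n, int *out_max_index) {
    int index = -1;
    double max = 0.0;
    double sum = 0.0;
    double tmp[BLOCK];   /* small scratch buffer, reused for every block */
    for (int start = 0; start < n; start += BLOCK) {
        int end = (start + BLOCK < n) ? start + BLOCK : n;
        /* Vectorizable part of the block: no conditionals. */
        for (int i = start; i < end; i++) {
            double r = calculate(in[i]);
            tmp[i - start] = r;
            sum += r;
        }
        /* Non-vectorizable part of the block: scan for the maximum. */
        for (int i = start; i < end; i++) {
            if (tmp[i - start] > max) {
                max = tmp[i - start];
                index = i;
            }
        }
    }
    *out_max_index = index;
    return sum;
}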

Measurements
We performed measurements using three different compilers (GCC 9.3, CLANG 10 and ICC 19.1) [1]. We compared the baseline with the fissioned loop on an input of 100 million doubles. The source code is available in our repository.
Measurements for the loop with a break in its body
Here are the results when the calculate function computes cos(sqrt(fabs(in))) (high arithmetic intensity) for the loop with a conditional break in the loop body:
Compiler | Baseline | With loop fission | Speedup
---|---|---|---
GCC 9.3 | 3.17 s | 0.34 s | 9.3x
CLANG 10 | 3.17 s | 2.71 s | 1.2x
ICC 19.1 | 1.24 s | 0.3 s | 4.1x
GCC and ICC profit a lot from loop fission: the speedup is 9x with GCC and 4x with ICC. With CLANG, the improvement is marginal, but the reason is that CLANG’s math library implementation of cos is inefficient compared to Intel’s (SVML) and GCC’s (libmvec) [2]. After the loop fission, all three compilers vectorized the arithmetically intensive loop, but CLANG didn’t do it efficiently.
What happens when the arithmetic intensity is low? We measured the performance when the calculate function computes fabs(in). This is a simple mathematical operation. Here are the results:
Compiler | Baseline | With loop fission | Speedup
---|---|---|---
GCC 9.3 | 0.11 s | 0.12 s | 0.9x
CLANG 10 | 0.11 s | 0.10 s | 1.1x
ICC 19.1 | 0.11 s | 0.09 s | 1.2x
When the arithmetic intensity of the vectorized part is low, the loop fission doesn’t necessarily pay off.
Measurements for the search loop
Here are the results when the calculate function computes cos(sqrt(fabs(in))) (high arithmetic intensity) for the search-and-add loop:
Compiler | Baseline | With loop fission | Speedup
---|---|---|---
GCC 9.3 | 3.26 s | 0.58 s | 5.6x
CLANG 10 | 3.24 s | 2.93 s | 1.1x
ICC 19.1 | 0.32 s | 0.59 s | 0.5x
In this case, GCC gets a large speedup of over 5x, and there is a very modest speed increase with CLANG. With ICC, the fissioned loop is almost two times slower: ICC managed to vectorize the baseline loop, which results in very good baseline performance. This is a clear example of an advanced compiler vectorizing what a typical compiler doesn’t.
When the arithmetic intensity is low, the numbers look different. When the calculate function computes fabs(in), here are the results:
Compiler | Baseline | With loop fission | Speedup
---|---|---|---
GCC 9.3 | 0.11 s | 0.34 s | 0.3x
CLANG 10 | 0.08 s | 0.32 s | 0.3x
ICC 19.1 | 0.04 s | 0.28 s | 0.1x
In all cases, the version with loop fission is slower.
Summary
Loop fission is a good approach to vectorizing loops that the compiler finds difficult to vectorize automatically. It is applicable to loops with high arithmetic intensity and a sequential memory access pattern. If those two conditions are met, you can expect great performance improvements from the fission.

[1] All the tests were executed on an AMD Ryzen 7 4800H CPU with 16 cores and 16 GB of RAM on Ubuntu 20.04. We disabled processor frequency scaling in order to decrease runtime variance.
[2] This has been fixed in CLANG 12 with the -fveclib=libmvec switch, which allows using GCC’s vector math library with CLANG.