Two Studies in Compiler Optimisations

An in-depth look at how modern compilers like LLVM optimize code through internal passes, using modular increment as a case study to demonstrate the impact of the C++23 'assume' attribute.

Two studies in compiler optimisations

20 Mar 2026

Introduction

While many performance-oriented programmers are intimately acquainted with the almost preternatural ability of modern compilers to optimise their code, and many of us have spent countless hours on Compiler Explorer examining the differences between the Assembly generated by different versions of gcc and clang, most have likely not looked under the hood to see how the magic happens. It is a testament to their quality that most of us simply treat compilers as black boxes: more or less readable code goes in, fast binaries come out. Sometimes, however, seemingly innocuous changes—perhaps even meant to help the compiler—can cause surprising performance issues which we are hard-pressed to explain without a deeper understanding of the underlying machinery.

In this post we’ll dive into the implementation of some of the LLVM optimisation passes using two simple examples which, nonetheless, will help us pull back the veil on the complexity involved in producing highly-optimised code. We will see how small source changes can trigger different paths in the compiler’s internal processing with unexpected consequences, demonstrating how achieving high performance can be as much an art as it is a science for both compiler developers and users.

Case 1: Modular increment

The scenario

Consider the following C++ function to get the next index into an array or vector of elements accessed in a round-robin fashion, with cur being the current index and count the number of elements:

unsigned next_naive(unsigned cur, unsigned count)
{
    return (cur + 1) % count;
}

As written, this code requires an expensive 32-bit division instruction (6 cycles with 12-cycle latency on an Intel IceLake core, worse in previous generations). There are, of course, numerous tricks to replace divisions by constants with cheaper arithmetic operations—powers of two being the best-known case—but since here count is a dynamic runtime value the compiler cannot help us.

However, we know something the compiler doesn’t: because cur is an index, it must always be strictly smaller than count. We can take advantage of this to replace the division with a much cheaper conditional move by writing:

unsigned next_cmov(unsigned cur, unsigned count)
{
    auto n = cur + 1;
    return n == count ? 0 : n;
}

Could the compiler have done this transformation on its own? Let’s try out the new C++23 assume attribute:

unsigned next_assume(unsigned cur, unsigned count)
{
    [[assume(cur < count)]];
    return (cur + 1) % count;
}

Exactly the same code as for next_cmov()! So how did clang figure it out?

Peephole optimisations

When trying to understand where a certain optimisation comes from, the first step is to identify the pass where it occurs. Currently, the most convenient way to investigate the effects of the different LLVM optimisation passes is to use Compiler Explorer’s Opt Pipeline view.

When looking through the evolution of the LLVM IR for next_assume(), we quickly find that the transformation from a urem instruction (representing the modulo operation) to an icmp + select pair (which is eventually translated into an x86 cmp + cmov) happens in InstCombine.

InstCombine is responsible for many of the peephole optimisations that do not modify the control flow graph. It proceeds by visiting all the instructions in a function and matching them to one of many hundreds of patterns using a generic pattern matching mechanism.

In this case, the main transformation logic is found in InstCombineMulDivRem.cpp. The code matches the pattern where it can prove that the pre-increment value is less than the modulus. To make this guarantee, the compiler reuses the same analysis used to fold the result of icmp instructions, specifically checking assumptions applicable to the comparison via llvm.assume().

Exercise 1: If we try to be clever and rewrite our function using bitwise operations to avoid the potential branch, InstCombine sees right through our trickery and produces the exact same output:

unsigned next_bitwise(unsigned cur, unsigned count)
{
    auto n = cur + 1;
    return n & -(n != count);
}

Source: Hacker News

Two Studies in Compiler Optimisations