Idiomatic Koru Kernels Match Hand-Specialized C

A performance benchmark reveals that idiomatic Koru kernels can match the speed of hand-optimized C code by providing rich semantic information to the compiler, effectively closing the expertise gap.

We wanted Koru kernels to land in the same ballpark as idiomatic C, Rust, and Zig.

The result was stronger than that.

Our fused n-body kernel, written in straightforward Koru kernel style, came in faster than the plain reference implementations. Every implementation here is “naive” – the obvious, idiomatic version a competent programmer would write in each language. No tricks, no hand-tuning, no -ffast-math.

| Implementation | Relative time (lower is faster) |
|---|---|
| Koru (kernel fused) | 1.00 |
| C | 1.14 |
| Zig | 1.14 |
| Rust | 1.17 |
| SBCL (kernel DSL) | 1.88 |
| SBCL (Common Lisp) | 2.03 |
| GHC (Haskell) | 2.58 |

Each run performs 50 million iterations. Timings come from hyperfine, with 3 warmup runs followed by multiple timed runs, all on the same machine with the same benchmark harness. All source code is available on GitHub.

A note on numbers: these are ballpark figures. Exact ratios shift slightly between benchmark runs depending on system load, thermal state, and scheduling. The relative ordering is stable – we’ve run this many times – but don’t read too much into the second decimal place.

That was already interesting. But it raised the obvious question: is Koru really producing unusually strong code here, or is the reference C just not shaped the way the optimizer wants?

So we followed up the honest way.

The Follow-Up

We wrote a fixed-size scalarized C version. No generic array-of-struct loop. No “for any N” shape. Just the exact five-body problem written in a form that mirrors what a compiler would like to see after aggressive specialization.

We compiled it twice: once with -ffast-math, once without. Here’s the full picture with the optimized variants included:

| Implementation | Relative time (lower is faster) |
|---|---|
| C (scalarized, no fast-math) | 1.00 |
| Koru (kernel fused) | 1.00 |
| C (scalarized, fast-math) | 1.00 |
| C | 1.12 |
| Zig | 1.14 |
| Rust | 1.15 |
| SBCL (kernel DSL) | 1.88 |
| SBCL (Common Lisp) | 2.03 |
| GHC (Haskell) | 2.58 |

The hand-specialized C closed the gap completely – and barely edged ahead.

That does not weaken the result. It sharpens it.

Why This Is Stronger, Not Weaker

The point was never “C is slow.” C is obviously not slow. And we would be worried if we couldn’t get hand-optimized C to run faster – that would mean the benchmark was broken, not that Koru was magic.

The point is that Koru kernel code is carrying enough semantic information that the compiler can lower it into code that is within 1% of expert hand-specialized C, without forcing the programmer to write the hand-specialized C in the first place.

That is a much stronger statement than “Koru beats C on one benchmark.”

The plain C reference already had restrict in the hot path. That means the programmer was already supplying aliasing knowledge manually. In other words, the C version was already asking the human to act as part-time optimizer.
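To make that promise concrete, here is a minimal sketch that is separate from the benchmark code. The function `scale_add` and its parameters are hypothetical, chosen only to illustrate what `restrict` tells the compiler.

```c
#include <stddef.h>

/* Hypothetical helper, not taken from the benchmark.
 * `restrict` is a promise made by the programmer: during this call, the
 * memory written through `out` is not also reached through `in`.
 * With that promise the compiler may keep loaded values in registers or
 * vectorize the loop without re-reading `in` after every store to `out`.
 * The promise is not checked; breaking it is undefined behavior. */
void scale_add(double *restrict out, const double *restrict in,
               double k, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] += k * in[i];
    }
}
```

Nothing in the type system enforces the no-overlap rule. The caller has to know it holds, which is the kind of manual knowledge transfer the paragraph above describes.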

This was the original hot loop shape:

static inline void advance(struct body * restrict b, double dt) {
    for (int i = 0; i < 5; i++) {
        for (int j = i + 1; j < 5; j++) {
            double dx = b[i].x - b[j].x;
            double dy = b[i].y - b[j].y;
            double dz = b[i].z - b[j].z;
            double dsq = dx * dx + dy * dy + dz * dz;
            double mag = dt / (dsq * sqrt(dsq));
            b[i].vx -= dx * b[j].mass * mag;
            b[i].vy -= dy * b[j].mass * mag;
            b[i].vz -= dz * b[j].mass * mag;
            b[j].vx += dx * b[i].mass * mag;
            b[j].vy += dy * b[i].mass * mag;
            b[j].vz += dz * b[i].mass * mag;
        }
    }
    for (int i = 0; i < 5; i++) {
        b[i].x += dt * b[i].vx;
        b[i].y += dt * b[i].vy;
        b[i].z += dt * b[i].vz;
    }
}

That is already good C. It is “naive” only in the sense used above: the obvious program, not a careless one. It uses restrict. It uses fixed loop bounds. It uses the standard dsq * sqrt(dsq) trick. It keeps the data in a global fixed array.
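For readers who have not met that trick, the arithmetic behind the `mag` line is short. This is a sketch that assumes Newtonian gravity with the gravitational constant folded into the mass units. The loop is consistent with that reading, since no explicit constant appears, but the article does not spell it out.

$$
\mathit{dsq} = dx^2 + dy^2 + dz^2,
\qquad
\lVert\mathbf{d}\rVert^{3} = \mathit{dsq}^{3/2} = \mathit{dsq}\,\sqrt{\mathit{dsq}},
\qquad
\mathit{mag} = \frac{\mathit{dt}}{\lVert\mathbf{d}\rVert^{3}}
$$

$$
\Delta\mathbf{v}_i = -\,m_j\,\mathit{mag}\,\mathbf{d},
\qquad
\Delta\mathbf{v}_j = +\,m_i\,\mathit{mag}\,\mathbf{d},
\qquad
\mathbf{d} = (dx,\,dy,\,dz)
$$

So the reference loop gets the inverse-cube factor from one square root, one multiply, and one divide per pair, with no call to a general power function.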

And it was still slower than the fused Koru kernel.

So what did the C version need in order to catch up?

What We Had To Do In C

We had to stop writing “normal benchmark reference C” and start writing something much closer to a manual lowering of the optimized shape.

These were the key changes:

- scalarize each body into separate locals instead of indexing through b[i]
- make the problem fixed-size in the source, not just in the loop bounds
- spell out all ten pair interactions explicitly
- keep masses as separate constants
- update positions in one fused straight-line block

This is the kind of helper the specialized version uses:

static inline void advance_pair(
    double xi, double yi, double zi,
    double xj, double yj, double zj,
    double mi, double mj,
    double *restrict vxi, double *restrict vyi, double *restrict vzi,
    double *restrict vxj, double *restrict vyj, double *restrict vzj
) {
    const double dx = xi - xj;
    const double dy = yi - yj;
    const double dz = zi - zj;
    const double dsq = dx * dx + dy * dy + dz * dz;
    const double mag = DT / (dsq * sqrt(dsq));
    *vxi -= dx * mj * mag;
    *vyi -= dy * mj * mag;
    *vzi -= dz * mj * mag;
    *vxj += dx * mi * mag;
    *vyj += dy * mi * mag;
    *vzj += dz * mi * mag;
}

And then the timestep becomes this:

advance_pair(x0, y0, z0, x1, y1, z1, m0, m1, &vx0, &vy0, &vz0, &vx1, &vy1, &vz1);
// ... 9 more pairs ...
x0 += DT * vx0; y0 += DT * vy0; z0 += DT * vz0;
// ... 4 more bodies ...

That is the version that matched Koru.

To be clear: this is valid, good C. But it is no longer just “write the obvious program.” It is a human taking the semantic problem and manually reshaping it into a form the optimizer can exploit more aggressively.

That is the expertise gap Koru is trying to close.

Koru kernels make those facts part of the model:

- the data shape is explicit
- the relationships between elements are explicit
- pairwise interaction is explicit
- per-element update is explicit
- aliasing constraints are implied by the kernel abstraction itself

| kernel k |>
std.kernel:step(0..iterations)
|> std.kernel:pairwise {
    const dx = k.x - k.other.x;
    // ... logic ...
}
|> std.kernel:self {
    k.x += DT * k.vx;
    // ... logic ...
}

The matching C ends up looking like a manual lowering pass. Ten explicit pair updates. Scalarized locals for every body. A shape that is closer to compiler IR than to the original problem statement.

That is exactly the point of Koru kernels.

The fast-math Surprise

An interesting detail: the scalarized C compiled without -ffast-math was consistently the fastest. Not by much, but consistently.

This matters because -ffast-math lets the C compiler reorder floating-point operations freely. For this particular workload, the IEEE-compliant evaluation order turned out to be better for the pipeline than whatever reordering -ffast-math chose.
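Reordering needs a flag in the first place because floating-point addition is not associative, so regrouping can change the result. A minimal sketch follows; the helper names `sum_left` and `sum_right` are hypothetical and exist only for this illustration.

```c
/* Hypothetical helpers, named for this sketch only.
 * Both add the same three values; only the grouping differs. */
double sum_left(double a, double b, double c) {
    return (a + b) + c; /* a and b are combined first */
}

double sum_right(double a, double b, double c) {
    return a + (b + c); /* b and c are combined first */
}
```

With `a = 1e17`, `b = -1e17`, and `c = 1.0` under strict IEEE double arithmetic, `sum_left` returns 1.0 while `sum_right` returns 0.0, because the 1.0 is rounded away when added to `-1e17`. Without `-ffast-math` the compiler has to preserve the grouping written in the source; with it, the compiler is allowed to treat the two as interchangeable.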

Koru doesn’t use -ffast-math either. It gets its speed from struct

Source: Hacker News