NOW LET US – AI RAG SaaS Studio TP.HCM

Idiomatic Koru Kernels Match Hand-Specialized C


A performance benchmark reveals that idiomatic Koru kernels can match the speed of hand-optimized C code by providing rich semantic information to the compiler, effectively closing the expertise gap.

We wanted Koru kernels to land in the same ballpark as idiomatic C, Rust, and Zig.

The result was stronger than that.

Our fused n-body kernel, written in straightforward Koru kernel style, came in faster than the plain reference implementations. Every implementation here is “naive” – the obvious, idiomatic version a competent programmer would write in each language. No tricks, no hand-tuning, no -ffast-math.

| Implementation | Relative time (lower is faster) |
|---|---|
| Koru (kernel fused) | 1.00 |
| C | 1.14 |
| Zig | 1.14 |
| Rust | 1.17 |
| SBCL (kernel DSL) | 1.88 |
| SBCL (Common Lisp) | 2.03 |
| GHC (Haskell) | 2.58 |

Each run executes 50 million iterations. Timings come from hyperfine with 3 warmup runs and multiple timed runs, on the same machine and with the same benchmark harness for every implementation. All source code is available on GitHub.

A note on numbers: these are ballpark figures. Exact ratios shift slightly between benchmark runs depending on system load, thermal state, and scheduling. The relative ordering is stable – we’ve run this many times – but don’t read too much into the second decimal place.

That was already interesting. But it raised the obvious question: is Koru really producing unusually strong code here, or is the reference C just not shaped the way the optimizer wants?

So we followed up the honest way.

The Follow-Up

We wrote a fixed-size scalarized C version. No generic array-of-struct loop. No “for any N” shape. Just the exact five-body problem written in a form that mirrors what a compiler would like to see after aggressive specialization.

We compiled it twice: once with -ffast-math, once without. Here’s the full picture with the optimized variants included:

| Implementation | Relative time (lower is faster) |
|---|---|
| C (scalarized, no fast-math) | 1.00 |
| Koru (kernel fused) | 1.00 |
| C (scalarized, fast-math) | 1.00 |
| C | 1.12 |
| Zig | 1.14 |
| Rust | 1.15 |
| SBCL (kernel DSL) | 1.88 |
| SBCL (Common Lisp) | 2.03 |
| GHC (Haskell) | 2.58 |

The hand-specialized C closed the gap completely – and barely edged ahead.

That does not weaken the result. It sharpens it.

Why This Is Stronger, Not Weaker

The point was never “C is slow.” C is obviously not slow. And we would be worried if we couldn’t get hand-optimized C to run faster – that would mean the benchmark was broken, not that Koru was magic.

The point is that Koru kernel code is carrying enough semantic information that the compiler can lower it into code that is within 1% of expert hand-specialized C, without forcing the programmer to write the hand-specialized C in the first place.

That is a much stronger statement than “Koru beats C on one benchmark.”

The plain C reference already had restrict in the hot path. That means the programmer was already supplying aliasing knowledge manually. In other words, the C version was already asking the human to act as part-time optimizer.
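As a minimal sketch of what that manual aliasing knowledge looks like, here is a small function that uses restrict. The name axpy and its parameters are hypothetical and do not come from the benchmark source. The qualifier is a promise from the programmer that the two pointers never overlap, which lets the compiler keep loaded values in registers across stores.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the benchmark: y[i] += a * x[i].
 * `restrict` promises the compiler that x and y never overlap, so a
 * store to y[i] cannot change any element of x. The compiler does not
 * check this promise; breaking it is undefined behavior. */
void axpy(size_t n, double a,
          const double *restrict x, double *restrict y) {
    for (size_t i = 0; i < n; i++) {
        y[i] += a * x[i];
    }
}
```

Calling axpy with two separate arrays is fine. Calling it with overlapping ranges of the same array would break the promise, and that responsibility sits entirely with the human.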

This was the original hot loop shape:

static inline void advance(struct body * restrict b, double dt) {
    for (int i = 0; i < 5; i++) {
        for (int j = i + 1; j < 5; j++) {
            double dx = b[i].x - b[j].x;
            double dy = b[i].y - b[j].y;
            double dz = b[i].z - b[j].z;
            double dsq = dx * dx + dy * dy + dz * dz;
            double mag = dt / (dsq * sqrt(dsq));
            b[i].vx -= dx * b[j].mass * mag;
            b[i].vy -= dy * b[j].mass * mag;
            b[i].vz -= dz * b[j].mass * mag;
            b[j].vx += dx * b[i].mass * mag;
            b[j].vy += dy * b[i].mass * mag;
            b[j].vz += dz * b[i].mass * mag;
        }
    }
    for (int i = 0; i < 5; i++) {
        b[i].x += dt * b[i].vx;
        b[i].y += dt * b[i].vy;
        b[i].z += dt * b[i].vz;
    }
}

That is already good C. It is “naive” only in the sense of being the obvious version, not in the sense of being careless. It uses restrict. It uses fixed loop bounds. It uses the standard dsq * sqrt(dsq) trick. It keeps the data in a global fixed array.

And it was still slower than the fused Koru kernel.

So what did the C version need in order to catch up?

What We Had To Do In C

We had to stop writing “normal benchmark reference C” and start writing something much closer to a manual lowering of the optimized shape.

These were the key changes:

  • scalarize each body into separate locals instead of indexing through b[i]
  • make the problem fixed-size in the source, not just in the loop bounds
  • spell out all ten pair interactions explicitly
  • keep masses as separate constants
  • update positions in one fused straight-line block

This is the kind of helper the specialized version uses:

static inline void advance_pair(
    double xi, double yi, double zi,
    double xj, double yj, double zj,
    double mi, double mj,
    double *restrict vxi, double *restrict vyi, double *restrict vzi,
    double *restrict vxj, double *restrict vyj, double *restrict vzj
) {
    const double dx = xi - xj;
    const double dy = yi - yj;
    const double dz = zi - zj;
    const double dsq = dx * dx + dy * dy + dz * dz;
    const double mag = DT / (dsq * sqrt(dsq));
    *vxi -= dx * mj * mag;
    *vyi -= dy * mj * mag;
    *vzi -= dz * mj * mag;
    *vxj += dx * mi * mag;
    *vyj += dy * mi * mag;
    *vzj += dz * mi * mag;
}

And then the timestep becomes this:

advance_pair(x0, y0, z0, x1, y1, z1, m0, m1, &vx0, &vy0, &vz0, &vx1, &vy1, &vz1);
// ... 9 more pairs ...
x0 += DT * vx0; y0 += DT * vy0; z0 += DT * vz0;
// ... 4 more bodies ...

That is the version that matched Koru.

To be clear: this is valid, good C. But it is no longer just “write the obvious program.” It is a human taking the semantic problem and manually reshaping it into a form the optimizer can exploit more aggressively.

That is the expertise gap Koru is trying to close.

Koru kernels make those facts part of the model:

  • the data shape is explicit
  • the relationships between elements are explicit
  • pairwise interaction is explicit
  • per-element update is explicit
  • aliasing constraints are implied by the kernel abstraction itself

| kernel k |>
std.kernel:step(0..iterations)
|> std.kernel:pairwise {
    const dx = k.x - k.other.x;
    // ... logic ...
}
|> std.kernel:self {
    k.x += DT * k.vx;
    // ... logic ...
}

The matching C ends up looking like a manual lowering pass. Ten explicit pair updates. Scalarized locals for every body. A shape that is closer to compiler IR than to the original problem statement.

That is exactly the point of Koru kernels.

The fast-math Surprise

An interesting detail: the scalarized C compiled without -ffast-math was consistently the fastest. Not by much, but consistently.

This matters because -ffast-math relaxes IEEE 754 rules, which lets the C compiler reassociate and reorder floating-point operations freely. For this particular workload, the IEEE-compliant evaluation order turned out to be better for the pipeline than whatever reordering -ffast-math chose.
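Reordering is not a neutral transformation, because floating-point addition is not associative. A minimal sketch, with hypothetical function names, shows the same three values summed under two groupings. It assumes compilation without -ffast-math; with that flag the compiler is allowed to treat the two functions as interchangeable.

```c
/* Sum three values left to right, exactly as written: (a + b) + c. */
double sum_left(double a, double b, double c) {
    return (a + b) + c;
}

/* The same three values regrouped: a + (b + c). */
double sum_right(double a, double b, double c) {
    return a + (b + c);
}
```

With a = 1e20, b = -1e20, and c = 1.0, sum_left returns 1.0 while sum_right returns 0.0, because 1.0 is too small to survive being added to -1e20 first. This shows that the flag changes which program the compiler is allowed to generate. It does not by itself explain the timing difference.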

Koru doesn’t use -ffast-math either. It gets its speed from struct

© 2026 Now Let Us. All rights reserved.

Source: Hacker News
