Cloudflare's Gen 13 servers: trading cache for cores for 2x performance

Cloudflare has launched its 13th-generation servers, powered by AMD EPYC 5th Gen 'Turin' processors. By moving to a Rust-based request handling layer (FL2), the company overcame the new CPUs' cache limitations and unlocked a 2x performance boost.

Two years ago, Cloudflare deployed our 12th Generation server fleet, based on AMD EPYC™ Genoa-X processors with their massive 3D V-Cache. That cache-heavy architecture was a perfect match for FL1, our request handling layer at the time. But as we evaluated next-generation hardware, we faced a dilemma: the CPUs offering the biggest throughput gains came with a significant reduction in cache per core. Our legacy software stack wasn't optimized for this, and the potential throughput benefits were capped by rising latency.

This post describes how the transition to FL2, our Rust-based rewrite of Cloudflare's core request handling layer, allowed us to prove Gen 13's full potential and unlock performance gains that would have been impossible on our previous stack. FL2 removes the dependency on the larger cache, allowing performance to scale with core count while maintaining our SLAs. Today, we are proud to announce the launch of Cloudflare's Gen 13 servers, based on AMD EPYC™ 5th Gen Turin processors and running FL2.

What AMD EPYC Turin brings to the table

AMD's EPYC™ 5th Generation Turin-based processors deliver more than just a core count increase. The architecture improves on several dimensions that matter to Cloudflare servers:

2x core count: up to 192 cores versus Gen 12's 96 cores, with SMT providing 384 threads

Improved IPC: Zen 5's architectural improvements deliver better instructions-per-cycle compared to Zen 4

Better power efficiency: Despite the higher core count, Turin consumes up to 32% fewer watts per core compared to Genoa-X

DDR5-6400 support: Higher memory bandwidth to feed all those cores

However, Turin's high-density OPNs (part numbers, i.e. specific CPU models) make a deliberate tradeoff: they prioritize throughput over per-core cache. Our analysis across the Turin stack highlighted this shift. For example, the highest-density Turin OPN has 192 cores sharing 384MB of L3 cache. This leaves each core with just 2MB, one-sixth of the 12MB per core on our Gen 12 Genoa-X processors. For any workload that relies heavily on cache locality, which ours did, this reduction posed a serious challenge.

| Generation | Processor | Cores/Threads | L3 Cache/Core |
| :--- | :--- | :--- | :--- |
| Gen 12 | AMD Genoa-X 9684X | 96C/192T | 12MB (3D V-Cache) |
| Gen 13 Option 1 | AMD Turin 9755 | 128C/256T | 4MB |
| Gen 13 Option 2 | AMD Turin 9845 | 160C/320T | 2MB |
| Gen 13 Option 3 | AMD Turin 9965 | 192C/384T | 2MB |
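The per-core figures in the table can be cross-checked with a little arithmetic. The sketch below is illustrative only: the `SKUS` list and the `cache_summary` helper are names made up for this example, and the total L3 values are derived by multiplying cores by the per-core figure from the table rather than taken from a spec sheet.

```python
from fractions import Fraction

# (label, cores, L3 MB per core) copied from the table above.
SKUS = [
    ("Gen 12 Genoa-X 9684X", 96, 12),
    ("Gen 13 Turin 9755", 128, 4),
    ("Gen 13 Turin 9845", 160, 2),
    ("Gen 13 Turin 9965", 192, 2),
]


def cache_summary(skus, baseline_index=0):
    """Return (label, derived total L3 in MB, per-core cache relative to baseline)."""
    baseline_per_core = skus[baseline_index][2]
    rows = []
    for label, cores, l3_per_core_mb in skus:
        total_mb = cores * l3_per_core_mb  # derived, not a quoted spec
        relative = Fraction(l3_per_core_mb, baseline_per_core)
        rows.append((label, total_mb, relative))
    return rows


if __name__ == "__main__":
    for label, total_mb, relative in cache_summary(SKUS):
        print(f"{label}: {total_mb} MB total L3, {relative} of Gen 12 per-core cache")
```

For the 9965 this reproduces the 384MB total and the one-sixth ratio quoted in the text.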

For FL1, our request handling layer built on NGINX and LuaJIT, this cache reduction presented a significant challenge. But we didn't just assume it would be a problem; we measured it. Using the AMD uProf tool, we found that L3 cache miss rates increased dramatically compared to Gen 12, and that memory fetch latency dominated request processing time. An L3 cache hit completes in roughly 50 cycles; an L3 cache miss that requires a DRAM access takes 350+ cycles.
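To see why the miss rate matters so much, it helps to plug the hit and miss costs quoted above into a simple weighted average. This is a back-of-the-envelope sketch, not the measurement methodology: the hit rates in the loop are hypothetical values chosen for illustration, the function name is made up, and 350 cycles is treated as a lower bound for a miss.

```python
L3_HIT_CYCLES = 50      # approximate L3 hit cost quoted in the text
DRAM_MISS_CYCLES = 350  # lower bound for an L3 miss served from DRAM


def average_access_cycles(l3_hit_rate):
    """Expected cycles per access that reaches L3, for a given L3 hit rate."""
    if not 0.0 <= l3_hit_rate <= 1.0:
        raise ValueError("l3_hit_rate must be between 0 and 1")
    miss_rate = 1.0 - l3_hit_rate
    return l3_hit_rate * L3_HIT_CYCLES + miss_rate * DRAM_MISS_CYCLES


if __name__ == "__main__":
    # Hypothetical hit rates, for illustration only (not measured values).
    for hit_rate in (0.95, 0.80, 0.50):
        cycles = average_access_cycles(hit_rate)
        print(f"hit rate {hit_rate:.0%}: ~{cycles:.0f} cycles per access")
```

Because a miss costs about seven times as much as a hit, even a modest drop in hit rate moves the average noticeably, which is consistent with memory fetch latency dominating request processing time.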

The tradeoff: latency vs. throughput

Our initial tests running FL1 on Gen 13 confirmed that while the Turin processor could achieve higher throughput, it came at a steep latency cost.

The Gen 13 evaluation server with the AMD Turin 9965 delivered a 60% throughput gain, but with a latency penalty of more than 50%, which is not acceptable.

The opportunity: FL2 was already in progress

To truly unlock the performance potential of the Gen 13 architecture, we knew we would have to rewrite our software stack. Fortunately, we had already been rebuilding FL1 from the ground up. FL2 is a complete rewrite of our request handling layer in Rust, built on our Pingora and Oxy frameworks, replacing 15 years of NGINX and LuaJIT code.

FL2's cleaner architecture, with better memory access patterns and less dynamic allocation, doesn't depend on massive L3 caches the way FL1 did.

Proving it out: FL2 on Gen 13

As the FL2 rollout progressed, production metrics from our Gen 13 servers validated what we had hypothesized.

FL requests per CPU%: 50% higher than FL1
Latency vs Gen 12: 70% lower
Throughput vs Gen 12: 100% higher (2x gain)

By effectively eliminating the cache bottleneck, FL2 enables our throughput to scale linearly with core count. On the high-density AMD Turin 9965, we achieved a 2x performance gain, unlocking the full potential of the hardware.
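One way to make "scales linearly" concrete is to divide the throughput ratio by the core-count ratio. The sketch below does that with the figures reported in this post; it assumes the FL1 and FL2 throughput gains are both measured against the same 96-core Gen 12 baseline, and `scaling_efficiency` is a name invented for this example.

```python
def scaling_efficiency(throughput_gain_pct, baseline_cores, new_cores):
    """Throughput ratio divided by core-count ratio; 1.0 means linear scaling."""
    if baseline_cores <= 0 or new_cores <= 0:
        raise ValueError("core counts must be positive")
    throughput_ratio = 1 + throughput_gain_pct / 100
    core_ratio = new_cores / baseline_cores
    return throughput_ratio / core_ratio


if __name__ == "__main__":
    # Figures from the text: Gen 12 has 96 cores, the Turin 9965 has 192.
    fl1 = scaling_efficiency(60, baseline_cores=96, new_cores=192)
    fl2 = scaling_efficiency(100, baseline_cores=96, new_cores=192)
    print(f"FL1 on Turin 9965: {fl1:.2f} of linear scaling")
    print(f"FL2 on Turin 9965: {fl2:.2f} of linear scaling")
```

Under these assumptions FL1 captures about 0.8 of the added cores' potential while FL2 reaches 1.0, which matches the claim that throughput now tracks core count.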

Source: Hacker News