Modern Rendering Culling Techniques

An in-depth look at modern rendering culling techniques, explaining why manual optimization remains crucial in the age of AI and how methods like Hi-Z and Occlusion Culling boost performance.

Intro

In the modern era of AI coding, “AI game generation”, DLSS 5, Unreal Engine 5, and phenomenal Gaussian Splat demos, people tend to think graphics and games are solved problems. “Just grab AI and start building games within days,” they say. Obviously that’s bullshit. The hard engineering work, knowledge, tradeoffs, and art direction are not going anywhere. Whether your game is 2D or 3D, realistic or cartoonish, set in a closed Mars base or an open-world zombie-infested New York, you still need to optimize it. One of the most important optimizations every game has used, and will keep using, is culling.

Good news: almost 80% of the optimizations I’ve seen over my career boil down to “don’t do extra stupid work when you don’t need to.”

Bad news: you still need to implement culling while balancing scene structure, game design, art direction, hardware limits, and performance budgets.

So this article walks through the main culling techniques used in modern real-time renderers. I’ll group them by category so it’s easier to see how they relate to each other. Almost every one of these techniques deserves its own article, because as always, the devil is in the details.

1. The Basics: Distance, Backface, and Frustum

These are the cheapest and most universally applied techniques. They catch the obvious cases before anything more expensive runs.

Distance Culling

The simplest form: if an object is farther than some max distance from the camera, skip it. That’s it.

This is trivially fast and works well for small props where the visual impact of disappearing is minimal. Most engines let you set a cull distance per mesh or per material.
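As a rough illustration, the check can be as small as the sketch below. The `Prop` struct and its `cullDistance` field are made-up names for this example, not any particular engine's API.

```cpp
struct Vec3 { float x, y, z; };

// Hypothetical per-object data: a world-space position and a max draw distance.
struct Prop {
    Vec3 position;
    float cullDistance;
};

// True when the prop is farther from the camera than its cull distance.
// Squared distances are compared, so no square root is needed.
inline bool isDistanceCulled(const Prop& prop, const Vec3& camera) {
    const float dx = prop.position.x - camera.x;
    const float dy = prop.position.y - camera.y;
    const float dz = prop.position.z - camera.z;
    const float distSq = dx * dx + dy * dy + dz * dz;
    return distSq > prop.cullDistance * prop.cullDistance;
}
```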

The tricky part is avoiding visible pop-in. Common mitigations are dithered fade-out, aggressive LOD before the cull point, or impostors (billboards that replace the real mesh at distance).

A related check is covered in more detail in the Screen Size Culling section below, but it’s worth flagging here: if something projects to only a handful of pixels, it’s often not worth the cost of drawing it. Distance alone doesn’t catch that cleanly - you also want a screen-space size check.
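One way to sketch such a screen-space size check is to estimate how many pixels a bounding sphere's radius covers. The function names, the vertical-FOV parameterization, and the pixel threshold are assumptions for illustration. The formula is the usual small-object approximation and ignores perspective distortion near the screen edges.

```cpp
#include <cmath>

// Approximate on-screen radius, in pixels, of a bounding sphere.
// fovY is the vertical field of view in radians; distance is measured
// from the camera to the sphere centre.
inline float projectedRadiusPixels(float radius, float distance,
                                   float fovY, float screenHeight) {
    // Camera inside or touching the sphere: treat it as filling the screen.
    if (distance <= radius) return screenHeight;
    const float halfHeightAtDistance = distance * std::tan(fovY * 0.5f);
    return (radius / halfHeightAtDistance) * (screenHeight * 0.5f);
}

// True when the sphere covers fewer pixels than the chosen threshold.
inline bool isTooSmallToDraw(float radius, float distance, float fovY,
                             float screenHeight, float minPixels) {
    return projectedRadiusPixels(radius, distance, fovY, screenHeight) < minPixels;
}
```

Unlike a fixed cull distance, this scales with object size and FOV, so a large building and a small pebble at the same distance get different answers.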

Backface Culling

This is the first culling technique you’ll usually encounter when working with a graphics API because it’s configured as part of the pipeline state object (PSO) and is one of the easiest wins to enable.

Every triangle has a front face and a back face. For closed, opaque meshes, triangles facing away from the camera are never visible because front-facing triangles of the same mesh always cover them. The GPU can skip them automatically based on winding order, which removes roughly half the triangles from rasterization and fragment shading for typical geometry.

One thing worth knowing: in a traditional vertex + fragment pipeline, backface culling happens after the vertex shader has already processed the vertices. So you don’t save vertex work, only rasterization and fragment work. In more GPU-driven pipelines, you can move this decision earlier, for example in compute or task/amplification work that culls meshlets before they ever reach rasterization.
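The winding-order test itself boils down to the sign of the triangle's screen-space area. The sketch below assumes counter-clockwise front faces in a y-up coordinate system. Real APIs let you configure both the front-face winding and the cull mode, so treat the convention here as an assumption.

```cpp
struct Vec2 { float x, y; };

// Twice the signed area of triangle (a, b, c) in screen space.
// Positive when the vertices are in counter-clockwise order (y-up).
inline float signedArea2(const Vec2& a, const Vec2& b, const Vec2& c) {
    return (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y);
}

// With counter-clockwise front faces, a non-positive area means the
// triangle faces away from the camera or is degenerate (zero area).
inline bool isBackFacing(const Vec2& a, const Vec2& b, const Vec2& c) {
    return signedArea2(a, b, c) <= 0.0f;
}
```

The same test can run in a compute or task shader on projected meshlet triangles, which is how GPU-driven pipelines move the decision earlier.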

This is mostly free, but it’s worth understanding because it interacts with transparency, two-sided materials, and some culling algorithms that exploit it explicitly.

Frustum Culling

For a perspective camera, the view frustum is the truncated pyramid-shaped volume that represents what the camera can see. Anything outside of it doesn’t need to be rendered. Frustum culling tests objects, usually via bounding volumes like spheres or AABBs, against the six planes of the frustum and skips anything that lies entirely outside any one of them.
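A minimal sketch of the sphere-versus-planes version of that test follows. It assumes the six planes are already extracted, normalized, and oriented with normals pointing into the frustum. The struct names are illustrative.

```cpp
#include <array>

struct Vec3 { float x, y, z; };

// Plane in the form dot(normal, p) + d = 0. The normal is assumed to be
// unit length and to point towards the inside of the frustum.
struct Plane { Vec3 normal; float d; };

struct Sphere { Vec3 center; float radius; };

inline float signedDistance(const Plane& p, const Vec3& v) {
    return p.normal.x * v.x + p.normal.y * v.y + p.normal.z * v.z + p.d;
}

// True when the sphere lies entirely behind at least one plane.
// Spheres that straddle a plane are kept, so the test is conservative.
inline bool isOutsideFrustum(const std::array<Plane, 6>& frustum, const Sphere& s) {
    for (const Plane& p : frustum) {
        if (signedDistance(p, s.center) < -s.radius) return true;
    }
    return false;
}
```

Note that this can keep a sphere that sits just outside a frustum corner, since it only checks one plane at a time. That kind of false "visible" is harmless for correctness.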

This is almost always the first pass in a culling pipeline, or second after distance culling. It’s fast, cheap, and can cut a huge chunk of the scene in one shot, especially in open worlds where large portions of the map are behind or beside the camera.

Notice in the gif above that big objects like mountains are still rendered even when they’re almost outside the frustum. This is the core tradeoff with object-level culling: many small objects give you fine-grained culling opportunities but each one is a draw call and a CPU-side visibility test. A handful of large objects is cheap on draw calls, but you’re stuck rendering the whole thing even when 90% of its triangles are offscreen - and you pay vertex shader cost for all of them, since the rasterizer clips after vertex shading, not before. That wasted vertex work on off-screen geometry is exactly the problem meshlet culling in section 4 solves.

2. Occlusion Culling

Occlusion culling tells you what’s behind other things. It’s harder but often gives you the biggest win in dense scenes like cities or interiors.

Hardware Occlusion Queries

All major graphics APIs expose occlusion-query-style features. Direct3D 12 has query heaps, Vulkan has occlusion queries, and Metal has visibility result buffers. The idea is the same: render proxy geometry, typically the object’s bounds, and count whether any samples passed the depth test. Zero visible samples means the proxy was fully occluded from that view, so the real object can usually be skipped.

In DX12 you’d use D3D12_QUERY_TYPE_BINARY_OCCLUSION which returns just 0 or 1 rather than an exact sample count - cheaper and enough for culling.

The catch is latency and synchronization. Results only become readable on the CPU after the GPU has finished executing the queries, so in practice you often read frame N’s results while rendering frame N+1. That one-frame lag is usually acceptable, but for a frame it can keep rendering something that just became occluded, or skip something that just became visible.
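On the CPU side, the delayed results usually end up in some kind of per-object cache. The class below is a hypothetical sketch of one reasonable policy, not an API from any graphics library: objects without a result yet are drawn, so nothing is skipped before it has been tested at least once.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical CPU-side cache of occlusion query results.
class OcclusionResultCache {
public:
    // Record a result read back from a query issued in an earlier frame.
    void storeResult(std::uint32_t objectId, bool anySamplesPassed) {
        results_[objectId] = anySamplesPassed;
    }

    // Objects with no result yet are treated as visible.
    bool shouldDraw(std::uint32_t objectId) const {
        const auto it = results_.find(objectId);
        return it == results_.end() || it->second;
    }

private:
    std::unordered_map<std::uint32_t, bool> results_;
};
```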

Software Occlusion Culling (CPU)

Instead of asking the GPU, you rasterize a low-resolution depth buffer on the CPU and test objects against it. Intel’s Masked Software Occlusion Culling (MSOC) is probably the most well-known implementation here. It uses SIMD to rasterize triangles in 8x4 pixel tiles and can process millions of triangles per second.

The upside is zero readback latency since it all happens on the CPU before you submit anything to the GPU. The downside is CPU cost and the need to maintain a separate simplified occluder mesh, since you can’t afford to rasterize your full scene geometry.
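To make the idea concrete, here is a deliberately simplified sketch. It is not how MSOC works internally: occluders are axis-aligned screen rectangles at constant depth instead of rasterized triangles, there is no SIMD, and depth is assumed to run from 0 (near) to 1 (far). All names are made up for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Screen-space rectangle in pixels, half-open: [x0, x1) x [y0, y1).
struct Rect { int x0, y0, x1, y1; };

class CpuDepthBuffer {
public:
    CpuDepthBuffer(int width, int height)
        : width_(width), height_(height),
          depth_(static_cast<std::size_t>(width) * height, 1.0f) {}

    // Write an occluder at constant depth, keeping the nearest value per pixel.
    void drawOccluder(Rect r, float depth) {
        r = clipped(r);
        for (int y = r.y0; y < r.y1; ++y)
            for (int x = r.x0; x < r.x1; ++x) {
                float& d = depth_[index(x, y)];
                d = std::min(d, depth);
            }
    }

    // True only if every covered pixel holds an occluder that is nearer
    // than the object's nearest depth.
    bool isOccluded(Rect r, float nearestDepth) const {
        r = clipped(r);
        // Nothing on screen: leave the decision to frustum culling.
        if (r.x0 >= r.x1 || r.y0 >= r.y1) return false;
        for (int y = r.y0; y < r.y1; ++y)
            for (int x = r.x0; x < r.x1; ++x)
                if (depth_[index(x, y)] >= nearestDepth) return false;
        return true;
    }

private:
    Rect clipped(Rect r) const {
        return Rect{std::max(r.x0, 0), std::max(r.y0, 0),
                    std::min(r.x1, width_), std::min(r.y1, height_)};
    }
    std::size_t index(int x, int y) const {
        return static_cast<std::size_t>(y) * width_ + x;
    }

    int width_;
    int height_;
    std::vector<float> depth_;
};
```

The structure is the part that carries over to a real implementation: draw a small set of occluders first, then test each object's screen bounds and nearest depth against the result.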

Hi-Z (Hierarchical Z-Buffer)

Hi-Z is a mip chain of the depth buffer, often called a depth pyramid, where each level stores a conservative depth value for a larger region of the screen.

To test whether an object is occluded, you project its bounds to screen space, choose the mip level that roughly matches its footprint, and compare the object’s nearest depth against the pyramid. For a conventional LESS depth test this pyramid often stores the maximum depth in each region; with reversed-Z it is typically the minimum. The important part is that the representation stays conservative. If the test says “occluded”, you can safely skip the object. If not, you keep it. Good implementations err toward drawing an object that is actually hidden rather than culling one that is actually visible.
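The build and query steps can be sketched on the CPU as follows. This is only an illustration of the logic: a real implementation runs in compute shaders on GPU textures. The sketch assumes a square power-of-two depth buffer, a conventional depth range (0 = near, 1 = far, so each level stores the maximum), and object bounds already clamped to the screen. All names are made up.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

struct DepthLevel {
    int size;                   // width == height
    std::vector<float> texels;
    float at(int x, int y) const {
        return texels[static_cast<std::size_t>(y) * size + x];
    }
};

// Builds a max-depth pyramid: each texel stores the farthest depth of the
// 2x2 texels below it.
inline std::vector<DepthLevel> buildHiZ(const std::vector<float>& depth, int size) {
    std::vector<DepthLevel> levels;
    levels.push_back(DepthLevel{size, depth});
    while (levels.back().size > 1) {
        const DepthLevel& src = levels.back();
        const int half = src.size / 2;
        DepthLevel dst{half, std::vector<float>(static_cast<std::size_t>(half) * half)};
        for (int y = 0; y < half; ++y)
            for (int x = 0; x < half; ++x)
                dst.texels[static_cast<std::size_t>(y) * half + x] = std::max(
                    std::max(src.at(2 * x, 2 * y), src.at(2 * x + 1, 2 * y)),
                    std::max(src.at(2 * x, 2 * y + 1), src.at(2 * x + 1, 2 * y + 1)));
        levels.push_back(std::move(dst));
    }
    return levels;
}

// Bounds are level-0 pixels, inclusive: [x0, x1] x [y0, y1].
inline bool isOccludedHiZ(const std::vector<DepthLevel>& hiz,
                          int x0, int y0, int x1, int y1, float nearestDepth) {
    // Pick the first level where the bounds span at most 2 texels per axis.
    int level = 0;
    while (level + 1 < static_cast<int>(hiz.size()) &&
           ((x1 >> level) - (x0 >> level) > 1 || (y1 >> level) - (y0 >> level) > 1))
        ++level;
    float farthest = 0.0f;
    for (int ty = y0 >> level; ty <= (y1 >> level); ++ty)
        for (int tx = x0 >> level; tx <= (x1 >> level); ++tx)
            farthest = std::max(farthest, hiz[level].at(tx, ty));
    // Occluded only if the object's nearest point is behind the farthest occluder.
    return nearestDepth > farthest;
}
```

Picking a coarser level keeps the query to at most four texel reads regardless of how big the object is on screen, at the price of a more conservative answer.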

This is the basis for most GPU-driven occlusion culling today. It’s fast to build and query, and it lives entirely on the GPU.

Two-Pass Occlusion Culling

A common pattern in GPU-driven renderers: use the previous frame’s Hi-Z to cull objects before rendering the current frame. The simple version is one pass: cull everything against last frame’s Hi-Z, render what survives. It’s cheap, but objects that just became visible get wrongly culled. The two-pass version fixes that: after the first pass, rebuild the Hi-Z from the depth buffer you just rendered, re-test the objects that were rejected, and render the ones that turn out to be visible.
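The control flow can be sketched like this. The two callbacks stand in for Hi-Z tests against the previous frame's pyramid and against one rebuilt after the first pass. They are placeholders for this example. A real renderer would do all of this on the GPU with indirect draws rather than CPU-side lists.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Indices of the objects drawn in each pass.
struct TwoPassResult {
    std::vector<std::size_t> firstPass;
    std::vector<std::size_t> secondPass;
};

// Hypothetical occlusion test: returns true if the object is occluded.
using OcclusionTest = std::function<bool(std::size_t objectIndex)>;

inline TwoPassResult cullTwoPass(std::size_t objectCount,
                                 const OcclusionTest& occludedInPreviousHiZ,
                                 const OcclusionTest& occludedInCurrentHiZ) {
    TwoPassResult result;
    std::vector<std::size_t> rejected;

    // Pass 1: cull against last frame's Hi-Z and draw the survivors.
    for (std::size_t i = 0; i < objectCount; ++i) {
        if (occludedInPreviousHiZ(i)) rejected.push_back(i);
        else result.firstPass.push_back(i);
    }

    // A real renderer would draw firstPass here and rebuild the Hi-Z from
    // the resulting depth buffer before the second test runs.

    // Pass 2: re-test only what pass 1 rejected and draw what is now visible.
    for (std::size_t i : rejected) {
        if (!occludedInCurrentHiZ(i)) result.secondPass.push_back(i);
    }
    return result;
}
```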

Source: Hacker News