Zero-Copy GPU Inference from WebAssembly on Apple Silicon

A new technique enables WebAssembly modules to share linear memory directly with the GPU on Apple Silicon, eliminating data copying and serialization overhead. The approach relies on Apple's Unified Memory Architecture; it keeps the Wasm-to-GPU boundary nearly free during AI inference and makes inference state easy to save and restore.

tl;dr: on Apple Silicon, a WebAssembly module's linear memory can be shared directly with the GPU: no copies, no serialization, no intermediate buffers. The CPU and GPU read and write the same physical bytes. End-to-end, it works: a Wasm guest fills a matrix in its linear memory, the GPU reads it, computes, writes back, and the guest sees the result through the same pointer, same memory, zero copies.

Normally Wasm and GPUs are separated by an expensive serialization boundary: on most hardware, getting data from a VM sandbox to an accelerator means copying across a bus. Apple Silicon's Unified Memory Architecture erases that boundary (no bus, same physical memory), and what falls out is a runtime where Wasm is the control plane and the GPU is the compute plane, with near-zero overhead between them.

Why this is normally hard

WebAssembly gives you a sandbox. Your module gets a flat byte array (linear memory) and that's the universe. GPUs also want a flat byte array, but a specific kind: page-aligned, pinned, accessible to the DMA engine. On a discrete GPU, that memory sits across a PCIe bus from the CPU, so getting data from a Wasm module's linear memory to the GPU means: copy out of the sandbox into host memory, then copy across the bus into GPU memory. Two copies, two latency hits.

Apple Silicon changes the physics. The CPU and GPU share the same physical memory... no bus! The real question: can you thread a single pointer to that memory through every layer of abstraction without anyone making a defensive copy along the way? Turns out... you can!

The three-link chain

mmap gives you page-aligned memory. On ARM64 macOS the page size is 16 KB, so mmap returns 16 KB-aligned addresses, which is the alignment Metal requires for no-copy buffers.
Metal accepts that pointer without copying. MTLDevice.makeBuffer(bytesNoCopy:length:options:deallocator:) wraps an existing page-aligned pointer as a Metal buffer. On Apple Silicon, this is the zero-copy path.
Wasmtime lets you bring your own allocator. Wasmtime's MemoryCreator trait lets you control how linear memory is allocated. You provide the mmap region, and Wasmtime uses it directly.
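
The alignment requirement in the first link can be checked from any language that exposes mmap. The following is a minimal Python sketch, not the author's code: it maps an anonymous region, rounds the length up to whole pages, and reads back the base address. The helper names (round_up_to_page, map_region, base_address) are hypothetical, and the page size is whatever the host reports (16 KB on ARM64 macOS, often 4 KB elsewhere).

```python
import ctypes
import mmap


def round_up_to_page(nbytes: int, page_size: int = mmap.PAGESIZE) -> int:
    """Round a byte count up to a whole number of pages."""
    return (nbytes + page_size - 1) // page_size * page_size


def map_region(nbytes: int) -> mmap.mmap:
    """Create an anonymous read/write mapping sized to whole pages."""
    return mmap.mmap(-1, round_up_to_page(nbytes))


def base_address(region: mmap.mmap) -> int:
    """Return the address of the first mapped byte (no copy is made)."""
    return ctypes.addressof(ctypes.c_char.from_buffer(region))


# Hypothetical workload: one 128x128 matrix of 32-bit floats.
region = map_region(128 * 128 * 4)
is_page_aligned = base_address(region) % mmap.PAGESIZE == 0
is_whole_pages = len(region) % mmap.PAGESIZE == 0
```

A region that passes both checks has the shape of pointer and length that a no-copy Metal buffer expects. The Metal and Wasmtime calls themselves are not shown here.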

The composition: allocate an mmap region, hand it to both Wasmtime and Metal. The Wasm module writes data, the GPU computes on it in place, and the results appear in the module's linear memory with no copies.
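
To make the composition concrete, here is a small Python sketch under loud assumptions: two memoryview objects over one mmap region stand in for the Wasm guest's view and the GPU's view, and fake_kernel_scale is a CPU stand-in for a GPU kernel. None of these names come from the original project, and no Wasmtime or Metal API is involved; the point is only that both sides address the same bytes.

```python
import mmap

FLOAT_BYTES = 4  # 32-bit floats


def make_shared_views(n_floats: int):
    """Map whole pages once and return two float views over the same bytes."""
    nbytes = n_floats * FLOAT_BYTES
    pages = -(-nbytes // mmap.PAGESIZE)  # ceiling division
    region = mmap.mmap(-1, pages * mmap.PAGESIZE)
    guest = memoryview(region).cast("f")   # stand-in for the Wasm guest's view
    device = memoryview(region).cast("f")  # stand-in for the GPU's view
    return region, guest, device


def fake_kernel_scale(device, count: int, factor: float) -> None:
    """CPU stand-in for a GPU kernel: scale `count` floats in place."""
    for i in range(count):
        device[i] = device[i] * factor


def guest_roundtrip(values, factor: float):
    """Guest writes inputs, the 'device' computes in place, guest reads back."""
    region, guest, device = make_shared_views(len(values))
    for i, v in enumerate(values):
        guest[i] = v
    fake_kernel_scale(device, len(values), factor)
    # The guest reads results through its own view; nothing was copied over.
    return [guest[i] for i in range(len(values))]
```

Calling guest_roundtrip([1.0, 2.0, 3.5], 2.0) returns the scaled values through the guest's view, which is the same observation the end-to-end test makes: the result shows up behind the pointer the guest already had.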

Measurements and AI Inference

Testing with a 128x128 matrix multiply showed no incorrect results and effectively no memory overhead (an RSS delta of about 0.03 MB). When the technique was applied to Llama 3.2 1B using Apple's MLX framework, the overhead of the host function boundary was negligible.
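
A check like the matrix-multiply test can be sketched as follows. This is an illustration, not the original harness: the source does not say how the matrices were laid out or how errors were counted, so the back-to-back layout (A, then B, then C), the 32-bit float element size, the tolerance, and the function names are all assumptions.

```python
F32_BYTES = 4
APPLE_PAGE_BYTES = 16 * 1024  # page size on ARM64 macOS


def matrix_layout(n: int, page_bytes: int = APPLE_PAGE_BYTES) -> dict:
    """Float offsets for A, B, C packed back to back, plus the pages needed."""
    per_matrix = n * n
    total_bytes = 3 * per_matrix * F32_BYTES
    pages = -(-total_bytes // page_bytes)  # ceiling division
    return {"a": 0, "b": per_matrix, "c": 2 * per_matrix,
            "total_bytes": total_bytes, "pages": pages}


def reference_matmul(mem, n: int, layout: dict) -> list:
    """CPU reference: compute A @ B from the shared memory as a flat list."""
    a, b = layout["a"], layout["b"]
    out = [0.0] * (n * n)
    for i in range(n):
        for k in range(n):
            a_ik = mem[a + i * n + k]
            for j in range(n):
                out[i * n + j] += a_ik * mem[b + k * n + j]
    return out


def count_mismatches(mem, n: int, layout: dict, tol: float = 1e-4) -> int:
    """Count entries of C that differ from the CPU reference beyond `tol`."""
    expected = reference_matmul(mem, n, layout)
    c = layout["c"]
    return sum(
        1 for i in range(n * n)
        if abs(mem[c + i] - expected[i]) > tol * max(1.0, abs(expected[i]))
    )
```

Under these assumptions, three 128x128 float matrices occupy 196,608 bytes, which is exactly twelve 16 KB pages, so the whole working set fits in a small page-aligned region.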

This technique is particularly powerful for KV cache portability. Because the KV cache lives in the same GPU-accessible memory, serializing that memory saves the state of an AI conversation, and loading it back restores the conversation without recomputation. For 24 tokens, restoring from disk was 5.45x faster than re-computing the prefill. The speedup is reported to scale linearly with context length, potentially reaching 100x for stateful AI actors.
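
The save-and-restore idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the project's serialization format: the region here is an arbitrary mmap buffer standing in for KV cache memory, and save_region, restore_region, and roundtrip_demo are hypothetical names.

```python
import mmap
import os
import tempfile


def save_region(region: mmap.mmap, path: str) -> None:
    """Write the raw bytes of the mapped region to disk."""
    with open(path, "wb") as f:
        f.write(region)  # mmap exposes the buffer protocol


def restore_region(path: str, nbytes: int) -> mmap.mmap:
    """Read `nbytes` from disk straight into a fresh anonymous mapping."""
    region = mmap.mmap(-1, nbytes)
    view = memoryview(region)
    with open(path, "rb") as f:
        filled = 0
        while filled < nbytes:
            got = f.readinto(view[filled:])
            if not got:
                raise EOFError("snapshot is shorter than the region")
            filled += got
    return region


def roundtrip_demo() -> bool:
    """Save a region, restore it elsewhere, and compare the bytes."""
    original = mmap.mmap(-1, mmap.PAGESIZE)
    original[:5] = b"kv-01"  # placeholder bytes standing in for cache contents
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        save_region(original, path)
        restored = restore_region(path, len(original))
        return restored[:] == original[:]
    finally:
        os.remove(path)
```

Restoring is a sequential read into page-aligned memory, while prefill has to recompute attention state for every token, which is why the gap would be expected to widen as the context grows.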

Source: Hacker News