Flash-MoE: Running a 397B Parameter Model on a Laptop

A breakthrough inference engine written in pure C/Metal allows a 397B parameter MoE model to run on a 48GB RAM MacBook Pro. By streaming weights directly from SSD, it achieves 4.4+ tokens/s with production-quality output.

Read the paper— Full technical details, 90+ experiments, and the story of how an AI and a human built this in 24 hours.

Pure C/Metal inference engine that runs Qwen3.5-397B-A17B (a 397 billion parameter Mixture-of-Experts model) on a MacBook Pro with 48GB RAM at 4.4+ tokens/second with production-quality output including tool calling.

The entire 209GB model streams from SSD through a custom Metal compute pipeline. No Python. No frameworks. Just C, Objective-C, and hand-tuned Metal shaders.

| Configuration | tok/s | Quality | Notes | |---|---|---|---| | 4-bit experts, FMA kernel | 4.36 | Excellent | Current best. Full tool calling. 209GB on disk. | | 4-bit experts, baseline | 3.90 | Excellent | Before FMA kernel optimization. | | 2-bit experts, trust OS | 5.74 | Good* | 120GB on disk. Breaks JSON/tool calling. | | 2-bit peak single token | 7.05 | Good | Warm cache burst. *Not suitable for tool use. |

*2-bit quantization produces \name\ instead of "name" in JSON output, making tool calling unreliable. 4-bit is the production configuration.

Machine: MacBook Pro, Apple M3 MaxChip: 16-core CPU (12P + 4E), 40-core GPU, 16-core ANEMemory: 48 GB unified (~400 GB/s bandwidth)SSD: 1TB Apple Fabric,17.5 GB/s sequential read(measured)macOS: 26.2 (Darwin 25.2.0)

The model has 60 transformer layers: 45 GatedDeltaNet (linear attention) + 15 standard full attention. Each layer has 512 experts, of which K=4 are activated per token (plus one shared expert). Hidden dimension is 4096.

SSD Expert Streaming— Expert weights (209GB at 4-bit) are read from NVMe SSD on demand via parallelpread() with GCD dispatch groups. Only the K=4 active experts per layer are loaded (~6.75MB each). The OS page cache manages caching — no custom cache needed ("Trust the OS" principle). Inspired by Apple's "LLM in a Flash" paper.
FMA-Optimized Dequant Kernel— The inner loop of the 4-bit dequantized matrix-vector multiply rearranges the math from(nibble * scale + bias) * x tofma(nibble, scale*x, bias*x). Pre-computingscale*x andbias*x lets the GPU fused multiply-add unit do dequant+multiply in one instruction. 12% faster than the naive formulation.
Metal Compute Shaders— Hand-written Metal kernels for: 4-bit and 2-bit dequantized matrix-vector multiply, Fused SwiGLU activation, RMS normalization, Batched GPU attention, GPU RoPE, MoE combine + residual + sigmoid gate.
Deferred GPU Expert Compute— CMD3 (expert forward pass) is submitted without waiting. The GPU executes it while the CPU prepares the next layer.
Accelerate BLAS for Linear Attention— The GatedDeltaNet recurrence usescblas_sscal, cblas_sgemv, and cblas_sger for the 64-head × 128×128 state matrix update. 64% faster than scalar code.
Trust the OS— No custom expert cache. The OS page cache (~35GB) manages expert data caching via standard LRU. The page cache achieves ~71% hit rate naturally.