Flash-Moe: Running a 397B Parameter Model on a Mac with 48GB RAM

Flash-Moe enables running the massive Qwen3.5-397B model on a 48GB RAM MacBook Pro at 4.4+ tokens/second by streaming weights directly from SSD. Built with pure C and Metal, it bypasses traditional memory constraints through innovative expert-loading techniques and hardware-level optimizations.

Read the paper— Full technical details, 90+ experiments, and the story of how an AI and a human built this in 24 hours.

Pure C/Metal inference engine that runs Qwen3.5-397B-A17B (a 397 billion parameter Mixture-of-Experts model) on a MacBook Pro with 48GB RAM at 4.4+ tokens/second with production-quality output including tool calling.

The entire 209GB model streams from SSD through a custom Metal compute pipeline. No Python. No frameworks. Just C, Objective-C, and hand-tuned Metal shaders.

| Configuration | tok/s | Quality | Notes | |---|---|---|---| | 4-bit experts, FMA kernel | 4.36 | Excellent | Current best. Full tool calling. 209GB on disk. | | 4-bit experts, baseline | 3.90 | Excellent | Before FMA kernel optimization. | | 2-bit experts, trust OS | 5.74 | Good* | 120GB on disk. Breaks JSON/tool calling. | | 2-bit peak single token | 7.05 | Good | Warm cache burst. *Not suitable for tool use. |

*2-bit quantization produces \name\ instead of "name" in JSON output, making tool calling unreliable. 4-bit is the production configuration.

Machine: MacBook Pro, Apple M3 Max Chip: 16-core CPU (12P + 4E), 40-core GPU, 16-core ANE Memory: 48 GB unified (~400 GB/s bandwidth) SSD: 1TB Apple Fabric, 17.5 GB/s sequential read (measured) macOS: 26.2 (Darwin 25.2.0)

The model has 60 transformer layers: 45 GatedDeltaNet (linear attention) + 15 standard full attention. Each layer has 512 experts, of which K=4 are activated per token (plus one shared expert). Hidden dimension is 4096.

SSD Expert Streaming— Expert weights (209GB at 4-bit) are read from NVMe SSD on demand via parallel pread() with GCD dispatch groups. Only the K=4 active experts per layer are loaded (~6.75MB each). The OS page cache manages caching — no custom cache needed ("Trust the OS" principle). Inspired by Apple's "LLM in a Flash" paper.
FMA-Optimized Dequant Kernel— The inner loop of the 4-bit dequantized matrix-vector multiply rearranges the math from (nibble * scale + bias) * x to fma(nibble, scale*x, bias*x). Pre-computing scale*x and bias*x lets the GPU fused multiply-add unit do dequant+multiply in one instruction. 12% faster than the naive formulation.
Metal Compute Shaders— Hand-written Metal kernels for: 4-bit and 2-bit dequantized matrix-vector multiply, Fused SwiGLU activation, RMS normalization, Batched GPU attention, GPU RoPE, and MoE combine.
Deferred GPU Expert Compute— CMD3 (expert forward pass) is submitted without waiting. The GPU executes it while the CPU prepares the next layer.
Accelerate BLAS for Linear Attention— The GatedDeltaNet recurrence uses cblas_sscal, cblas_sgemv, and cblas_sger. 64% faster than scalar code.
Trust the OS— No custom expert cache. The OS page cache (~35GB) manages expert data caching via standard LRU. The page cache achieves ~71% hit rate naturally.

On Apple Silicon, SSD DMA and GPU compute share the same memory controller and cannot be profitably overlapped. The serial pipeline (GPU → SSD → GPU) is hardware-optimal.

This is a primary development machine. The engine explicitly controls memory: