NOW LET US – AI RAG SaaS Studio TP.HCM
NOW LET US
Digital Product Studio
Back to news
DEV-TOOLS...2 min read

Flash-Moe: Running a 397B Parameter Model on a Mac with 48GB RAM

Share
NOW LET US Article – Flash-Moe: Running a 397B Parameter Model on a Mac with 48GB RAM

Flash-Moe enables running the massive Qwen3.5-397B model on a 48GB RAM MacBook Pro at 4.4+ tokens/second by streaming weights directly from SSD. Built with pure C and Metal, it bypasses traditional memory constraints through innovative expert-loading techniques and hardware-level optimizations.

Read the paper— Full technical details, 90+ experiments, and the story of how an AI and a human built this in 24 hours.

Pure C/Metal inference engine that runs Qwen3.5-397B-A17B (a 397 billion parameter Mixture-of-Experts model) on a MacBook Pro with 48GB RAM at 4.4+ tokens/second with production-quality output including tool calling.

The entire 209GB model streams from SSD through a custom Metal compute pipeline. No Python. No frameworks. Just C, Objective-C, and hand-tuned Metal shaders.

| Configuration | tok/s | Quality | Notes | |---|---|---|---| | 4-bit experts, FMA kernel | 4.36 | Excellent | Current best. Full tool calling. 209GB on disk. | | 4-bit experts, baseline | 3.90 | Excellent | Before FMA kernel optimization. | | 2-bit experts, trust OS | 5.74 | Good* | 120GB on disk. Breaks JSON/tool calling. | | 2-bit peak single token | 7.05 | Good | Warm cache burst. *Not suitable for tool use. |

*2-bit quantization produces \name\ instead of "name" in JSON output, making tool calling unreliable. 4-bit is the production configuration.

Machine: MacBook Pro, Apple M3 Max Chip: 16-core CPU (12P + 4E), 40-core GPU, 16-core ANE Memory: 48 GB unified (~400 GB/s bandwidth) SSD: 1TB Apple Fabric, 17.5 GB/s sequential read (measured) macOS: 26.2 (Darwin 25.2.0)

The model has 60 transformer layers: 45 GatedDeltaNet (linear attention) + 15 standard full attention. Each layer has 512 experts, of which K=4 are activated per token (plus one shared expert). Hidden dimension is 4096.

  • SSD Expert Streaming— Expert weights (209GB at 4-bit) are read from NVMe SSD on demand via parallel pread() with GCD dispatch groups. Only the K=4 active experts per layer are loaded (~6.75MB each). The OS page cache manages caching — no custom cache needed ("Trust the OS" principle). Inspired by Apple's "LLM in a Flash" paper.
  • FMA-Optimized Dequant Kernel— The inner loop of the 4-bit dequantized matrix-vector multiply rearranges the math from (nibble * scale + bias) * x to fma(nibble, scale*x, bias*x). Pre-computing scale*x and bias*x lets the GPU fused multiply-add unit do dequant+multiply in one instruction. 12% faster than the naive formulation.
  • Metal Compute Shaders— Hand-written Metal kernels for: 4-bit and 2-bit dequantized matrix-vector multiply, Fused SwiGLU activation, RMS normalization, Batched GPU attention, GPU RoPE, and MoE combine.
  • Deferred GPU Expert Compute— CMD3 (expert forward pass) is submitted without waiting. The GPU executes it while the CPU prepares the next layer.
  • Accelerate BLAS for Linear Attention— The GatedDeltaNet recurrence uses cblas_sscal, cblas_sgemv, and cblas_sger. 64% faster than scalar code.
  • Trust the OS— No custom expert cache. The OS page cache (~35GB) manages expert data caching via standard LRU. The page cache achieves ~71% hit rate naturally.

On Apple Silicon, SSD DMA and GPU compute share the same memory controller and cannot be profitably overlapped. The serial pipeline (GPU → SSD → GPU) is hardware-optimal.

This is a primary development machine. The engine explicitly controls memory:

  • Non-expert weights: 5.5GB (mmap'd, read-only)
  • Metal scratch buffers: ~200MB
  • Total: ~6GB, leaving 42GB for OS + page cache
  • No OOM risk. Expert data streams from SSD on demand.
© 2026 Now Let Us. All rights reserved.

Source: Hacker News

Advertisement
Ad slot ready: 5887729102

More in this category

EXPLORE TOPICS

Discover All Categories

Deep dive into the specific technology sectors that matter most to you.