TurboQuant KV Compression and SSD Expert Streaming for M5 Pro and IOS

SwiftLM is a blazingly fast, native Swift inference server for Apple Silicon that eliminates Python overhead. It introduces hybrid TurboQuant KV compression and experimental SSD streaming to run massive 122B+ models on consumer hardware.
A blazingly fast, native Swift inference server that serves MLX models with a strict OpenAI-compatible API.
No Python runtime, no Global Interpreter Lock (GIL), no unnecessary memory copies. Just bare-metal Apple Silicon performance compiled to a single binary.
- 🍎 100% Native Apple Silicon: Powered natively by Metal and Swift.
- 🔌 OpenAI-compatible: Drop-in replacement for OpenAI SDKs (
/v1/chat/completions, streaming, etc). - 🧠 Smart Model Routing: Loads HuggingFace format models directly, with native Safetensors parsing.
- ⚡️ TurboQuantization Integrated: Custom low-level MLX Metal primitives that apply extremely fast quantization for KV caching out-of-the-box.
- 💾 SSD Expert Streaming: Experimental zero-copy streaming that swaps Mixture of Experts (MoE) layers directly from the NVMe SSD to the GPU command buffer without trashing macOS Unified Memory.
- 🎛️ Granular Memory Control: Integrated Layer Partitioning and Wisdom Auto-Calibration for squeezing massive models into RAM.
SwiftLM implements a hybrid V2+V3 TurboQuant architecture for on-the-fly KV cache compression. At roughly ~3.6 bits per coordinate overall, the KV cache is compressed ~3.5× vs FP16 with near-zero accuracy loss.
The "Holy Grail" hybrid: We ported the V3 non-linear Lloyd-Max codebooks directly into the native C++ encoding path, and process the dequantization natively in fused Metal shaders. This achieves V3 quality at V2 speeds, completely detached from Python overhead.
Benchmarks on M5 Pro:
- Machine: MacBook Pro, Apple M5 Pro
- Memory: 64 GB Unified Memory
- Model: Qwen3.5-122B-A10B-4bit
- SSD: Internal Apple NVMe (Zero-Copy Streaming)
iOS Support: A native iPhone & iPad companion app that downloads MLX models directly from HuggingFace and runs inference on-device via MLX Swift.
Source: Hacker News












