TurboQuant KV Compression and SSD Expert Streaming for M5 Pro and IOS

SwiftLM is a blazingly fast, native Swift inference server for Apple Silicon that eliminates Python overhead. It introduces hybrid TurboQuant KV compression and experimental SSD streaming to run massive 122B+ models on consumer hardware.

A blazingly fast, native Swift inference server that serves MLX models with a strict OpenAI-compatible API.

No Python runtime, no Global Interpreter Lock (GIL), no unnecessary memory copies. Just bare-metal Apple Silicon performance compiled to a single binary.

🍎 100% Native Apple Silicon: Powered natively by Metal and Swift.
🔌 OpenAI-compatible: Drop-in replacement for OpenAI SDKs (/v1/chat/completions, streaming, etc).
🧠 Smart Model Routing: Loads HuggingFace format models directly, with native Safetensors parsing.
⚡️ TurboQuantization Integrated: Custom low-level MLX Metal primitives that apply extremely fast quantization for KV caching out-of-the-box.
💾 SSD Expert Streaming: Experimental zero-copy streaming that swaps Mixture of Experts (MoE) layers directly from the NVMe SSD to the GPU command buffer without trashing macOS Unified Memory.
🎛️ Granular Memory Control: Integrated Layer Partitioning and Wisdom Auto-Calibration for squeezing massive models into RAM.

SwiftLM implements a hybrid V2+V3 TurboQuant architecture for on-the-fly KV cache compression. At roughly ~3.6 bits per coordinate overall, the KV cache is compressed ~3.5× vs FP16 with near-zero accuracy loss.

The "Holy Grail" hybrid: We ported the V3 non-linear Lloyd-Max codebooks directly into the native C++ encoding path, and process the dequantization natively in fused Metal shaders. This achieves V3 quality at V2 speeds, completely detached from Python overhead.

Benchmarks on M5 Pro:

Machine: MacBook Pro, Apple M5 Pro
Memory: 64 GB Unified Memory
Model: Qwen3.5-122B-A10B-4bit
SSD: Internal Apple NVMe (Zero-Copy Streaming)

iOS Support: A native iPhone & iPad companion app that downloads MLX models directly from HuggingFace and runs inference on-device via MLX Swift.

Source: Hacker News

More in this category

dev-tools

GLM 5.2 Is Out

Zhipu AI has officially released GLM-5.2, its most powerful open-source model to date, featuring a 1M context window and advanced long-horizon task capabilities. The release underscores Zhipu's commitment to open-source AI and global scientific collaboration amid rising technological restrictions.

dev-tools

Noise infusion banned from statistical products published by Census Bureau

The U.S. Department of Commerce has banned "noise infusion" from statistical products published by the Census Bureau, a decision that could have severe consequences for both data utility and privacy protection.

dev-tools

Treating pancreatic tumours may have revealed cancer's master switch

A promising new drug called daraxonrasib has shown breakthrough results in treating pancreatic cancer, doubling median survival times. This achievement could pave the way for an entirely new class of cancer treatments.

dev-tools

Every Frame Perfect

In UI design, perfection isn't just about the start and end states, but every single transition frame in between. Polishing these micro-interactions is key to building user trust.

dev-tools

Leaving Mozilla

A poignant and candid reflection from a 15-year Mozilla veteran upon their departure. The author highlights the leadership's missteps in trying to emulate tech giants and urges Mozilla to return to its core values: community and uniqueness.

dev-tools

Shepherd's Dog: A Game by the Most Dangerous AI Model

A developer tested Anthropic's latest, supposedly 'too dangerous' AI model by asking it to build a long-held game idea in a single shot. The model succeeded, generating a complete 2,319-line game after a 45-minute reasoning session.

EXPLORE TOPICS