Run a 1T-parameter model on a 32 GB Mac by streaming tensors from NVMe

Hypura is a storage-tier-aware LLM inference scheduler for Apple Silicon that enables running massive models exceeding physical memory by streaming tensors from NVMe.

 _   _                                
| | | |_   _ _ __  _   _ _ __ __ _ 
| |_| | | | | '_ \| | | | '__/ _` |
|  _  | |_| | |_) | |_| | | | (_| |
|_| |_|\__, | .__/ \__,_|_|  \__,_|
       |___/|_|                    
Run models too big for your Mac's memory

Hypura is a storage-tier-aware LLM inference scheduler for Apple Silicon. It places model tensors across GPU, RAM, and NVMe tiers based on access patterns, bandwidth costs, and hardware capabilities — enabling models that exceed physical memory to run without crashing the system.

Run a 31 GB Mixtral 8x7B on a 32 GB Mac Mini at 2.2 tok/s, or a 40 GB Llama 70B at 0.3 tok/s. Vanilla llama.cpp crashes on both.

Consumer hardware (MacBook Pro, Mac Studio) ships with fast unified memory and NVMe storage, but limited capacity. A 32 GB M1 Max cannot naively load a 40 GB model — the OS will swap-thrash until the OOM killer intervenes.

Hypura solves this by understanding the model architecture:

- Norms and embeddings are tiny but accessed every token, so they are pinned to GPU.
- MoE expert routing exploits sparsity: only 2 of 8 experts fire per token. Router interception identifies the selected experts in the eval callback, then loads only the needed expert strides from NVMe (75% I/O reduction). A neuron cache tracks loaded expert slices across tokens, achieving a 99.5% hit rate from temporal locality.
- Dense FFN weights (gate, up, down; about 60% of model size) stream from NVMe through a dynamically sized pool buffer while attention and norms stay GPU-resident.
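The expert-routing idea above can be sketched in a few lines. This is an illustrative model only, not Hypura's implementation: the names `top_k_experts`, `ExpertCache`, and `io_reduction` are hypothetical, and the LRU eviction policy is an assumption, since the README does not say how the neuron cache evicts entries.

```python
from collections import OrderedDict

NUM_EXPERTS = 8  # experts per MoE layer (Mixtral 8x7B, per the text above)
TOP_K = 2        # experts that fire per token


def top_k_experts(router_logits, k=TOP_K):
    """Return the indices of the k largest router logits, best first."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    return order[:k]


def io_reduction(num_experts=NUM_EXPERTS, k=TOP_K):
    """Fraction of expert I/O avoided by loading only the selected experts."""
    return 1 - k / num_experts


class ExpertCache:
    """LRU cache keyed by (layer, expert). LRU is an assumed policy."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, layer, expert, load_fn):
        key = (layer, expert)
        if key in self.entries:
            # Hit: mark as most recently used, no storage read needed.
            self.entries.move_to_end(key)
            self.hits += 1
            return self.entries[key]
        # Miss: load_fn stands in for the NVMe read of one expert slice.
        self.misses += 1
        value = load_fn(layer, expert)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return value

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

With 2 of 8 experts selected, `io_reduction()` gives the 75% figure quoted above. The hit rate a cache like this reaches depends on how often consecutive tokens route to the same experts, which this sketch does not model.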

The result: models that would crash your machine under naive mmap become runnable. Models that fit in memory run at full Metal GPU speed with zero overhead.

Hypura reads the GGUF file, profiles your hardware (GPU working set, RAM, NVMe bandwidth), and solves a placement optimization that assigns every tensor to a tier:

- GPU (Metal): Attention layers, norms, embeddings.
- RAM: Overflow layers that don't fit in the GPU working set.
- NVMe: Remaining layers loaded on demand via direct I/O, prefetched ahead of the forward pass.
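To make the tiering concrete, here is a minimal greedy sketch of assigning tensors to GPU, RAM, or NVMe. It is a deliberately simplified stand-in for the placement optimization described above, not Hypura's solver: the `TensorInfo` type, the `priority` field, and the `place` function are hypothetical, and a real scheduler would also weigh bandwidth costs and access patterns.

```python
from dataclasses import dataclass


@dataclass
class TensorInfo:
    name: str
    size_gb: float
    priority: int  # hypothetical: lower means more latency-sensitive


def place(tensors, gpu_budget_gb, ram_budget_gb):
    """Greedily assign each tensor to the fastest tier with room left."""
    placement = {}
    gpu_left = gpu_budget_gb
    ram_left = ram_budget_gb
    # Most latency-sensitive tensors first; smaller ones first within a class.
    for t in sorted(tensors, key=lambda t: (t.priority, t.size_gb)):
        if t.size_gb <= gpu_left:
            placement[t.name] = "gpu"
            gpu_left -= t.size_gb
        elif t.size_gb <= ram_left:
            placement[t.name] = "ram"
            ram_left -= t.size_gb
        else:
            # Anything that fits nowhere else is streamed from storage.
            placement[t.name] = "nvme"
    return placement
```

For example, with a 2 GB GPU budget and a 4 GB RAM budget, a 0.1 GB norm tensor and a 1.5 GB attention tensor land on the GPU, the first 3 GB FFN tensor lands in RAM, and the next one falls through to NVMe.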

| Model | Size | GPU | NVMe | Mode | Hypura | llama.cpp | Notes |
|---|---|---|---|---|---|---|---|
| Qwen 2.5 14B | 8.4 GB | 8.4 GB | - | full-resident | 21 tok/s | ~21 tok/s | Fits in GPU |
| Mixtral 8x7B | 30.9 GB | 1.1 GB | 29.8 GB | expert-streaming | 2.2 tok/s | OOM | 99.5% cache hit |
| Llama 3.3 70B | 39.6 GB | 7.8 GB | 31.8 GB | dense-FFN-streaming | 0.3 tok/s | OOM | Dynamic prefetch |

Hypura exposes an Ollama-compatible HTTP API, making it a drop-in replacement for any tool that talks to Ollama.

Regarding SSD wear: Hypura only reads from your SSD during inference and never writes to it. Reads do not degrade flash cells, making it safe for long-term use.
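Since the HTTP API is described as Ollama-compatible, a client written against Ollama's `/api/generate` endpoint should in principle work unchanged. The sketch below uses only the Python standard library. The base URL, port, and model name in the usage note are placeholders, and whether Hypura implements this particular endpoint is an assumption based on the compatibility claim, not something the text confirms.

```python
import json
import urllib.request


def build_generate_request(base_url, model, prompt):
    """Build a non-streaming POST request in Ollama's /api/generate format."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url, model, prompt, timeout=600):
    """Send the request and return the generated text.

    The long default timeout is a guess: streamed models run at well
    under one token per second in the benchmarks above.
    """
    req = build_generate_request(base_url, model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        # Ollama's non-streaming reply carries the text under "response".
        return json.loads(resp.read())["response"]
```

With a server running, a call would look like `generate("http://127.0.0.1:8080", "mixtral", "Hello")`, where the address and model name are hypothetical and should be replaced with whatever the server reports at startup.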

Source: Hacker News