DEV-TOOLS · March 24, 2026 · 1 min read · 16 views

Hypura – A storage-tier-aware LLM inference scheduler for Apple Silicon

Hypura is a storage-tier-aware LLM inference scheduler for Apple Silicon that enables running models exceeding physical memory by intelligently placing tensors across GPU, RAM, and NVMe tiers.

Run models too big for your Mac's memory

Hypura is a storage-tier-aware LLM inference scheduler for Apple Silicon. It places model tensors across GPU, RAM, and NVMe tiers based on access patterns, bandwidth costs, and hardware capabilities — enabling models that exceed physical memory to run without crashing the system.

Run a 31 GB Mixtral 8x7B on a 32 GB Mac Mini at 2.2 tok/s. A 40 GB Llama 70B at 0.3 tok/s. Vanilla llama.cpp crashes on both.

Consumer hardware (MacBook Pro, Mac Studio) ships with fast unified memory and NVMe storage, but limited capacity. A 32 GB M1 Max cannot naively load a 40 GB model — the OS will swap-thrash until the OOM killer intervenes.

Hypura solves this by understanding the model architecture:

Norms and embeddings are tiny but accessed every token — pinned to GPU.
MoE expert routing exploits sparsity: only 2 of 8 experts fire per token. Router interception identifies the selected experts in the eval callback, then loads only the needed expert strides from NVMe (a 75% I/O reduction, since 6 of the 8 experts are never read). A neuron cache tracks loaded expert slices across tokens, achieving a 99.5% hit rate from temporal locality.
Dense FFN weights (gate, up, down — ~60% of model size) stream from NVMe through a dynamically-sized pool buffer while attention + norms stay GPU-resident.
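
The expert-sparsity arithmetic and the cache idea above can be sketched in a few lines. This is a minimal illustration, not Hypura's implementation: the `ExpertCache` class, its LRU eviction policy, and the `load_fn` callback are all hypothetical names chosen for the example, and the source does not say which eviction policy the neuron cache uses.

```python
from collections import OrderedDict


def io_reduction(experts_total, experts_active):
    """Fraction of expert I/O avoided when only the routed experts are read."""
    return 1.0 - experts_active / experts_total


class ExpertCache:
    """Hypothetical LRU cache of expert slices keyed by (layer, expert_id)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key, load_fn):
        # Hit: mark the slice as most recently used and return it.
        if key in self.slots:
            self.slots.move_to_end(key)
            self.hits += 1
            return self.slots[key]
        # Miss: load the slice (e.g. from NVMe) and evict the oldest if full.
        self.misses += 1
        value = load_fn(key)
        self.slots[key] = value
        if len(self.slots) > self.capacity:
            self.slots.popitem(last=False)
        return value

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


# 2 of 8 experts active per token means 75% of expert reads are skipped.
print(io_reduction(8, 2))
```

A cache like this only pays off if consecutive tokens tend to route to the same experts, which is the temporal locality the article credits for the high hit rate.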

Hypura selects the best inference mode automatically based on model size, architecture, and available memory. It profiles your hardware (GPU working set, RAM, NVMe bandwidth) and solves a placement optimization that assigns every tensor to a tier: GPU (Metal), RAM, or NVMe (loaded on-demand via direct I/O).
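
To make the idea of tier placement concrete, here is a rough sketch using a simple greedy heuristic: rank tensors by accesses per gigabyte and fill the fastest tier first. This is an assumption made for illustration only. The source says Hypura solves a placement optimization but does not describe the solver, and the `Tensor` fields, the `place` function, and the example sizes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Tensor:
    name: str
    size_gb: float
    accesses_per_token: float  # hypothetical access-frequency estimate


def place(tensors, gpu_gb, ram_gb):
    """Greedily assign each tensor to 'gpu', 'ram', or 'nvme'."""
    remaining = {"gpu": gpu_gb, "ram": ram_gb}
    placement = {}
    # Hottest data per byte goes first so it lands in the fastest tier.
    ranked = sorted(
        tensors,
        key=lambda t: t.accesses_per_token / t.size_gb,
        reverse=True,
    )
    for t in ranked:
        for tier in ("gpu", "ram"):
            if t.size_gb <= remaining[tier]:
                remaining[tier] -= t.size_gb
                placement[t.name] = tier
                break
        else:
            # Nothing faster has room: stream this tensor from NVMe on demand.
            placement[t.name] = "nvme"
    return placement


# Illustrative sizes only, not measurements from any real model.
model = [
    Tensor("norms", 0.1, 1.0),
    Tensor("attention", 8.0, 1.0),
    Tensor("ffn", 24.0, 1.0),
    Tensor("experts", 20.0, 0.25),
]
print(place(model, gpu_gb=10.0, ram_gb=24.0))
```

A real scheduler would also weigh the bandwidth of each tier, as the article notes, rather than capacity alone.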

For models that fit in memory, Hypura adds zero overhead. For models that don't fit, Hypura is the difference between "runs" and "crashes." It also exposes an Ollama-compatible HTTP API, making it a drop-in replacement for any tool that talks to Ollama.
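
Because the API is described as Ollama-compatible, an existing Ollama client should in principle work unchanged. The sketch below builds a request for Ollama's `/api/generate` endpoint using only the Python standard library. The base URL (Ollama's default port 11434), the model name, and the assumption that Hypura serves this particular endpoint are not confirmed by the source, so check the project's documentation for the actual address and supported routes.

```python
import json
import urllib.request


def build_generate_request(base_url, model, prompt):
    """Build a non-streaming POST request in the shape Ollama's API expects."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url, model, prompt):
    """Send the request and return the generated text (needs a running server)."""
    req = build_generate_request(base_url, model, prompt)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


# Assumed address and model name, for illustration only.
req = build_generate_request("http://localhost:11434", "mixtral", "Hello")
print(req.full_url)
```

Calling `generate(...)` with the same arguments would perform the actual request once a compatible server is listening at that address.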

Source: Hacker News
