NOW LET US – AI RAG SaaS Studio TP.HCM
NOW LET US
Digital Product Studio
Back to news
DEV-TOOLS...1 min read

Hypura – A storage-tier-aware LLM inference scheduler for Apple Silicon

Share
NOW LET US Article – Hypura – A storage-tier-aware LLM inference scheduler for Apple Silicon

Hypura is a storage-tier-aware LLM inference scheduler for Apple Silicon that enables running models exceeding physical memory by intelligently placing tensors across GPU, RAM, and NVMe tiers.

_ _
| | | |_ _ _ __ _ _ _ __ __ _
| |_| | | | | '_ \| | | | '__/ _` |
| _ | |_| | |_) | |_| | | | (_| |
|_| |_|\__, | .__/ \__,_|_| \__,_|
|___/|_|
Run models too big for your Mac's memory

Hypura is a storage-tier-aware LLM inference scheduler for Apple Silicon. It places model tensors across GPU, RAM, and NVMe tiers based on access patterns, bandwidth costs, and hardware capabilities — enabling models that exceed physical memory to run without crashing the system.

Run a 31 GB Mixtral 8x7B on a 32 GB Mac Mini at 2.2 tok/s. A 40 GB Llama 70B at 0.3 tok/s. Vanilla llama.cpp crashes on both.

Consumer hardware (MacBook Pro, Mac Studio) ships with fast unified memory and NVMe storage, but limited capacity. A 32 GB M1 Max cannot naively load a 40 GB model — the OS will swap-thrash until the OOM killer intervenes.

Hypura solves this by understanding the model architecture:

  • Norms and embeddings are tiny but accessed every token — pinned to GPU.
  • MoE expert routing exploits sparsity — only 2 of 8 experts fire per token. Router interception identifies selected experts in the eval callback, then loads only the needed expert strides from NVMe (75% I/O reduction). A neuron cache tracks loaded expert slices across tokens, achieving 99.5% hit rate from temporal locality.
  • Dense FFN weights (gate, up, down — ~60% of model size) stream from NVMe through a dynamically-sized pool buffer while attention + norms stay GPU-resident.

Hypura selects the best inference mode automatically based on model size, architecture, and available memory. It profiles your hardware (GPU working set, RAM, NVMe bandwidth) and solves a placement optimization that assigns every tensor to a tier: GPU (Metal), RAM, or NVMe (loaded on-demand via direct I/O).

For models that fit in memory, Hypura adds zero overhead. For models that don't fit, Hypura is the difference between "runs" and "crashes." It also exposes an Ollama-compatible HTTP API, making it a drop-in replacement for any tool that talks to Ollama.

© 2026 Now Let Us. All rights reserved.

Source: Hacker News

Advertisement
Ad slot ready: 5887729102

More in this category

EXPLORE TOPICS

Discover All Categories

Deep dive into the specific technology sectors that matter most to you.