Pool spare GPU capacity to run LLMs at larger scale

mesh-llm allows users to pool spare GPU capacity across multiple machines to run large language models that exceed the VRAM of a single device through automated parallelism and latency-aware distribution.

Pool spare GPU capacity to run LLMs at larger scale. Models that don't fit on one machine are automatically distributed — dense models via pipeline parallelism, MoE models via expert sharding with zero cross-node inference traffic. Have your agents gossip across the mesh — share status, findings, and questions without a central server.

Try it now — live console connected to a public mesh. Chat with models running on real hardware.

curl -fsSL https://github.com/michaelneale/mesh-llm/releases/latest/download/mesh-bundle.tar.gz | tar xz && mv mesh-bundle/* ~/.local/bin/

Then run:

mesh-llm --auto # join the best public mesh, start serving

That's it. Downloads a model for your hardware, connects to other nodes, and gives you an OpenAI-compatible API at http://localhost:9337

Or start your own:

mesh-llm --model Qwen2.5-32B # downloads model (~20GB), starts API + web console
mesh-llm --model Qwen2.5-3B # or a small model first (~2GB)

Add another machine:

mesh-llm --join <token> # token printed by the first machine

Or discover and join public meshes:

mesh-llm --auto # find and join the best mesh
mesh-llm --client --auto # join as API-only client (no GPU)

Every node gets an OpenAI-compatible API at http://localhost:9337/v1. Distribution is automatic — you just say mesh-llm --model X and the mesh figures out the best strategy:

Model fits on one machine? → runs solo, full speed, no network overhead
Dense model too big? → pipeline parallelism — layers split across nodes
MoE model too big? → expert parallelism — experts split across nodes, zero cross-node traffic

If a node has enough VRAM, it always runs the full model. Splitting only happens when it has to.

Pipeline parallelism — for dense models that don't fit on one machine, layers are distributed across nodes proportional to VRAM. llama-server runs on the highest-VRAM node and coordinates via RPC. Each rpc-server loads only its assigned layers from local disk. Latency-aware: peers are selected by lowest RTT first, with an 80ms hard cap — high-latency nodes stay in the mesh as API clients but don't participate in splits.

MoE expert parallelism — Mixture-of-Experts models (Qwen3-MoE, GLM, OLMoE, Mixtral, DeepSeek) are auto-detected from the GGUF header. The mesh reads expert routing statistics to identify which experts matter most, then assigns each node an overlapping shard: a shared core of critical experts replicated everywhere, plus unique experts distributed across nodes. Each node gets a standalone GGUF with the full trunk + its expert subset and runs its own independent llama-server — zero cross-node traffic during inference. Sessions are hash-routed to nodes for KV cache locality.

Multi-model — different nodes serve different models simultaneously. The API proxy peeks at the model field in each request and routes to the right node via QUIC tunnel. /v1/models lists everything available.

Demand-aware rebalancing — a unified demand map tracks which models the mesh wants. Demand signals propagate infectiously across all nodes and decay naturally via TTL. Standby nodes auto-promote to serve unserved models with active demand, or rebalance when one model is significantly hotter than others.

Latency design — the key insight is that HTTP streaming is latency-tolerant while RPC is latency-multiplied. llama-server always runs on the same box as the GPU. The mesh tunnels HTTP, so cross-network latency only affects time-to-first-token, not per-token throughput.

Zero-transfer GGUF loading — SET_TENSOR_GGUF tells rpc-server to read weights from local disk. Dropped model load from 111s → 5s. RPC round-trip reduction — cached get_alloc_size, skip GGUF lookups for intermediates. Per-token round-trips: 558 → 8. Direct server-to-server transfers — intermediate tensors pushed directly between rpc-servers via TCP. Speculative decoding — draft model runs locally on the host, proposes tokens verified in one batched forward pass. +38% throughput on code.

Source: Hacker News