Running Google Gemma 4 Locally with LM Studio's New Headless CLI and Claude Code

LM Studio 0.4.0 introduces the 'lms' CLI, enabling seamless local execution of models like Google Gemma 4 on macOS without a GUI. This guide covers setting up the 26B-A4B model for optimized coding workflows with Claude Code.
Running Google Gemma 4 Locally With LM Studio’s New Headless CLI & Claude Code
LM Studio 0.4.0 introduced llmster and the lms CLI. Here is how I set up Gemma 4 26B for local inference on macOS that can be used with Claude Code.
Why run models locally?
Cloud AI APIs are great until they are not. Rate limits, usage costs, privacy concerns, and network latency all add up. For quick tasks like code review, drafting, or testing prompts, a local model that runs entirely on your hardware has real advantages: zero API costs, no data leaving your machine, and consistent availability.
Google’s Gemma 4 is interesting for local use because of its mixture-of-experts architecture. The 26B parameter model only activates 4B parameters per forward pass, which means it runs well on hardware that could never handle a dense 26B model. On my 14” MacBook Pro M4 Pro with 48 GB of unified memory, it fits comfortably and generates at 51 tokens per second. Though there’s significant slowdowns when used within Claude Code from my experience.
The Gemma 4 model family
Google released Gemma 4 as a family of four models, not just one. The lineup spans a wide range of hardware targets:
The “E” models (E2B, E4B) use Per-Layer Embeddings to optimize for on-device deployment and are the only variants that support audio input (speech recognition and translation). The 31B dense model is the most capable, scoring 85.2% on MMLU Pro and 89.2% on AIME 2026.
Why I picked the 26B-A4B. The mixture-of-experts architecture is the key. It has 128 experts plus 1 shared expert, but only activates 8 experts (3.8B parameters) per token. A common rule of thumb estimates MoE dense - equivalent quality as roughly sqrt(total x active parameters), which puts this model around 10B effective. In practice, it delivers inference cost comparable to a 4B dense model with quality that punches well above that weight class. On benchmarks, it scores 82.6% on MMLU Pro and 88.3% on AIME 2026, close to the dense 31B (85.2% and 89.2%) while running dramatically faster.
The chart below tells the story. It plots Elo score against total model size on a log scale for recent open-weight models with thinking enabled. The blue-highlighted region in the upper left is where you want to be: high performance, small footprint.
Gemma 4 26B-A4B (Elo ~1441) sits firmly in that zone, punching well above its 25.2B parameter weight. The 31B dense variant scores slightly higher (~1451) but is still remarkably compact. For context, models like Qwen 3.5 397B-A17B (~1450 Elo) and GLM-5 (~1457 Elo) need 100-600B total parameters to reach similar scores. Kimi-K2.5 (~1457 Elo) requires over 1,000B. The 26B-A4B achieves competitive Elo with a fraction of the parameters, which translates directly into lower memory requirements and faster local inference.
This is what makes MoE models transformative for local use. You do not need a cluster or a high-end GPU rig to run a model that competes with 400B+ parameter behemoths. A laptop with 48 GB of unified memory is enough.
For local inference on a 48 GB Mac, this is the sweet spot. The dense 31B would consume more memory and generate tokens slower because every parameter participates in every forward pass. The E4B is lighter but noticeably less capable. The 26B-A4B gives you 256K max context, vision support (useful for analyzing screenshots and diagrams), native function/tool calling, and reasoning with configurable thinking modes, all at 51 tokens/second on my hardware.
What changed in LM Studio 0.4.0
LM Studio has been a popular desktop app for running local models for a while. Version 0.4.0 changed the architecture fundamentally by introducing llmster, the core inference engine extracted from the desktop app and packaged as a standalone server.
The practical result: you can now run LM Studio entirely from the command line using the lms CLI. No GUI required. This makes it usable on headless servers, in CI/CD pipelines, SSH sessions, or just for developers who prefer staying in the terminal.
Key additions in 0.4.0:
llmster daemon: a background service that manages model loading and inference without the desktop app
lms CLI: full command-line interface for downloading, loading, chatting, and serving models
Parallel request processing: continuous batching instead of sequential queuing, so multiple requests to the same model run concurrently
Stateful REST API: a new /v1/chat endpoint that maintains conversation history across requests
MCP integration: local Model Context Protocol support with permission-key gating
Installation
Install the lms CLI with a single command:
# Linux/Mac
curl -fsSL https://lmstudio.ai/install.sh | bash
# Windows
irm https://lmstudio.ai/install.ps1 | iex
Then start the headless daemon:
lms daemon up
On macOS, update both inference runtimes:
lms runtime update llama.cpp
lms runtime update mlx
Downloading Gemma 4
With the daemon running, download Google’s Gemma 4 26B model:
lms get google/gemma-4-26b-a4b
The CLI shows you the variant it will download (Q4_K_M quantization by default, 17.99 GB) and asks for confirmation:
↓ To download: model google/gemma-4-26b-a4b - 64.75 KB
└─ ↓ To download: Gemma 4 26B A4B Instruct Q4_K_M [GGUF] - 17.99 GB
About to download 17.99 GB.
? Start download?
❯ Yes
No
Change variant selection
If you already have the model, the CLI tells you and shows the load command:
✔ Start download? yes
Model already downloaded. To use, run: lms load google/gemma-4-26b-a4b
Checking your local model library
List all downloaded models:
lms ls
You have 10 models, taking up 118.17 GB of disk space.
LLM PARAMS ARCH SIZE DEVICE
gemma-3-270m-it-mlx 270m gemma3_text 497.80 MB Local
google/gemma-4-26b-a4b (1 variant) 26B-A4B gemma4 17.99 GB Local
gpt-oss-20b-mlx 20B gpt_oss 22.26 GB Local
llama-3.2-1b-instruct 1B Llama 712.58 MB Local
nvidia/nemotron-3-nano (1 variant) 30B nemotron_h 17.79 GB Local
openai/gpt-oss-20b (1 variant) 20B gpt-oss 12.11 GB Local
qwen/qwen3.5-35b-a3b (1 variant) 35B-A3B qwen35moe 22.07 GB Local
qwen2.5-0.5b-instruct-mlx 0.5B Qwen2 293.99 MB Local
zai-org/glm-4.7-flash (1 variant) 30B glm4_moe_lite 24.36 GB Local
EMBEDDING PARAMS ARCH SIZE DEVICE
text-embedding-nomic-embed-text-v1.5 Nomic BERT 84.11 MB Local
Worth noting: several of these models use mixture-of-experts architectures (Gemma 4, Qwen 3.5, GLM 4.7 Flash). MoE models punch above their weight for local inference because only a fraction of parameters activate per token.
Running an interactive chat
Start a chat session with stats enabled to see performance numbers:
lms chat google/gemma-4-26b-a4b --stats
╭─────────────────────────────────────────────────╮
│ 👾 lms chat │
│ Type exit or Ctrl+C to quit │
│ │
│ Chatting with google/gemma-4-26b-a4b │
│ │
│ Try one of the following commands: │
│ /model - Load a model (type /model to see list) │
│ /download - Download a model │
│ /clear - Clear the chat history │
│ /help - Show help information │
╰─────────────────────────────────────────────────╯
With --stats, you get prediction metrics after each response:
Prediction Stats:
Stop Reason: eosFound
Tokens/Second: 51.35
Time to First Token: 1.551s
Prompt Tokens: 39
Predicted Tokens: 176
Total Tokens: 215
51 tokens/second on a 14” MacBook Pro M4 Pro (48 GB) with a 26B model is solid. Time to first token at 1.5 seconds is responsive enough for interactive use.
Checking loaded models and memory
See what is currently loaded:
lms ps
IDENTIFIER MODEL STATUS SIZE CONTEXT PARALLEL DEVICE TTL
google/gemma-4-26b-a4b google/gemma-4-26b-a4b IDLE 17.99 GB 48000 2 Local 60m / 1h
The model occupies 17.99 GB in memory with a 48K context window and supports 2 parallel requests. The TTL (time-to-live) auto-unloads the model after 1 hour of idle time, freeing memory without manual intervention.
Source: Hacker News













