Stop Using Ollama

Ollama gained popularity as a convenient wrapper for local LLMs, but it has systematically obscured its reliance on llama.cpp, suffered performance regressions, and misled users with confusing model naming.

Ollama is the most popular way to run local LLMs. It shouldn’t be. It gained that position by being first, the first tool that made llama.cpp accessible to people who didn’t want to compile C++ or write their own server configs. That was a real contribution, briefly. But the project has since spent years systematically obscuring where its actual technology comes from, misleading users about what they’re running, and drifting from the local-first mission that earned it trust in the first place. All while taking venture capital money.

This isn’t a “both sides” piece. I’ve used Ollama. I’ve moved on. Here’s why you should too.

A llama.cpp Wrapper With Amnesia

Ollama’s entire inference capability comes from llama.cpp, the C++ inference engine created by Georgi Gerganov in March 2023. Gerganov’s project is what made it possible to run LLaMA models on consumer laptops at all, he hacked together the first version in an evening, and it kicked off the entire local LLM movement. Today llama.cpp has over 100,000 stars on GitHub, 450+ contributors, and is the foundation that nearly every GGUF-based tool depends on.

Ollama was founded in 2021 by Jeffrey Morgan and Michael Chiang, both previously behind Kitematic, a Docker GUI that was acquired by Docker Inc. They went through Y Combinator’s Winter 2021 batch, raised pre-seed funding, and launched publicly in 2023. From day one, the pitch was “Docker for LLMs”, a convenient wrapper that downloads and runs models with a single command. Under the hood, it was llama.cpp doing all the work.

For over a year, Ollama’s README contained no mention of llama.cpp. Not in the README, not on the website, not in their marketing materials. The project’s binary distributions didn’t include the required MIT license notice for the llama.cpp code they were shipping. This isn’t a matter of open-source etiquette, the MIT license has exactly one major requirement: include the copyright notice. Ollama didn’t.

The community noticed. GitHub issue #3185 was opened in early 2024 requesting license compliance. It went over 400 days without a response from maintainers. When issue #3697 was opened in April 2024 specifically requesting llama.cpp acknowledgment, community PR #3700 followed within hours. Ollama’s co-founder Michael Chiang eventually added a single line to the bottom of the README: “llama.cpp project founded by Georgi Gerganov.”

The response to the PR was revealing. Ollama’s team wrote: “We spend a large chunk of time fixing and patching it up to ensure a smooth experience for Ollama users… Overtime, we will be transitioning to more systematically built engines.” Translation: we’re not going to give llama.cpp prominent credit, and we plan to distance ourselves from it anyway.

As one Hacker News commenter put it: “I’m continually puzzled by their approach, it’s such self-inflicted negative PR. Building on llama is perfectly valid and they’re adding value on ease of use here. Just give the llama team proper credit.” Another: “The fact that Ollama has been downplaying their reliance on llama.cpp has been known in the local LLM community for a long time.”

The Fork That Made Things Worse

In mid-2025, Ollama followed through on that distancing. They moved away from using llama.cpp as their inference backend and built a custom implementation directly on top of ggml, the lower-level tensor library that llama.cpp itself uses. Their stated reason was stability, llama.cpp moves fast and breaks things, and Ollama’s enterprise partners need reliability.

The result was the opposite. Ollama’s custom backend reintroduced bugs that llama.cpp had solved years ago. Community members flagged broken structured output support, vision model failures, and GGML assertion crashes across multiple versions. Models that worked fine in upstream llama.cpp failed in Ollama, including new releases like GPT-OSS 20B, where Ollama’s implementation lacked support for tensor types that the model required. Georgi Gerganov himself identified that Ollama had forked and made bad changes to GGML.

The irony is thick. They downplayed their dependence on llama.cpp for years, then when they finally tried to go it alone, they produced an inferior version of the thing they refused to credit.

Benchmarks tell the story. Multiple community tests show llama.cpp running 1.8x faster than Ollama on the same hardware with the same model, 161 tokens per second versus 89. On CPU, the gap is 30-50%. A recent comparison on Qwen-3 Coder 32B showed ~70% higher throughput with llama.cpp. The performance overhead comes from Ollama’s daemon layer, poor GPU offloading heuristics, and a vendored backend that trails upstream.

Misleading Model Naming

When DeepSeek released its R1 model family in January 2025, Ollama listed the smaller distilled versions, models like DeepSeek-R1-Distill-Qwen-32B, which are fine-tuned Qwen and Llama models, not the actual 671-billion-parameter R1, simply as “DeepSeek-R1” in their library and CLI. Running ollama run deepseek-r1 pulls an 8B Qwen-derived distillate that behaves nothing like the real model.

This wasn’t an oversight. DeepSeek themselves named these models with the “R1-Distill” prefix. Hugging Face listed them correctly. Ollama stripped the distinction. The result was a flood of social media posts from people claiming they were running “DeepSeek-R1” on consumer hardware, followed by confusion about why it performed poorly, doing reputational damage to DeepSeek in the process.

GitHub issues #8557 and #8698 requested separation of the models. Both were closed as duplicates with no fix. As of today, ollama run deepseek-r1 still launches a tiny distilled model. Ollama knew the difference and chose to obscure it, presumably because “DeepSeek-R1” drives more downloads than “DeepSeek-R1-Distill-Qwen-32B” does.

The Closed-Source App

In July 2025, Ollama released a GUI desktop app for macOS and Windows. The app was developed in a private repository (github.com/ollama/app), shipped without a license, and the source code wasn’t publicly available. For a project that had built its reputation on being open-source, this was a jarring move.

Community members immediately raised concerns. The license issue received 40 upvotes. Developers found potential AGPL-3.0 dependencies in the binary. The website placed the download button next to a GitHub link, giving the impression users were downloading the MIT-licensed open-source tool when they were actually getting an unlicensed closed-source application. Maintainers were silent for months. The code was eventually merged into the main repo in November 2025, but the initial rollout revealed where the project’s instincts lie.

As XDA put it: “If your project trades on being open source, you do not get to be vague about what is and is not open at launch.”

The Modelfile: Reinventing a Solved Problem

GGUF, the model format created by Georgi Gerganov, was designed with one core principle: single-file deployment. Bullet point #1 in the GGUF spec reads: “Full information: all information needed to load a model is contained in the model file, and no additional information needs to be provided by the user.” Chat templates, stop tokens, model metadata, it’s all embedded in the file. You point llama.cpp at a GGUF and it works.

Ollama added the Modelfile on top of this. It’s a separate configuration file, inspired by Dockerfiles, naturally, that specifies the base model, chat template, system prompt, sampling parameters, and stop tokens. Most of this information already exists inside the GGUF file. As one Hacker News commenter put it: “We literally just got rid of that multi-file chaos only for Ollama to add it back.”

The problems with this approach compound quickly. Ollama only auto-detects chat templates it already knows about from a hardcoded list. If a GGUF file has a valid Jinja chat template embedded in its metadata but it doesn’t match one of Ollama’s known templates, Ollama fails.

Source: Hacker News