NOW LET US – AI RAG SaaS Studio TP.HCM

Xiaomi's new open source, agentic AI coding harness MiMo Code beats Claude Code at ultra-long, 200+ step tasks


Xiaomi's MiMo AI team has open-sourced MiMo Code V0.1.0, a terminal-native AI coding assistant that the Chinese electronics giant says outperforms Anthropic's Claude Code on key agentic coding benchmarks, especially on long-horizon, multi-step tasks (200+ steps). For now, that claim rests on the company's own benchmark runs and an internal beta evaluation involving 576 developers.

It's also bundling limited-time free access to MiMo-V2.5, its multimodal flagship model with a million-token context window, requiring no registration to get started.

The release was announced June 10, 2026 in a post on the social network X from the official @XiaomiMiMo account, which described the tool as "more than an AI coding assistant in your terminal — it's the smartest coding partner you'll ever work with."

MiMo Code is available now on GitHub under an MIT license, and installs with a single terminal command (curl -fsSL https://mimo.xiaomi.com/install | bash) on macOS and Linux or via npm (npm install -g @mimo-ai/cli) on Windows.

The project is a fork of the open-source OpenCode agent, which Xiaomi has extended with its own memory architecture, workflow modes, and model harness.

The end of AI coding agents' amnesia?

As any avid vibe coder would surely attest, AI coding agents degrade over long working sessions: as the context window fills, earlier decisions, conventions, and task state get compacted away or lost entirely, forcing developers to re-explain their projects.

Xiaomi argues that relying on context compaction alone is doomed at scale. "What we need is not better compression, but an explicit storage-and-retrieval mechanism that decides what information should be written into persistent structures, and when it should be recalled," the MiMo team noted in their launch blog.

MiMo Code attacks this with a cross-session memory system, powered under the hood by SQLite FTS5 full-text search, that spans four layers: project memory (a persistent MEMORY.md file), session checkpoints, scratch notes, and per-task progress logs.
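To make the storage-and-retrieval idea concrete, here is a minimal sketch of notes written into and recalled from an SQLite FTS5 index. It is not Xiaomi's implementation: the table layout, the layer labels, and the `remember` and `recall` helpers are assumptions made for illustration, and the code assumes the local SQLite build includes the FTS5 extension.

```python
import sqlite3

# Hypothetical labels for the four layers the article describes.
LAYERS = ("project", "checkpoint", "scratch", "progress")


def open_store(path=":memory:"):
    """Create an FTS5-backed note store (requires SQLite built with FTS5)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS notes "
        "USING fts5(layer UNINDEXED, session UNINDEXED, body)"
    )
    return conn


def remember(conn, layer, session, body):
    """Write one note into a named layer."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    conn.execute(
        "INSERT INTO notes (layer, session, body) VALUES (?, ?, ?)",
        (layer, session, body),
    )


def recall(conn, query, limit=3):
    """Return matching notes, best match first (FTS5's built-in rank)."""
    rows = conn.execute(
        "SELECT layer, session, body FROM notes "
        "WHERE notes MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()


store = open_store()
remember(store, "project", "s1", "Use pytest for all tests; no unittest.")
remember(store, "scratch", "s2", "Migration script fails on empty tables.")
hits = recall(store, "pytest")
```

Only the `body` column is indexed here; the layer and session labels are stored alongside it so a recalled note can be traced back to where it was written. A real tool would also need to escape user-supplied queries, since FTS5 treats some characters as query syntax.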

The note-taking is key here. Rather than forcing the primary coding agent to pause its work to take notes, the system deploys an independent "checkpoint-writer" subagent.

Think of the primary coding agent as a construction contractor working to build a massive mansion alongside a dedicated architect, the checkpoint-writer subagent. While the main agent focuses on building out the physical structure, the subagent updates the blueprints in real time, noting decisions, issues, and the actual lay of the land as the construction project progresses.

When the context window approaches its limit (the contractor getting lost in the half-built mansion), the main agent can consult the subagent's notes and find its place again. In MiMo Code's case, the system rebuilds the working context from structured checkpoints, so the agent keeps its operational momentum.
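The rebuild step can be pictured with a short sketch. Everything in it is hypothetical: the `Checkpoint` fields, the 90% threshold, and the choice to keep only the two newest checkpoints are illustrative assumptions, not details Xiaomi has published.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """One structured note from the subagent; field names are hypothetical."""
    step: int
    decisions: list = field(default_factory=list)
    open_issues: list = field(default_factory=list)


def near_limit(used_tokens, window_tokens, threshold=0.9):
    """True once usage reaches the chosen fraction of the context window."""
    return used_tokens >= threshold * window_tokens


def rebuild_context(task, checkpoints, keep_last=2):
    """Build a compact prompt from the task plus the newest checkpoints."""
    recent = sorted(checkpoints, key=lambda c: c.step)[-keep_last:]
    lines = [f"TASK: {task}"]
    for cp in recent:
        lines.append(f"[step {cp.step}]")
        lines += [f"  decided: {d}" for d in cp.decisions]
        lines += [f"  open: {i}" for i in cp.open_issues]
    return "\n".join(lines)


# Illustrative checkpoints, not real session data.
history = [
    Checkpoint(40, decisions=["use SQLite for storage"]),
    Checkpoint(120, decisions=["split parser into modules"],
               open_issues=["flaky test"]),
    Checkpoint(210, open_issues=["migrate config loader"]),
]

prompt = (
    rebuild_context("refactor the importer", history)
    if near_limit(used_tokens=950_000, window_tokens=1_000_000)
    else None
)
```

The point of the sketch is the shape of the trade: the agent gives up the raw transcript and restarts from a much smaller, structured summary of decisions and open issues.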

Two self-improvement mechanisms round out the system. A /dream command periodically (roughly every seven days) reviews historical sessions, deduplicates them, and compresses them into long-term memory. A "distill" function mines past sessions for repeated workflows that can be automated, an approach similar to ones OpenAI and Anthropic have taken recently with their own models.
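Xiaomi has not described how "distill" works internally. As a rough illustration of mining sessions for repeated workflows, one naive approach is to count command sequences that recur across sessions. The function name, the sequence length, and the sample sessions below are all hypothetical.

```python
from collections import Counter


def repeated_workflows(sessions, length=3, min_sessions=2):
    """Find command sequences of a given length that recur across sessions."""
    seen_in = Counter()
    for commands in sessions:
        # Count each sequence once per session so one long session
        # cannot dominate the tally.
        grams = {
            tuple(commands[i:i + length])
            for i in range(len(commands) - length + 1)
        }
        seen_in.update(grams)
    return [seq for seq, n in seen_in.most_common() if n >= min_sessions]


# Illustrative command logs, not real session data.
sessions = [
    ["git pull", "npm install", "npm test", "git push"],
    ["git pull", "npm install", "npm test"],
    ["ls", "cat README.md"],
]

candidates = repeated_workflows(sessions)
```

Here the only sequence seen in at least two sessions is the pull, install, test run, which makes it a candidate for automation.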

Impressive performance on software engineering (SWE) benchmarks

According to benchmark figures published in Xiaomi's technical blog post, MiMo Code paired with MiMo-V2.5-Pro outperformed Claude Code paired with Claude Sonnet 4.6 on all three evaluations tested:

  • SWE-bench Verified: 82% vs. 79%
  • SWE-bench Pro: 62% vs. 55%
  • Terminal Bench 2: 73% vs. 69%

The harness itself accounts for a measurable share of the gain. Running the same MiMo-V2.5-Pro model in both harnesses, MiMo Code scored 62% on SWE-bench Pro versus 57% for Claude Code, and 73% on Terminal Bench 2 versus 68% — roughly five points each, attributable purely to the agent system rather than the model.
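The arithmetic behind that five-point figure is simple enough to check directly. The scores below are the same-model numbers Xiaomi reported, as quoted above; the variable names are illustrative.

```python
# Same-model scores (percent) as reported by Xiaomi.
same_model_scores = {
    "SWE-bench Pro": {"MiMo Code": 62, "Claude Code": 57},
    "Terminal Bench 2": {"MiMo Code": 73, "Claude Code": 68},
}

# With the model held constant, the difference is the harness contribution.
harness_gain = {
    bench: scores["MiMo Code"] - scores["Claude Code"]
    for bench, scores in same_model_scores.items()
}

for bench, gain in harness_gain.items():
    print(f"{bench}: +{gain} points from the harness")
```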

Xiaomi notably did not publish comparisons against OpenAI's Codex or Google's Gemini CLI — Claude Code is the sole named competitor throughout its materials, a telling choice of benchmark target.

Independent reference points suggest why. On the official Terminal-Bench 2.0 leaderboard maintained at tbench.ai, OpenAI's Codex CLI running GPT-5.5 scores 82.2% — roughly nine points above MiMo Code's self-reported 73% — and OpenAI's own GPT-5.5 announcement claims 82.7% on the same benchmark.

On SWE-bench Pro, however, the picture flips: OpenAI reports GPT-5.5 at 58.6%, below MiMo Code + MiMo-V2.5-Pro's claimed 62%. (MiMo Code does not yet appear on either official leaderboard, and cross-comparing self-run numbers against leaderboard submissions carries the usual configuration caveats.)

Perhaps more interesting than the offline benchmarks: Xiaomi says it ran a human double-blind A/B evaluation during its internal beta, covering 576 developers working in 474 real private repositories, producing 1,213 judged head-to-head pairs against Claude Code using the same target model.

Under 200 execution steps, the two systems split roughly 50/50 — but past 200 steps, MiMo Code's win rate rose above 65%, supporting the company's thesis that its memory and state-management architecture pays off specifically on long-horizon work.

Xiaomi itself concedes the standard benchmarks "still measure one-shot problem-solving ability" and don't capture the tool's multi-session design goals.

As always, these are vendor self-reported numbers that haven't been independently verified, and head-to-head harness comparisons are sensitive to configuration. But the claims are consistent with a broader industry pattern: scaffolding and harness engineering are becoming as important as raw model capability in agentic coding performance.

Easy integration with existing developer systems and voice control

From a user experience standpoint, MiMo Code is designed to live where developers already work. It operates directly in the terminal, reading and writing files, running commands, and managing Git.

Out of the box, the tool requires zero configuration, connecting automatically to "MiMo Auto," a free-for-a-limited-time channel powered by Xiaomi's multimodal MiMo-V2.5 model and its million-token context window. For developers migrating from Claude Code, MiMo Code automatically imports MCP servers, custom skills, and API configurations.

Other noteworthy features include:

  • Compose mode: Pressing Tab switches the agent into a specification-driven workflow in which the developer describes a high-level goal and the system autonomously executes the full development cycle — design, planning, coding, testing, and review — following what Xiaomi describes as a "heavy planning upfront, stable verification later" strategy.
  • Voice control: Built on Xiaomi's MiMo-ASR speech recognition with TenVAD voice activity detection, developers can dictate and modify instructions verbally and speak commands like "send" and "execute" for fully hands-free operation (available for logged-in users).

© 2026 Now Let Us. All rights reserved.

Source: VentureBeat
