NOW LET US – AI RAG SaaS Studio TP.HCM
NOW LET US
Digital Product Studio
Back to news
STARTUPS-VC...3 min read

Kimi K2.7-Code cuts thinking tokens 30% — but practitioners say the benchmarks don't check out

Share
NOW LET US Article – Kimi K2.7-Code cuts thinking tokens 30% — but practitioners say the benchmarks don't check out

Moonshot AI released Kimi K2.7-Code this week, claiming a 30% reduction in thinking-token usage and double-digit performance gains, but independent practitioners are already questioning the model's real-world capabilities.

Moonshot AI released Kimi K2.7-Code this week, an open-source update to its K2 coding model family, claiming leaner reasoning and double-digit performance gains.

K2.7-Code is built on the same trillion-parameter mixture-of-experts architecture as its predecessor K2.6, and drops in via an OpenAI-compatible API — which matters for teams already running K2.6 in production gateways.

When K2.6 launched in April, it topped OpenRouter's weekly LLM leaderboard — a ranking based on actual API routing decisions by developers, not self-reported benchmark scores.

Moonshot AI says K2.7-Code addresses what it calls "overthinking," reducing thinking-token usage by 30% compared to K2.6 — a number that would directly affect inference costs for teams running agentic workflows. Whether that efficiency gain holds on independent benchmarks is a question practitioners have already started raising publicly.

What Kimi K2.7-Code is

K2.7-Code is released under a Modified MIT license, with weights available on HuggingFace. The model is deployable via vLLM or SGLang. It runs exclusively in thinking mode and does not support temperature adjustment — Moonshot AI has fixed it at 1.0, meaning teams cannot tune output determinism the way they might with other models.

The core change from K2.6 is how the model generates low-level code. Where K2.6 produced implementations by wrapping existing libraries and routing through established frameworks, K2.7-Code authors implementations directly. Moonshot AI says this produces more reliable generalization across Rust, Go and Python, and across task types including frontend development, DevOps and performance optimization.

On benchmark performance, Moonshot AI claims gains of 21.8% on Kimi Code Bench v2, 11% on Program Bench and 31.5% on MLS Bench Lite. All three are proprietary benchmarks run by Moonshot AI. The model has not been submitted to DeepSWE, an independent coding benchmark that produces a 70-point spread across models — compared to SWE-Bench Pro's 30-point spread — making it a more discriminating signal for teams configuring model routing systems.

More honest, weaker for it

The picture from outside Moonshot's own benchmarks is more complicated.

Researcher Elliot Arledge ran K2.7-Code against K2.6 and Claude Fable 5 on KernelBench-Hard, a public benchmark focused on GPU kernel optimization, and published his full run logs at kernelbench.com.

"K2.7 is more honest but not more capable," Arledge wrote on X.

On five of six problems, K2.7-Code produced real authored Triton kernels where K2.6 had used library wrappers. Two of those kernels failed on the model's own bugs. The MoE kernel result regressed from K2.6's score of 0.222 to 0.157.

"Fable, for reference, tops every cell it doesn't honestly fail," Arledge wrote.

Sugumaran Balasubramaniyan, a developer who built a model-task-router for the Hermes Agent platform using DeepSWE as his reference signal, responded publicly to the K2.7-Code release and challenged Moonshot AI directly on the benchmark choices.

"Respectfully, every model 'improves' double digits on its own test suite," Balasubramaniyan wrote on X.

He noted that K2.6 scored 24% on DeepSWE, tied with GPT-5.4-mini, and asked whether Moonshot AI would submit K2.7-Code to the same benchmark.

Balasubramaniyan said it took 13 review rounds to get the benchmark data right for his router and that he would route coding tasks to K2.7-Code if the independent numbers hold up.

What this means for enterprises

The token efficiency gain is immediately usable. Teams running K2.6 in production can swap in K2.7-Code via the OpenAI-compatible API and expect lower inference costs on agentic workflows without an architecture change. The 30% thinking-token reduction is Moonshot's own number, but the integration path is low-risk enough to test against your own workloads before committing.

The practical question is whether those efficiency gains hold on a team's own task distribution. Running K2.7-Code against your own workloads before adjusting gateway weights is the low-risk path to finding out.

© 2026 Now Let Us. All rights reserved.

Source: VentureBeat

Advertisement
Ad slot ready: 5887729102

More in this category

NOW LET US Related – Anthropic blocks all public access to Claude Fable 5, Mythos 5 following US government order — what enterprises should do

startups-vc

Anthropic blocks all public access to Claude Fable 5, Mythos 5 following US government order — what enterprises should do

Following an unprecedented US government export control directive, Anthropic has globally suspended all access to its newly released Claude Fable 5 and Mythos 5 models. This sudden blackout highlights the urgent need for enterprises to diversify their AI supply chains and adopt model-agnostic architectures.

NOW LET US Related – Google researchers introduce 'faithful uncertainty,' allowing LLMs to offer best guesses instead of hallucinations

startups-vc

Google researchers introduce 'faithful uncertainty,' allowing LLMs to offer best guesses instead of hallucinations

Google researchers have introduced 'faithful uncertainty,' a metacognitive technique that aligns an LLM's response with its internal confidence, allowing models to offer hedged hypotheses instead of defaulting to hallucinations or unhelpful silence.

NOW LET US Related – PixelRAG beats text parsers on accuracy and cuts AI agent token costs 10x

startups-vc

PixelRAG beats text parsers on accuracy and cuts AI agent token costs 10x

A research team from UC Berkeley, Princeton University, EPFL and Databricks published a paper introducing PixelRAG, a system that skips text parsing entirely by rendering pages as screenshots. It outperforms traditional text-based RAG in accuracy while slashing AI agent token costs by 10x.

NOW LET US Related – Microsoft’s open-source SkillOpt automatically upgrades AI agent skills without touching model weights

startups-vc

Microsoft’s open-source SkillOpt automatically upgrades AI agent skills without touching model weights

Microsoft has open-sourced SkillOpt, a framework that automatically optimizes AI agent skills using deep-learning-style controls. By treating skill documents as trainable objects, it boosts agent performance without altering the underlying model's weights.

NOW LET US Related – Context compression finally works in production: new research cuts LLM input 16x without the accuracy hit

startups-vc

Context compression finally works in production: new research cuts LLM input 16x without the accuracy hit

Context windows are becoming a computational bottleneck. A new family of encoder-decoder compression models called LCLMs compresses input context 16x, delivering 8.8x faster outputs without the typical accuracy drop.

NOW LET US Related – Beyond Instagram: Introducing the next generation of social apps

startups-vc

Beyond Instagram: Introducing the next generation of social apps

For years, social media has been dominated by Big Tech giants, but a new wave of startups is building smaller, more personal platforms. These innovative apps cater to younger generations looking for tighter-knit communities and niche connections.

EXPLORE TOPICS

Discover All Categories

Deep dive into the specific technology sectors that matter most to you.