DeepSeek open sources DSpark, a new framework to speed up LLM inference by up to 85%

DeepSeek has open-sourced DSpark, an MIT-licensed framework that speeds up LLM inference by up to 85% using speculative decoding. The system aims to solve the high cost of AI deployment by making large models serve users faster and more efficiently.

Even as the geopolitical conversation around AI continues to grow more fraught following the U.S. government's actions to limit the new models from Anthropic and OpenAI, Chinese open source darling DeepSeek is back with yet another open release that could once again change AI development around the globe.

Over the weekend, the firm released DSpark, a new, MIT-Licensed system designed to make large language models answer faster without changing what the underlying model is trying to say.

The easiest way to think about it is this: most AI chatbots write like someone crossing a river one stepping stone at a time. They choose one small chunk of text, then the next, then the next.

DSpark gives the system a scout that runs a few steps ahead, guesses the likely path, and lets the larger model quickly check which steps are safe. When the guesses are good, the model moves faster. When the guesses are weak, DSpark tries not to waste time checking them.

DeepSeek published the work with a technical paper, model checkpoints and DeepSpec, a codebase for training and evaluating speculative decoding systems. The release is available through DeepSeek’s public GitHub and Hugging Face pages, both under the permissive, friendly, commonplace MIT license, making the new technique broadly usable by developers, researchers and commercial enterprise operations that want to study or adapt the approach.

The system is aimed at one of the most expensive problems in AI deployment: serving large models quickly enough for real users, while using hardware efficiently enough to make the economics work. That matters for consumer chatbots, coding assistants, agentic workflows and enterprise AI systems where users expect long answers to stream quickly rather than crawl out word by word.

DeepSeek is applying DSpark to its own latest frontier open model, DeepSeek-V4.

Specifically, DeepSeek used its new DSpark framework on DeepSeek-V4-Flash, its already speed-optimized 284-billion-parameter mixture-of-experts model with 13 billion active parameters, and DeepSeek-V4-Pro, its more thoughtful and powerful 1.6-trillion-parameter model with 49 billion active parameters (Both support context windows up to one million tokens).

But the broader significance is that* DSpark is not conceptually limited to DeepSeek-V4.* DeepSeek’s own tests and released checkpoints cover other open model families, including Alibaba's open weights Qwen and Google's open weights *Gemma. *

That means enterprise teams running open-weight models could, in principle, train or fine-tune DSpark-style draft modules for their own target models. It is not a switch that any API customer can flip from the outside, but it is a method that can travel to other models when the operator controls the weights and serving stack.

Staggering speed increases for generating tokens during inference

In DeepSeek’s live production tests, DSpark improved aggregate throughput by 51% for DeepSeek-V4-Flash at an 80-token-per-second-per-user service target, and by 52% for DeepSeek-V4-Pro at a 35-token-per-second-per-user target. At matched system capacity, DeepSeek reports per-user generation speedups of 60% to 85% for V4-Flash and 57% to 78% for V4-Pro over its prior MTP-1 production baseline.

The different speed claims measure different things. The 60% to 85% figure for V4-Flash, and the 57% to 78% figure for V4-Pro, describe how much faster individual users receive generated tokens when DeepSeek compares DSpark with MTP-1 at matched practical system capacity.

Those are the cleaner “generation speed” numbers. DeepSeek also reports much larger 661% and 406% increases, but these measure aggregate throughput under very strict speed targets: 120 tokens per second per user for V4-Flash and 50 tokens per second per user for V4-Pro.

At those targets, DeepSeek says its older MTP-1 baseline approaches an operational cliff, meaning it can keep only a small number of concurrent requests running while preserving that level of responsiveness.

DSpark avoids more of that collapse, so the percentage difference in total system output becomes much larger. Put simply: the 85% number is closer to “how much faster the ride feels for a user” under comparable conditions, while the 661% and 406% figures are closer to “how much more traffic the road can still carry” when the old system is already bottlenecking.

Why speculative decoding matters

LLMs usually generate text one token at a time. A token can be a word, part of a word, punctuation mark or other small piece of text. Every new token depends on the text already produced, so the model has to keep pausing, checking the full context and choosing the next piece.

That is accurate, but slow. It is like having a senior editor approve every word before a writer can move to the next one. The editor may be excellent, but the process creates a bottleneck.

Speculative decoding, developed in the early Transfomer era, tries to fix that bottleneck. Instead of asking the large model to produce every token one by one, the system uses a smaller or lighter draft component to suggest several likely next tokens. The large model then checks that batch of guesses in parallel. If the draft guessed correctly, the system moves ahead several tokens at once. If the draft made a bad guess, the system rejects the bad token and anything after it, adds a corrected token, and tries again.

The point is speed without changing the larger model’s intended output. In the standard speculative decoding setup, the draft model is not replacing the target model. It is acting more like an assistant who prepares a rough next sentence for the senior editor to approve or reject.

The idea did not appear out of nowhere with today’s large language models. A key precursor came in 2018, when Mitchell Stern, Noam Shazeer and Jakob Uszkoreit proposed blockwise parallel decoding for deep autoregressive models. Their method predicted multiple future steps in parallel, then kept the longest prefix validated by the main model. That paper established much of the draft-and-check intuition behind later speculative decoding work.

The research line became more explicit in 2022. Heming Xia, Tao Ge and co-authors introduced SpecDec, a draft-and-verify approach for sequence-to-sequence generation. Later that year, Yaniv Leviathan, Matan Kalman and Yossi Matias posted “Fast Inference from Transformers via Speculative Decoding,” which helped define the modern version of the technique for transformer-based language models. DeepMind researchers followed in 2023 with a closely related method called speculative sampling.

Those 2022 and 2023 papers are the clearest ancestors of how speculative decoding is discussed in current LLM inference work: a faster draft process proposes tokens, and the larger target model verifies them in a way designed to preserve the target model’s output distribution.

Since then, the field has moved quickly through several variants, including separate draft models, multi-token prediction heads, tree-based verification, feature-level methods such as EAGLE, self-speculation, Medusa-style extra heads and parallel/blockwise drafters such as DFlash.

The key metric is not how many tokens a draft model can guess. It is how many of those guesses the larger model actually accepts. Long speculative blocks help only if enough of the proposed tokens survive verification. Otherwise, the system spends compute checking guesses that it throws away.

That is the context for DSpark. Speculative decoding is already an established inference technique before DeepSeek’s release, with support in major serving stacks and multiple competing research approaches. But it is still not a solved problem. Speedups depend heavily on the draft model, the workload, the serving setup and the current traffic level. DSpark’s contribution is to improve both sides of the trade-off: it tries to draft more coherent token blocks and then verify only the parts of those

Source: VentureBeat

More in this category

Researchers introduce Self-Harness, a framework that lets AI agents rewrite their own rules, boosting performance up to 60%

7,000 Langflow servers are under attack. LangGraph and LangChain have the same holes

Fine-tuning forgets. RAG leaks context. Hypernetworks build the model your agent needs on demand.

Adobe embeds agentic AI workflows across Creative Cloud, shifting from media generation to production orchestration

AWS enters the context layer race with a graph that learns from agents, not manual curation

Databricks says it solved the decades-old data pipeline problem that's been slowing AI agents

Discover All Categories