**Introducing SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding**

SPEED-Bench is a unified benchmark designed to evaluate Speculative Decoding (SD) across diverse semantic domains and realistic serving regimes using production-grade inference engines.

SD uses a lightweight draft model to speculate multiple future tokens, which are then verified in parallel by the target model. This way, SD can significantly improve throughput while preserving the exact output distribution of the target model.
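To make the "exact output distribution" claim concrete, here is a minimal sketch of verifying a single drafted token. It assumes the standard speculative sampling rule (accept with probability min(1, p/q), otherwise resample from the normalized residual max(0, p - q)); the methods evaluated by the benchmark may use other verification schemes. The names `verify_token` and `emitted_frequencies` and the toy distributions are illustrative, not part of SPEED-Bench.

```python
import random
from typing import List, Tuple


def verify_token(p: List[float], q: List[float], drafted: int,
                 rng: random.Random) -> Tuple[int, bool]:
    """Verify one drafted token.

    p: target-model probabilities, q: draft-model probabilities.
    Returns (emitted_token, draft_was_accepted).
    """
    # q[drafted] > 0 because the token was sampled from q.
    accept_prob = min(1.0, p[drafted] / q[drafted])
    if rng.random() < accept_prob:
        return drafted, True
    # On rejection, resample from the part of p that q under-covers.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    token = rng.choices(range(len(p)), weights=residual)[0]
    return token, False


def emitted_frequencies(p: List[float], q: List[float],
                        trials: int, seed: int = 0) -> List[float]:
    """Empirical distribution of emitted tokens over many draft/verify rounds."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    for _ in range(trials):
        drafted = rng.choices(range(len(q)), weights=q)[0]
        token, _ = verify_token(p, q, drafted, rng)
        counts[token] += 1
    return [c / trials for c in counts]


# Toy distributions (assumed values, for illustration only).
target = [0.6, 0.3, 0.1]
draft = [0.2, 0.5, 0.3]
freqs = emitted_frequencies(target, draft, trials=20000)
print(freqs)  # should be close to `target`, even though the drafter is poor
```

Even with a badly mismatched drafter, the emitted tokens follow the target distribution; a poor drafter only lowers how often drafts are accepted, which is what the benchmark's quality metrics capture.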

Despite rapid progress in SD algorithms, their evaluation remains fragmented and often unrepresentative of real-world data and serving conditions.

In practice, SD speculation quality and inference speedups are inherently data-dependent, serving-regime–dependent, and system-dependent.

Yet most existing benchmarks rely on small prompt sets, limited semantic diversity, short input sequence lengths, batch size one, or high-level inference stacks that do not reflect production environments.

To address these gaps, we introduce SPEED-Bench: a unified benchmark designed to evaluate SD across diverse semantic domains and realistic serving regimes, using production-grade inference engines.

SD must be evaluated from two perspectives.

On one hand, draft quality depends on the semantic domain and entropy of the input text.

On the other hand, real-world speedups depend on batch size, input sequence length (ISL), and system constraints, which determine whether inference is memory-bound or compute-bound.

SPEED-Bench therefore introduces a benchmarking ecosystem for SD.

It combines two purpose-built dataset splits and a unified measurement framework, each designed to capture a different aspect of SD behavior:

- A “Qualitative” data split, optimized for semantic diversity and designed to measure speculation quality (drafter accuracy) across domains.
- A “Throughput” data split, constructed to evaluate system-level speedups across a range of input sequence lengths and under high concurrency.
- A unified measurement framework, integrated with production inference engines, that standardizes evaluation across systems.

Together, these components enable practitioners and researchers to analyze SD behavior that is often masked by existing benchmarks.

The goal of the Qualitative split is to measure speculative decoding quality, specifically conditional acceptance rates (ARs) and acceptance lengths (ALs), across a wide range of semantic domains.
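As a rough illustration, both metrics can be derived from a log of how many drafted tokens were accepted at each verification step. The sketch below assumes that acceptance length counts the accepted drafts plus the one token the target model always contributes per step, and that the conditional acceptance rate at position k is the fraction of steps accepting position k among those that accepted all earlier positions. These definitions and the function names are assumptions for illustration; the benchmark's exact definitions may differ.

```python
from typing import List


def acceptance_length(accepted_per_step: List[int]) -> float:
    """Mean tokens emitted per verification step (accepted drafts + 1 target token)."""
    if not accepted_per_step:
        raise ValueError("need at least one verification step")
    return sum(a + 1 for a in accepted_per_step) / len(accepted_per_step)


def conditional_acceptance_rates(accepted_per_step: List[int],
                                 draft_len: int) -> List[float]:
    """rates[k] = P(draft position k accepted | positions 0..k-1 accepted)."""
    rates = []
    for k in range(draft_len):
        # Steps in which the verifier actually reached position k.
        reached = sum(1 for a in accepted_per_step if a >= k)
        # Steps in which position k itself was accepted.
        accepted = sum(1 for a in accepted_per_step if a >= k + 1)
        rates.append(accepted / reached if reached else float("nan"))
    return rates


# Hypothetical log: accepted draft tokens per step, with 3 drafted tokens per step.
steps = [3, 1, 0, 2, 3, 3, 1, 2]
al = acceptance_length(steps)
ars = conditional_acceptance_rates(steps, draft_len=3)
print(al, ars)
```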

SpecBench introduced the first unified SD benchmark across diverse application scenarios, such as multi-turn conversation, translation, and mathematical reasoning, by aggregating instances from widely used datasets into a unified testing environment. Although it was a significant step toward standardized evaluation, it has critical limitations in scale and diversity. Most categories contain as few as 10 samples with short mean input lengths (< 100 tokens) that may fail to stress modern drafters. Some of its categories also lack structural diversity; for example, the multilingual category consists entirely of German-to-English translation prompts.

While extensive evaluation across numerous datasets is theoretically possible, it is tedious, impractical for rapid experimentation, and hinders direct comparisons between different research groups releasing SD algorithms and models. Instead of relying on exhaustive evaluations across disparate datasets, we curate a compact yet highly representative subset designed to maximize semantic diversity.

We aggregate data from 18 publicly available sources and organize it into 11 categories: Coding, Math, Humanities, STEM, Writing, Summarization, Roleplay, RAG, Multilingual, Reasoning, and QA.

Each category contains 80 samples, resulting in a total of 880 prompts.

Unlike prior benchmarks, which often suffer from low intra-category diversity, the SPEED-Bench Qualitative split explicitly prioritizes semantic diversity. To achieve this, each candidate prompt is embedded into a dense vector space using a pretrained text embedder. We then apply a selection algorithm that minimizes average pairwise cosine similarity within each category. This ensures that the selected samples span the semantic space as widely as possible, reducing redundancy and increasing evaluation fidelity.
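The selection step can be sketched as follows. The source does not name the exact algorithm, so this uses a simple greedy heuristic as a stand-in: start from the candidate least similar to the rest, then repeatedly add the candidate with the lowest total cosine similarity to the already selected set. The function names and the toy embeddings are hypothetical, and a real pipeline would take its vectors from the pretrained text embedder.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_diverse(embeddings: List[Sequence[float]], k: int) -> List[int]:
    """Greedily pick k indices with low pairwise cosine similarity (a heuristic)."""
    n = len(embeddings)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of candidates")
    sim = [[cosine(embeddings[i], embeddings[j]) for j in range(n)]
           for i in range(n)]
    # Seed with the candidate that is least similar to everything else.
    first = min(range(n), key=lambda i: sum(sim[i]))
    selected = [first]
    # total[i] = summed similarity of candidate i to the selected set.
    total = list(sim[first])
    while len(selected) < k:
        chosen = set(selected)
        nxt = min((i for i in range(n) if i not in chosen),
                  key=lambda i: total[i])
        selected.append(nxt)
        for i in range(n):
            total[i] += sim[nxt][i]
    return selected


# Toy embeddings: candidates 0 and 1 are near-duplicates.
candidates = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
picked = select_diverse(candidates, k=3)
print(picked)
```

With three slots for four candidates, the heuristic keeps only one of the two near-duplicates, which is the redundancy-reducing behavior described above. Greedy selection does not guarantee the global minimum of average pairwise similarity.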

This semantic diversity is critical for exposing domain-dependent behavior in SD, such as the strong contrast between low-entropy domains (e.g., Coding, Math) and high-entropy domains (e.g., Roleplay, Writing).

While the Qualitative split captures draft accuracy, it is insufficient for evaluating system-level speedups. We evaluate system-level speedups using two metrics: Throughput (Output TPS), the total tokens generated per second across all concurrent requests, and User TPS, the per-request token generation rate. User TPS acts as a proxy for end-user latency.
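The two metrics can be computed from per-request timing records, as in the sketch below. It assumes Output TPS is measured over the wall-clock window spanning all requests and that User TPS is the mean of per-request rates over each request's full duration; whether the benchmark excludes time to first token or aggregates differently is not stated here. `RequestRecord` and its fields are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestRecord:
    start_s: float       # request start time, seconds
    end_s: float         # request completion time, seconds
    output_tokens: int   # tokens generated for this request


def output_tps(records: List[RequestRecord]) -> float:
    """Total generated tokens per second across all concurrent requests."""
    window = max(r.end_s for r in records) - min(r.start_s for r in records)
    return sum(r.output_tokens for r in records) / window


def user_tps(records: List[RequestRecord]) -> float:
    """Mean per-request generation rate, a proxy for end-user latency."""
    rates = [r.output_tokens / (r.end_s - r.start_s) for r in records]
    return sum(rates) / len(rates)


# Four concurrent requests, each producing 500 tokens in 10 seconds.
records = [RequestRecord(0.0, 10.0, 500) for _ in range(4)]
print(output_tps(records), user_tps(records))  # 200.0 50.0
```

The example shows why both numbers are needed: adding concurrency raises Output TPS, while each user still sees only their own request's rate.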

In production environments, models are served under high concurrency and across a wide range of input sequence lengths, which are often much longer than the short ISL samples used in many SD benchmarks. As batch size increases, inference often transitions from a memory-bound regime to a compute-bound regime, fundamentally changing the cost-benefit trade-offs of speculative decoding.
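A back-of-the-envelope roofline estimate illustrates why the regime matters. The sketch below is a deliberately crude model, not the benchmark's methodology: it treats a decode step as the maximum of the time to read the weights once and the time for the matrix math, and it ignores KV-cache traffic, attention cost, and the drafter's own cost. The hardware and model numbers are hypothetical.

```python
def step_time_s(batch: int, tokens_per_seq: int, params: float,
                bytes_per_param: float, mem_bw: float, flops: float) -> float:
    """Crude roofline estimate of one forward step of the target model."""
    memory_s = params * bytes_per_param / mem_bw                 # read weights once
    compute_s = 2.0 * params * batch * tokens_per_seq / flops    # ~2 FLOPs per param per token
    return max(memory_s, compute_s)


def sd_speedup(batch: int, draft_len: int, accept_len: float, **hw: float) -> float:
    """Estimated SD speedup, ignoring draft-model cost (an optimistic assumption)."""
    baseline = step_time_s(batch, 1, **hw)             # one token per step
    verify = step_time_s(batch, draft_len + 1, **hw)   # verify drafts in parallel
    return accept_len * baseline / verify


# Hypothetical setup: 8B parameters in 16-bit, 2 TB/s bandwidth, 400 TFLOP/s.
HYPOTHETICAL_HW = dict(params=8e9, bytes_per_param=2.0, mem_bw=2e12, flops=4e14)

for batch in (1, 64, 512):
    print(batch, sd_speedup(batch, draft_len=3, accept_len=2.5, **HYPOTHETICAL_HW))
```

Under these assumed numbers, verification is nearly free while the step is limited by weight reads, so the estimated speedup tracks the acceptance length. Once the step is limited by compute, verifying extra drafted tokens costs real time and the estimate can fall below 1.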

The Throughput split is designed specifically to capture this behavior. We construct fixed ISL buckets ranging from 1k to 32k tokens, reflecting the growing importance of long-context applications such as coding assistants and retrieval-augmented generation.

For each ISL bucket, prompts are aggregated into three coarse difficulty categories corresponding to low-, mixed-, and high-entropy domains. Each ISL bucket contains 1,536 prompts (512 per difficulty category), providing sufficient volume to construct stable throughput Pareto curves across a wide range of batch sizes. Importantly, SPEED-Bench avoids the use of random token inputs for throughput benchmarking. As we show later, random tokens can severely distort acceptance behavior, expert routing in MoE models, and throughput measurements, leading to overly optimistic conclusions.

Benchmarking SD across inference engines presents a subtle but critical challenge. Different engines may apply different chat templates, handle BOS tokens differently, or tokenize inputs inconsistently. These differences can silently alter the drafted sequence, making cross-engine comparisons unreliable. SPEED-Bench introduces a lightweight measurement framework that handles tokenization and prompt formatting externally. Inference engines receive pre-tokenized sequences, ensuring that all systems process identical inputs. The framework integrates with production-grade engines: TensorRT-LLM, vLLM, and SGLang.

Source: Hugging Face Blog