When Does Learning to Stop Help? A Cost-Aware Study of Early Exits in Reasoning Models

A new study introduces LearnStop, a cost-aware early exit framework for reasoning language models, analyzing when learned stopping rules outperform simple thresholds. The research reveals that while learned stopping excels in free-form math tasks, simpler scalar rules remain highly competitive in multiple-choice and extremely difficult scenarios.

Abstract: Reasoning models spend different amounts of useful computation across instances, but it remains unclear when a learned stopping rule improves over simple confidence or convergence thresholds. We study this question with LearnStop, a hidden-state-free checkpoint stopper for reasoning language models. At fixed budget checkpoints, LearnStop probes a short answer from the current reasoning prefix and predicts prefix correctness from online features such as answer confidence, entropy, prefix vote share, answer stability, and backtracking-marker density. Across 18 task-model settings spanning GSM8K, MATH-500, MMLU-Pro, AIME-90, GPQA, Qwen3, and DeepSeek-R1 distillations, the answer is task-dependent. On free-form math, learned multi-feature stopping improves the fixed-budget frontier and often beats scalar exits: on GSM8K with Qwen3-32B, the empirical frontier reaches a post-hoc peak adapt gain of +0.157, validation-selected operating points preserve positive gains, and the paired gain over the strongest scalar baseline is +0.028. On multiple-choice and very hard settings, scalar confidence, entropy, or stability rules are competitive or stronger. We therefore frame learned stopping not as a universal replacement for scalar exits, but as a tool whose value depends on trajectory structure. We further provide validation-selected operating points, paired bootstrap tests, finite-grid lost-correct risk calibration, cost accounting under KV-fork, prefix-cache, and black-box regimes, H100 serving profiles, checkpoint-schedule sweeps, transfer analyses, and robustness checks. The main practical finding is that learned stopping is useful when many questions become correct before full budget but do not exhibit a single reliable scalar stopping signal; its benefits largely disappear when confidence or answer convergence already solves the stopping problem.
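To make the contrast between a learned multi-feature stopper and a scalar exit concrete, here is a minimal Python sketch of a checkpoint loop in the spirit of the abstract. It is not the authors' implementation. The `Probe` record, the marker word list, the exact feature definitions, the logistic scorer, and the weights in `EXAMPLE_WEIGHTS` are all assumptions made for illustration; the abstract names the features but does not specify how they are computed or which model combines them.

```python
import math
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple

# Hypothetical marker list; the abstract does not say which markers are used.
BACKTRACK_MARKERS = ("wait", "hmm", "actually", "alternatively")

@dataclass
class Probe:
    """One budget checkpoint (hypothetical record, not from the paper)."""
    budget: int                      # reasoning tokens spent so far
    answer: str                      # short answer probed from the prefix
    answer_probs: Dict[str, float]   # probe distribution over candidate answers
    prefix_text: str                 # reasoning prefix at this checkpoint

def features(history: Sequence[Probe]) -> Dict[str, float]:
    """Online features computed only from checkpoints seen so far."""
    cur = history[-1]
    answers = [p.answer for p in history]
    # Trailing run of checkpoints that agree with the current answer.
    run = 0
    for a in reversed(answers):
        if a != cur.answer:
            break
        run += 1
    words = cur.prefix_text.lower().split()
    markers = sum(w.strip(".,!?;:") in BACKTRACK_MARKERS for w in words)
    return {
        "confidence": cur.answer_probs.get(cur.answer, 0.0),
        "entropy": -sum(p * math.log(p) for p in cur.answer_probs.values() if p > 0),
        "vote_share": answers.count(cur.answer) / len(answers),
        "stability": run / len(answers),
        "marker_density": markers / max(len(words), 1),
    }

# Hand-picked illustrative weights, NOT fitted values from the paper.
EXAMPLE_WEIGHTS = {"confidence": 4.0, "entropy": -1.0, "vote_share": 1.0,
                   "stability": 2.0, "marker_density": -5.0}
EXAMPLE_BIAS = -4.0

def predict_correct(feats: Dict[str, float], weights: Dict[str, float], bias: float) -> float:
    """Logistic estimate that the current prefix answer is correct (assumed model)."""
    z = bias + sum(weights[k] * feats[k] for k in weights)
    return 1.0 / (1.0 + math.exp(-z))

def learned_exit(probes: Sequence[Probe], weights: Dict[str, float],
                 bias: float, threshold: float) -> Tuple[int, str]:
    """Stop at the first checkpoint whose predicted correctness reaches threshold."""
    history = []
    for probe in probes:
        history.append(probe)
        if predict_correct(features(history), weights, bias) >= threshold:
            return probe.budget, probe.answer
    return probes[-1].budget, probes[-1].answer   # fall back to the full budget

def scalar_confidence_exit(probes: Sequence[Probe], threshold: float) -> Tuple[int, str]:
    """Scalar baseline: stop as soon as answer confidence alone reaches threshold."""
    for probe in probes:
        if probe.answer_probs.get(probe.answer, 0.0) >= threshold:
            return probe.budget, probe.answer
    return probes[-1].budget, probes[-1].answer
```

Both functions return the budget spent and the answer emitted, so their accuracy and cost can be compared at matched operating points. In this sketch the only difference between the two rules is whether the stopping decision sees one signal or several, which is the distinction the abstract draws between scalar exits and learned stopping.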

Source: arXiv cs.AI Recent
