AGENTIC-SYSTEMS | April 9, 2026 | 1 min read | 16 views

SELFDOUBT: Uncertainty Quantification for Reasoning LLMs via the Hedge-to-Verify Ratio

Researchers have introduced SELFDOUBT, a single-pass framework that quantifies uncertainty in reasoning LLMs from behavioral signals in the reasoning trace itself. In the authors' evaluation, its score outperforms sampling-based semantic entropy at 10x lower inference cost.



Uncertainty estimation for reasoning language models remains difficult to deploy in practice: sampling-based methods are computationally expensive, while common single-pass proxies such as verbalized confidence or trace length are often inconsistent across models. This problem is compounded for proprietary reasoning APIs that expose neither logits nor intermediate token probabilities, leaving practitioners with no reliable uncertainty signal at inference time. We propose SELFDOUBT, a single-pass uncertainty framework that resolves this impasse by extracting behavioral signals directly from the reasoning trace itself. Our key signal, the Hedge-to-Verify Ratio (HVR), detects whether a reasoning trace contains uncertainty markers and, if so, whether they are offset by explicit self-checking behavior. Unlike methods that require multiple sampled traces or model internals, SELFDOUBT operates on a single observed reasoning trajectory, making it suitable for latency- and cost-constrained deployment over any proprietary API. We evaluate SELFDOUBT across seven models and three multi-step reasoning benchmarks (BBH, GPQA-Diamond, and MMLU-Pro). Most notably, traces containing no hedging markers are correct 96% of the time, revealing an emergent high-precision confidence gate at zero additional cost. For the remaining cases, the full SELFDOUBT score significantly outperforms sampling-based semantic entropy at 10x lower inference cost. A deployment cascade combining both stages attains 90% accuracy at 71% coverage without any task-specific labels. These results establish SELFDOUBT as a scalable, production-ready foundation for uncertainty estimation over proprietary reasoning models.
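
To make the idea concrete, here is a minimal Python sketch of a hedge-to-verify style signal and a two-stage cascade. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the marker lexicon, the exact HVR formula, or the full SELFDOUBT score. The marker lists, the form hedges / (verifications + 1), the threshold value, and all function names below are hypothetical, and the ratio stands in for the full score in the second stage.

```python
import re

# Hypothetical marker lists; the source does not specify the actual lexicon.
HEDGE_MARKERS = ("maybe", "perhaps", "not sure", "i think", "possibly", "might be")
VERIFY_MARKERS = ("let me check", "let me verify", "double-check", "to confirm")


def count_markers(trace: str, markers: tuple[str, ...]) -> int:
    """Count case-insensitive, whole-phrase occurrences of every marker."""
    text = trace.lower()
    return sum(len(re.findall(r"\b" + re.escape(m) + r"\b", text)) for m in markers)


def hedge_to_verify_ratio(trace: str) -> float:
    """Assumed form: hedges / (verifications + 1); 0.0 means no hedging at all."""
    hedges = count_markers(trace, HEDGE_MARKERS)
    verifies = count_markers(trace, VERIFY_MARKERS)
    return hedges / (verifies + 1)


def cascade(trace: str, threshold: float = 1.0) -> str:
    """Stage 1 accepts hedge-free traces; stage 2 thresholds the ratio.

    The threshold is an arbitrary illustrative value, and the ratio is a
    stand-in for the full SELFDOUBT score, which the abstract does not define.
    """
    hvr = hedge_to_verify_ratio(trace)
    if hvr == 0.0:
        return "accept_gate"  # no hedging markers found
    return "accept_score" if hvr < threshold else "abstain"


def coverage_and_accuracy(decisions: list[str], correct: list[bool]) -> tuple[float, float]:
    """Coverage is the fraction answered; accuracy is computed on answered items only."""
    answered = [c for d, c in zip(decisions, correct) if d != "abstain"]
    coverage = len(answered) / len(decisions)
    accuracy = sum(answered) / len(answered) if answered else 0.0
    return coverage, accuracy


# Toy traces and correctness labels, invented purely for illustration.
traces = [
    "The answer is 42 because 6 * 7 = 42.",
    "Maybe it is B. Let me verify: yes, B.",
    "Perhaps A, or maybe C. I think it might be C.",
]
decisions = [cascade(t) for t in traces]
coverage, accuracy = coverage_and_accuracy(decisions, [True, True, False])
```

In this toy run the first trace passes the hedge-free gate, the second is accepted because its single hedge is offset by a verification step, and the third is withheld. Reporting accuracy only over answered items, alongside the fraction answered, mirrors how the abstract states its "90% accuracy at 71% coverage" result.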

Source: arXiv cs.AI Recent
