AGENTIC-SYSTEMS | March 20, 2026

How Uncertainty Estimation Scales with Sampling in Reasoning Models

The study explores how reasoning models estimate uncertainty using verbalized confidence and self-consistency, finding that a hybrid approach significantly outperforms individual methods even with minimal sampling.

Uncertainty estimation is critical for deploying reasoning language models, yet remains poorly understood under extended chain-of-thought reasoning. We study parallel sampling as a fully black-box approach using verbalized confidence and self-consistency. Across three reasoning models and 17 tasks spanning mathematics, STEM, and humanities, we characterize how these signals scale.
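As a rough illustration of the two black-box signals named above, the sketch below computes them from a handful of parallel samples. The function names, the exact-match comparison of answers, and the choice to average stated confidence only over samples that agree with the majority answer are all assumptions made for this example; the summary does not specify how the study extracts or aggregates either signal.

```python
from collections import Counter

def self_consistency(answers):
    """Return the majority answer and the fraction of samples that agree with it."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

def verbalized_confidence(answers, confidences, target):
    """Mean stated confidence over the samples whose answer equals `target`.

    Averaging over agreeing samples is an illustrative assumption,
    not a rule taken from the study.
    """
    agreeing = [c for a, c in zip(answers, confidences) if a == target]
    return sum(agreeing) / len(agreeing) if agreeing else 0.0

# Hypothetical samples: four parallel generations for one question, each
# with a final answer and a self-reported confidence in [0, 1].
answers = ["42", "42", "17", "42"]
confidences = [0.9, 0.8, 0.6, 0.7]

majority, sc = self_consistency(answers)
vc = verbalized_confidence(answers, confidences, majority)
print(majority, sc, round(vc, 3))
```

Both signals need only sampled text, which is what makes the approach fully black-box: no logits or internal activations are required.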

Both self-consistency and verbalized confidence scale in reasoning models, but self-consistency exhibits lower initial discrimination and lags behind verbalized confidence under moderate sampling. Most uncertainty gains, however, arise from combining the two signals: with just two samples, a hybrid estimator improves AUROC by up to 12 points on average and already outperforms either signal alone, even when that signal is scaled to much larger sampling budgets. Beyond that point, returns diminish. These effects are domain-dependent: in mathematics, the native domain of RLVR-style post-training, reasoning models achieve higher uncertainty quality and exhibit both stronger complementarity and faster scaling than in STEM or humanities.
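To make the idea of combining signals concrete, here is a minimal sketch of a hybrid score evaluated with AUROC. The weighted average in `hybrid_score` is an assumption for illustration only, since the summary does not state the combination rule the study uses. The per-question values are invented toy numbers, not results from the study; they were chosen to show how a coarse agreement signal (with two samples it can only take the values 0.5 or 1.0) and a finer-grained confidence signal can complement each other.

```python
def hybrid_score(sc, vc, weight=0.5):
    """Weighted average of self-consistency and verbalized confidence.

    The averaging rule and the default weight are illustrative assumptions.
    """
    return weight * sc + (1.0 - weight) * vc

def auroc(scores, labels):
    """AUROC as the probability that a correct item outscores an incorrect one.

    Ties count as half a win. `labels` holds 1 for correct, 0 for incorrect.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect item")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy per-question signals from a two-sample budget (hypothetical values).
sc_scores = [1.0, 1.0, 0.5, 0.5]   # agreement between the two samples
vc_scores = [0.9, 0.5, 0.8, 0.6]   # mean stated confidence
correct = [1, 1, 1, 0]             # whether the final answer was right

hybrid = [hybrid_score(s, v) for s, v in zip(sc_scores, vc_scores)]

print("self-consistency AUROC:", auroc(sc_scores, correct))
print("verbalized AUROC:      ", auroc(vc_scores, correct))
print("hybrid AUROC:          ", auroc(hybrid, correct))
```

In this toy setup each signal misranks a different question, so the average ranks all correct answers above the incorrect one. Real data will not separate this cleanly; the example only shows the mechanism by which two imperfect signals can yield better discrimination together than apart.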

Source: arXiv cs.AI Recent
