BayesBench: Evaluating LLM Belief Trajectories Under Multi-Turn Evidence Accumulation

A new evaluation suite called BayesBench reveals that while scaling LLMs improves their latent inference and evidence accumulation, a significant gap remains in translating these gains into rational downstream predictions.

Abstract: Large language models (LLMs) are typically deployed in multi-turn conversations, where each turn provides new evidence that should reduce epistemic uncertainty about their environment. Acting rationally then requires inferring the unobserved quantities that govern it and updating beliefs about them as evidence accumulates. Yet most evaluations only score the model's final-turn answer in a single-turn format, leaving this process unexamined. We ask how closely LLMs' belief updates match those of a rational Bayesian reasoner in multi-turn settings, and introduce BayesBench, a suite of simulation environments that probe this across three progressively complex tasks: (i) Bayesian estimation, where the model infers an unknown parameter from sequential evidence; (ii) Bayesian prediction, where the model turns inferred beliefs about a latent variable into outcome forecasts; and (iii) latent-framed Bayesian prediction, where observations are filtered through a user-persona framing, requiring joint inference over the latent state and the persona. Across seven LLMs (3B to 70B), scaling improves latent inference and evidence accumulation, with updates occasionally matching the Bayesian posterior. However, these gains do not reliably carry over to downstream prediction, exposing a gap between inferring latent structure and using it to rationally update beliefs about the target outcome.
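To make the idea of a "rational Bayesian reasoner" concrete, the sketch below tracks a reference belief trajectory for the simplest possible case: estimating an unknown success probability from binary observations arriving one per turn. This is an illustration only. The abstract does not specify BayesBench's environments, priors, or scoring, so the Beta-Bernoulli model, the uniform prior, and the names `BetaBelief`, `belief_trajectory`, and `mean_abs_gap` are all assumptions made here for exposition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaBelief:
    """Beta(alpha, beta) belief over an unknown Bernoulli success probability.

    Hypothetical helper for illustration; the default is a uniform prior,
    which is an assumption, not something taken from the paper.
    """
    alpha: float = 1.0
    beta: float = 1.0

    def update(self, observation: int) -> "BetaBelief":
        # Conjugate update: a success (1) increments alpha, a failure (0) increments beta.
        return BetaBelief(self.alpha + observation, self.beta + (1 - observation))

    @property
    def mean(self) -> float:
        # Posterior mean; for this model it is also the predictive
        # probability that the next observation is a success.
        return self.alpha / (self.alpha + self.beta)


def belief_trajectory(observations):
    """Return the posterior mean before any evidence and after each turn."""
    belief = BetaBelief()
    trajectory = [belief.mean]
    for obs in observations:
        belief = belief.update(obs)
        trajectory.append(belief.mean)
    return trajectory


def mean_abs_gap(model_estimates, reference):
    """Average absolute gap between a model's per-turn estimates and the reference."""
    if len(model_estimates) != len(reference):
        raise ValueError("trajectories must have the same length")
    return sum(abs(m - r) for m, r in zip(model_estimates, reference)) / len(reference)


if __name__ == "__main__":
    reference = belief_trajectory([1, 1, 0])
    # A hypothetical model that never moves off its prior estimate.
    stubborn_model = [0.5, 0.5, 0.5, 0.5]
    print(reference)
    print(mean_abs_gap(stubborn_model, reference))
```

Comparing a model's stated estimate at every turn against such a reference, rather than only at the final turn, is one way to see whether each update moves in the right direction and by roughly the right amount.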

Source: arXiv cs.AI Recent

More in this category

agentic-systems

Investigating Multi-Agent Deliberation in Law

A new study investigates the potential of multi-agent AI systems in the legal domain. By simulating courtroom procedures and legal argumentation, this approach opens up new ways to solve complex cases requiring multi-perspective critical thinking.

agentic-systems

HyPOLE: Hyperproperty-Guided Multi-Agent Reinforcement Learning under Partial Observation

Researchers introduce HyPOLE, a novel framework that guides Multi-Agent Reinforcement Learning (MARL) under partial observability using formal specifications and HyperLTL temporal logic, outperforming traditional baselines.

agentic-systems

What Drives Interactive Improvement from Feedback?

A new study reveals that multi-turn improvements in LLMs are often driven by repeated attempts rather than feedback quality, highlighting that the student model's ability to act on feedback is the primary bottleneck.

agentic-systems

MultiUAV-Plat: An LLM-Oriented Platform, Benchmark and Framework for Multi-UAV Collaborative Task Planning

Researchers have introduced MultiUAV-Plat, a breakthrough simulation and benchmarking platform for LLM-based multi-UAV collaborative task planning, alongside the Agent4Drone framework which significantly improves task success rates.

agentic-systems

AgRefactor: Self-Evolving Agentic Workflow for HLS Compatibility and Performance

Researchers have introduced AgRefactor, an LLM-based multi-agent workflow that automates the refactoring of software code into HLS-compatible programs. Featuring a self-evolving memory system and tool integration, AgRefactor outperforms existing solutions and paves the way for automated chip design.

agentic-systems

Contrastive Reflection for Iterative Prompt Optimization

Researchers have introduced Contrastive Reflection, an iterative prompt-optimization framework for agentic information retrieval workflows. By comparing failed and successful execution traces, the method improves exact-match accuracy on HotpotQA from 51.4% to 60.4%.

EXPLORE TOPICS