A Contextual-Bandit Oversight Game with Two-Sided Informational Asymmetry

Researchers have introduced a game-theoretic model of runtime human oversight of AI agents under two-sided informational asymmetry, in which the human privately knows her reward function and the AI privately knows the quality of the action it proposes.



Abstract: We study runtime human oversight of an AI agent when private information runs in both directions: the human privately knows her reward function, while the AI privately knows the quality of the action it proposes. This is the kind of asymmetry that arises naturally when an autonomous robot or software agent has inspected a situation its human supervisor cannot directly assess. Building on Cooperative Inverse Reinforcement Learning (CIRL) and the Oversight Game, we introduce a contextual-bandit team game with two-sided asymmetric information and a play/ask/trust/oversee interface. The bandit structure removes physical state transitions and thereby yields exact one-shot characterizations that would remain conjectural in the full POMDP setting, though the common belief remains a dynamically controlled state across rounds. We give two one-shot characterizations, a team optimum and a behaviorally natural myopic rule, whose gap is a slab of avoidable harm: a region in which the AI privately knows the proposed action is harmful and shutdown would help, yet a myopic human, trusting her prior, declines to oversee. We show this gap is the price of non-credible oversight communication, and give a partial analysis of how it resolves dynamically over repeated rounds through passive learning and active signaling with a one-period-lagged oversight response.
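The gap between the team optimum and the myopic rule can be made concrete with a toy one-shot example. The sketch below is not the paper's model: the two-point value distribution, the fixed oversight cost, the assumption that overseeing reveals the value and shuts down a harmful action at payoff zero, and all names (`ToyGame`, `myopic_human_oversees`, `team_optimum_oversees`, `avoidable_harm_region`) are hypothetical choices made only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ToyGame:
    """Hypothetical one-shot oversight game (illustrative assumption only)."""

    values: tuple[float, ...]  # possible values of the AI's proposed action
    prior: tuple[float, ...]   # human's prior probability of each value
    oversight_cost: float      # assumed cost the human pays to oversee


def myopic_human_oversees(game: ToyGame) -> bool:
    """Myopic rule: oversee only if the harm she expects to avoid,
    computed from her prior alone, exceeds the oversight cost."""
    expected_harm_avoided = sum(
        p * max(-v, 0.0) for v, p in zip(game.values, game.prior)
    )
    return expected_harm_avoided > game.oversight_cost


def team_optimum_oversees(game: ToyGame, true_value: float) -> bool:
    """Team benchmark: with the AI's private value pooled, oversight
    (and shutdown) is worthwhile when the harm exceeds the cost."""
    return -true_value > game.oversight_cost


def avoidable_harm_region(game: ToyGame) -> list[float]:
    """Values where the team optimum oversees but the myopic human does not."""
    if myopic_human_oversees(game):
        return []
    return [v for v in game.values if team_optimum_oversees(game, v)]


# An optimistic prior: the action is harmful (value -2) with probability 0.1.
game = ToyGame(values=(-2.0, 1.0), prior=(0.1, 0.9), oversight_cost=0.5)
print(myopic_human_oversees(game))   # False: expected harm avoided is 0.2 < 0.5
print(avoidable_harm_region(game))   # [-2.0]
```

In this toy setting the human declines to oversee because her prior makes harm look unlikely, even though the AI knows the realized value is -2 and shutdown would help. A less optimistic prior, such as `(0.5, 0.5)`, makes the myopic rule oversee and the region empty.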


Source: arXiv cs.AI Recent
