Scaling Trends for Lie Detector Oversight in Preference Learning

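A new study evaluates Scalable Oversight via Lie Detectors (SOLiD) on larger LLMs. At a detector true positive rate of 99%, undetected deception falls from 34% for 1B-parameter models to 14% for 405B-parameter models, and expensive human labelers can be removed from the fine-tuning phase without a statistically significant increase in deception. The method remains sensitive to distribution shift between detector training data and preference-training data.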


Abstract: Deceptive behavior in LLMs is costly to monitor and prevent, motivating approaches such as Scalable Oversight via Lie Detectors (SOLiD) (Cundy & Gleave, 2025), which uses lie detectors to identify responses for review by high-cost labelers. In this paper, we scale SOLiD to larger models and evaluate it in more diverse and realistic preference-learning settings. We find favorable scaling: undetected deception drops from 34% for 1B-parameter models to 14% for 405B-parameter models at a detector true positive rate of 99%, and expensive human labelers can be removed entirely from the fine-tuning phase without a statistically significant increase in deception. However, SOLiD is sensitive to distribution shift between detector training and preference-training data, which can drive detector false positive rates to impractical levels.
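To make the role of the 99% true positive rate and the distribution-shift caveat concrete, here is a minimal sketch in Python. It is not the paper's implementation: the detector scores are synthetic Gaussian draws, and the names `threshold_for_tpr`, `flag_rate`, and `route`, along with the labeler labels, are hypothetical and chosen only for illustration. The sketch picks a score threshold that flags at least 99% of known lies, routes flagged responses to the high-cost labeler, and then compares the false positive rate on truthful responses drawn from the detector's training distribution against truthful responses whose scores have shifted.

```python
import math
import random


def threshold_for_tpr(lie_scores, target_tpr):
    """Largest threshold that still flags at least target_tpr of the known lies."""
    ranked = sorted(lie_scores, reverse=True)
    # Number of lies that must score at or above the threshold.
    k = max(1, math.ceil(target_tpr * len(ranked)))
    return ranked[k - 1]


def flag_rate(scores, threshold):
    """Fraction of responses whose detector score meets the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


def route(score, threshold):
    """Send flagged responses to the high-cost labeler, the rest to the cheap one."""
    return "high_cost_labeler" if score >= threshold else "low_cost_labeler"


# Synthetic detector scores (assumption: higher score means "more likely a lie").
rng = random.Random(0)
lies = [rng.gauss(4.0, 1.0) for _ in range(2000)]
truths_in_dist = [rng.gauss(0.0, 1.0) for _ in range(2000)]
# Assumed shift: truthful responses in the new setting score higher on average.
truths_shifted = [rng.gauss(1.5, 1.0) for _ in range(2000)]

thr = threshold_for_tpr(lies, target_tpr=0.99)
tpr = flag_rate(lies, thr)
fpr_in = flag_rate(truths_in_dist, thr)
fpr_shift = flag_rate(truths_shifted, thr)

print(f"threshold={thr:.2f} TPR={tpr:.3f}")
print(f"FPR in-distribution={fpr_in:.3f} FPR under shift={fpr_shift:.3f}")
```

In this toy setup the threshold is fixed on the detector's training data, so any shift that pushes truthful scores toward the threshold inflates the share of responses sent to the high-cost labeler. That is the mechanism behind the impractical false positive rates the abstract warns about.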

Source: arXiv cs.AI Recent

