Detecting and Controlling Sycophancy with Cascading Linear Features

Researchers have proposed an iterative data generation pipeline that uses cascading linear features to detect and control sycophancy in large language models. The resulting features match or outperform LLM-as-a-judge and system-prompting baselines at lower computational cost.

Abstract: Interpreting and controlling model behaviors through activation steering methods requires many pairs of contrastive samples that clearly exhibit desired or undesired behavior. These data pairs determine the degree to which interpretability frameworks can reliably detect model features responsible for a behavior, and therefore the ability to steer models toward or away from such behavior. In this work, we present an iterative data generation pipeline that isolates cascading linear features responsible for a behavior. Specifically, we show how moving beyond simple binary pairs of samples, and instead isolating samples that show degrees of features that scale linearly with behavior, allows for better disentanglement of features. We focus on detecting and steering away from sycophancy -- the tendency of language models to prioritize user validation. We demonstrate that sycophancy features discovered through cascading samples form linearly separable subspaces, and allow for selection of model activations that more clearly correspond to the desired behavior than baseline approaches. We also evaluate their ability to enable detection, deterministic scoring, and robust steering, and see that they either match or outperform LLM-as-a-judge and system prompting baselines while providing lower computational demand and more interpretability guarantees. Code & Data: this https URL
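To make the idea of a linear feature that scales with behavior more concrete, here is a minimal sketch in plain Python. It is not the paper's pipeline: the activations are synthetic, the four graded "sycophancy levels" stand in for cascading samples, and the direction is fitted with a simple difference of means, a common activation-steering baseline swapped in for illustration. All names (`fit_direction`, `score`, `steer`, `fake_activation`) are hypothetical.

```python
import random


def mean_vec(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    norm = dot(v, v) ** 0.5
    return [x / norm for x in v]


def fit_direction(acts_by_level):
    """Difference-of-means direction between the highest and lowest level.

    Returns the unit direction and the lowest-level mean, used as the origin.
    """
    levels = sorted(acts_by_level)
    low = mean_vec(acts_by_level[levels[0]])
    high = mean_vec(acts_by_level[levels[-1]])
    return unit([h - l for h, l in zip(high, low)]), low


def score(act, direction, origin):
    """Deterministic score: signed projection of (act - origin) onto direction."""
    return dot([a - o for a, o in zip(act, origin)], direction)


def steer(act, direction, origin, strength=1.0):
    """Move an activation back toward the origin along the fitted direction."""
    s = score(act, direction, origin)
    return [a - strength * s * d for a, d in zip(act, direction)]


# Synthetic stand-in for model activations (assumption: the behavior is
# encoded along one hidden direction, plus Gaussian noise).
rng = random.Random(0)
DIM = 16
true_dir = unit([rng.gauss(0, 1) for _ in range(DIM)])


def fake_activation(level):
    return [2.0 * level * t + rng.gauss(0, 0.3) for t in true_dir]


# Four graded levels instead of a single binary contrast.
acts_by_level = {lvl: [fake_activation(lvl) for _ in range(50)] for lvl in range(4)}

direction, origin = fit_direction(acts_by_level)
level_scores = {
    lvl: sum(score(a, direction, origin) for a in acts) / len(acts)
    for lvl, acts in acts_by_level.items()
}
steered = steer(acts_by_level[3][0], direction, origin)
```

In this toy setting the mean score rises with the level, which is the kind of linear scaling the abstract describes, and steering with `strength=1.0` removes the component along the fitted direction. Real activations would come from a model's hidden states, and the paper's iterative sample selection is not reproduced here.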

Source: arXiv cs.AI Recent
