Refusal Beyond a Single Direction: A Preliminary Comparison of Diff-in-Means and INLP

A preliminary study compares difference-in-means (DiM) interventions with interventions derived from Iterative Nullspace Projection (INLP) for steering LLM refusal. Its results suggest that models encode the absence of a concept differently from its opposite.

Computer Science > Artificial Intelligence

Abstract: Arditi et al. (2024) have shown that refusal in safety fine-tuned chat models is mediated by a single linear direction in the residual stream, recoverable by a difference-in-means (DiM) of harmful and harmless activations. We compare DiM-based interventions (activation addition and directional ablation) with two interventions derived from Iterative Nullspace Projection (INLP) -- nullspace projection and counterfactual flipping -- on five open-weight chat models, asking whether INLP can match DiM at steering refusal and whether its richer parameterisation yields more tweakable interventions. INLP counterfactual flipping is competitive with DiM directional ablation on refusal suppression, while nullspace projection is consistently weaker. Restricting INLP to the leading directions of the extracted subspace preserves most of the suppression effect at near-baseline perplexity, giving a tunable capability. Geometrically, the two INLP interventions land in qualitatively different regions of activation space: nullspace projection collapses transformed activations between the harmful and harmless clusters, while counterfactual flipping moves them into the opposite cluster, suggesting that the model encodes the absence of a concept differently from its opposite -- an intriguing distinction that warrants further investigation in future work.
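
The four interventions named in the abstract can be illustrated on toy data. The sketch below is a minimal NumPy illustration, not the authors' implementation: the synthetic "activations", all function names, the least-squares probe used in place of a trained linear classifier inside the INLP loop, and the definition of counterfactual flipping as a reflection about the midpoint of the class means within the INLP subspace are all assumptions made here, since the abstract does not specify them.

```python
import numpy as np

def dim_direction(harmful, harmless):
    """Unit-norm difference of class means (the DiM direction)."""
    diff = harmful.mean(axis=0) - harmless.mean(axis=0)
    return diff / np.linalg.norm(diff)

def directional_ablation(x, r):
    """Remove each row's component along the unit direction r."""
    return x - np.outer(x @ r, r)

def activation_addition(x, r, alpha):
    """Shift every row by alpha along the unit direction r."""
    return x + alpha * r

def inlp_directions(X, y, k):
    """Toy INLP loop: fit a linear probe, keep its direction, project it out, repeat."""
    Xp, dirs = X.copy(), []
    for _ in range(k):
        # Assumption: a least-squares probe stands in for the usual linear classifier.
        w, *_ = np.linalg.lstsq(Xp - Xp.mean(axis=0), y - y.mean(), rcond=None)
        for d in dirs:                 # re-orthogonalise for numerical safety
            w = w - (w @ d) * d
        w = w / np.linalg.norm(w)
        dirs.append(w)
        Xp = Xp - np.outer(Xp @ w, w)  # project onto the probe's nullspace
    return np.stack(dirs)              # shape (k, dim), orthonormal rows

def nullspace_projection(x, B):
    """Zero out the component of x inside the subspace spanned by the rows of B."""
    return x - (x @ B.T) @ B

def counterfactual_flip(x, B, midpoint):
    """Assumed definition: mirror the in-subspace component about the class midpoint."""
    return x - 2.0 * ((x - midpoint) @ B.T) @ B

# Synthetic stand-ins for residual-stream activations (hypothetical data).
rng = np.random.default_rng(0)
n, dim = 200, 16
shift = np.zeros(dim)
shift[0] = 4.0
harmful = rng.normal(size=(n, dim)) + shift
harmless = rng.normal(size=(n, dim))

r = dim_direction(harmful, harmless)
ablated = directional_ablation(harmful, r)
steered = activation_addition(harmless, r, alpha=2.0)

X = np.vstack([harmful, harmless])
y = np.concatenate([np.ones(n), np.zeros(n)])
B = inlp_directions(X, y, k=4)
midpoint = 0.5 * (harmful.mean(axis=0) + harmless.mean(axis=0))

projected = nullspace_projection(harmful, B)
flipped = counterfactual_flip(harmful, B, midpoint)
flipped_leading = counterfactual_flip(harmful, B[:1], midpoint)  # leading direction only
```

Under these assumptions, nullspace projection removes the in-subspace component entirely, while flipping maps the harmful class mean onto the harmless class mean inside the subspace, which is one way to picture the geometric contrast the abstract describes. Passing only the first rows of `B`, as in `flipped_leading`, mimics restricting the intervention to the leading INLP directions.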


Source: arXiv cs.AI Recent
