NOW LET US – AI RAG SaaS Studio TP.HCM

HARC: Coupling Harmfulness and Refusal Directions for Robust Safety Alignment


Researchers have introduced HARC (Harmfulness-And-Refusal Coupling), a fine-tuning method aimed at making the safety alignment of Large Language Models (LLMs) more robust. By coupling the "harmfulness" and "refusal" directions in the model's internal representations, HARC is reported to resist jailbreak attacks without degrading general capability or inflating over-refusal.


Abstract: Understanding how aligned LLMs internally represent safety is critical for diagnosing alignment vulnerabilities, as it explains why jailbreaks succeed and informs the design of robust alignment strategies. Prior work shows that aligned LLMs encode harmfulness and refusal as separable directions in the residual stream at prompt-side token positions. We show that jailbreaks succeed at prompt encoding by suppressing either the refusal or harmfulness direction before any token is generated, with distinct attack classes occupying separable regions of the harmfulness-refusal plane. Extending the analysis to response-token positions, we find that the model recognizes harmful content while it is generating that content, even when it failed to recognize the input as harmful at the prompt side. Motivated by our findings, we introduce HARC (Harmfulness-And-Refusal Coupling), a fine-tuning method that pairs the two directions across both prompt and response positions. Since the intervention is confined to the harmfulness-refusal subspace, it leaves the rest of the residual stream intact and does not degrade general capability or inflate over-refusal. Across extensive experiments, HARC achieves the strongest robustness-capability-usability trade-off among six baselines spanning the major training-time and inference-time safety methods. The harmfulness and refusal directions at prompt and response positions transfer across the five model families and two scales we tested without architecture-specific tuning.
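To make the geometry in the abstract concrete, here is a minimal sketch of what "directions in the residual stream" and "an intervention confined to the harmfulness-refusal subspace" can look like numerically. It is not HARC's training procedure, which the abstract does not specify. It assumes a difference-of-means estimate for each direction (a common choice in interpretability work, used here as a stand-in), and every activation is synthetic: `fake_activations`, `DIM`, and the axis shifts are hypothetical placeholders for real hidden states.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = dot(v, v) ** 0.5
    return [x / norm for x in v]

def diff_of_means_direction(pos, neg):
    """Unit vector from the mean of `neg` to the mean of `pos` (assumed estimator)."""
    mean_pos = [sum(col) / len(pos) for col in zip(*pos)]
    mean_neg = [sum(col) / len(neg) for col in zip(*neg)]
    return normalize([p - q for p, q in zip(mean_pos, mean_neg)])

def orthonormal_basis(u, v):
    """Gram-Schmidt basis for the plane spanned by u and v."""
    e1 = normalize(u)
    overlap = dot(v, e1)
    e2 = normalize([x - overlap * y for x, y in zip(v, e1)])
    return e1, e2

def outside_component(h, e1, e2):
    """Part of h that lies outside the plane spanned by e1 and e2."""
    c1, c2 = dot(h, e1), dot(h, e2)
    return [x - c1 * a - c2 * b for x, a, b in zip(h, e1, e2)]

# Synthetic stand-ins for residual-stream activations (hypothetical data).
DIM = 8
rng = random.Random(0)

def fake_activations(n, axis=None, shift=2.0):
    rows = []
    for _ in range(n):
        row = [rng.gauss(0.0, 0.1) for _ in range(DIM)]
        if axis is not None:
            row[axis] += shift  # plant a signal along one coordinate
        rows.append(row)
    return rows

# Harmful vs. harmless prompts differ along axis 0; refused vs. complied along axis 1.
harm_dir = diff_of_means_direction(fake_activations(50, axis=0), fake_activations(50))
refuse_dir = diff_of_means_direction(fake_activations(50, axis=1), fake_activations(50))

# A jailbreak-like activation: it reads as harmful, but the refusal signal is absent.
h = fake_activations(1, axis=0)[0]
harm_score, refuse_score = dot(h, harm_dir), dot(h, refuse_dir)

# Push the activation along the refusal direction only, then measure how much
# the component outside the harmfulness-refusal plane moved.
e1, e2 = orthonormal_basis(harm_dir, refuse_dir)
before = outside_component(h, e1, e2)
h_edited = [x + 1.5 * r for x, r in zip(h, refuse_dir)]
after = outside_component(h_edited, e1, e2)
max_drift = max(abs(a - b) for a, b in zip(before, after))

print("plane coordinates (harm, refuse):", harm_score, refuse_score)
print("drift outside the plane after the edit:", max_drift)
```

The pair `(harm_score, refuse_score)` is the activation's position in the harmfulness-refusal plane, which is the view in which the abstract says attack classes separate. The drift value stays at floating-point noise because the edit vector lies inside the plane, which illustrates why a subspace-confined change can leave the rest of the representation untouched.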


© 2026 Now Let Us. All rights reserved.

Source: arXiv cs.AI Recent


