AGENTIC-SYSTEMS | June 8, 2026 | 1 min read | 9 views

SafeGene: Reusable Adapters for Transferable Safety Alignment

Fine-tuning open-weight LLMs can inadvertently degrade their safety alignment, leaving them more vulnerable to malicious prompts even when the training data is not intentionally harmful. SafeGene addresses this with a reusable safety-adapter module that restores safety across downstream tasks within an architecture-compatible model family while maintaining downstream performance.

Abstract: Open-weight LLMs are increasingly fine-tuned into customized assistants, but downstream fine-tuning can weaken safety alignment and make models more vulnerable to malicious prompts, even when the training data is not intentionally harmful. This creates a recurring safety recovery problem as target models are repeatedly updated with new task data or user interactions. We propose SafeGene, a reusable safety-adapter module designed for cross-task reuse within each architecture-compatible model family. Rather than treating safety recovery as a model-specific repair step, SafeGene treats safety capability as an independent, reusable adapter representation decoupled from task-specific updates. This representation is obtained from aligned–degraded model discrepancies, refined into task-transferable safety vectors through data-aware layer selection, and expressed in each downstream task-adapted model via few-shot layer-wise coefficient recalibration. Experiments across multiple model families, downstream tasks, and safety judges show that SafeGene-enhanced models reduce harmful response rates while maintaining downstream performance, outperforming representative safe adaptation methods in safety–utility trade-off.
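The three stages named in the abstract (discrepancy extraction, layer selection, layer-wise coefficients) can be pictured with a toy sketch. This is not the authors' implementation: the function names, the norm-based selection rule (a stand-in for the paper's data-aware selection), and the hand-set coefficients (a stand-in for few-shot recalibration) are all assumptions made for illustration, and plain Python lists stand in for weight tensors.

```python
from typing import Dict, List

# Toy stand-in for model weights: layer name -> flat list of parameters.
Weights = Dict[str, List[float]]


def safety_vectors(aligned: Weights, degraded: Weights) -> Weights:
    """Per-layer discrepancy: aligned parameters minus degraded parameters."""
    return {
        name: [a - d for a, d in zip(aligned[name], degraded[name])]
        for name in aligned
    }


def select_layers(vectors: Weights, top_k: int) -> List[str]:
    """Assumed rule: keep the top_k layers with the largest L2 discrepancy norm.
    The paper describes a data-aware selection; this is only a placeholder."""
    norms = {name: sum(v * v for v in vec) ** 0.5 for name, vec in vectors.items()}
    return sorted(norms, key=norms.get, reverse=True)[:top_k]


def apply_adapter(task_model: Weights, vectors: Weights,
                  layers: List[str], coeffs: Dict[str, float]) -> Weights:
    """Add coefficient-scaled safety vectors to the selected layers only.
    Coefficients are passed in directly; the paper recalibrates them few-shot."""
    patched = {name: list(params) for name, params in task_model.items()}
    for name in layers:
        c = coeffs.get(name, 1.0)
        patched[name] = [p + c * v for p, v in zip(patched[name], vectors[name])]
    return patched


# Hypothetical three-layer "models" with two parameters per layer.
aligned = {"layer0": [1.0, 2.0], "layer1": [0.5, 0.5], "layer2": [0.0, 1.0]}
degraded = {"layer0": [1.0, 2.0], "layer1": [0.0, 0.0], "layer2": [0.0, 0.75]}
task_model = {"layer0": [1.5, 2.5], "layer1": [0.25, 0.0], "layer2": [1.0, 1.0]}

vectors = safety_vectors(aligned, degraded)
selected = select_layers(vectors, top_k=2)
patched = apply_adapter(task_model, vectors, selected,
                        coeffs={"layer1": 0.5, "layer2": 2.0})
```

In this sketch the safety vectors are computed once and reused against any task-adapted model with matching layer names and shapes, which mirrors the reuse-within-a-model-family idea; only the coefficients change per downstream model.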

Source: arXiv cs.AI Recent