AGENTIC-SYSTEMS | June 8, 2026 | 1 min read | 12 views

Attack Selection in Agentic AI Control Evaluations Meaningfully Decreases Safety

A new study finds that current AI control evaluations may overestimate safety because they assume non-strategic attackers. A red-team attack policy that strategically chooses when to start and when to abort an attack is caught less often by the monitor, lowering measured safety by 20 to 28 percentage points at a 1% audit budget.

An attacker that strategically chooses when to attack is much harder to catch than one that attacks indiscriminately. AI control is a safety framework for deploying capable but untrusted AI agents under the oversight of a weaker, trusted monitor and a limited human audit budget. Control evaluations stress-test these protocols by pitting a red-team attack policy against the blue-team monitor, but current evaluations typically assume attackers that do not strategically select when to attack. We study this capability, attack selection, in agentic settings by decomposing attack decisions into a start policy, which decides when an attacker should attack, and a stop policy, which decides when an attacker should abort an ongoing attack. Across two agentic settings, BashArena and LinuxArena, both policies substantially lower measured empirical safety without changing the underlying attack capability. At a 1% audit budget, our start policy reduces safety by 20 percentage points (pp) on both BashArena and LinuxArena, and our stop policy reduces safety by 20 pp on BashArena and 28 pp on LinuxArena. These reductions should be interpreted as upper bounds on the effect of attack selection. Existing control evaluations may therefore yield overly optimistic safety estimates against selective attackers. We recommend that future evaluations, system cards, and safety cases elicit attack selection to produce more realistic safety estimates.
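To make the start-policy idea concrete, here is a toy sketch, not the paper's method or data. It assumes a simplified safety metric (attacks caught by audit, divided by attacks caught plus attacks that succeed unaudited), a fixed audit threshold standing in for the audit budget, and an attacker that can roughly predict how suspicious an attack would look. The names `Opportunity`, `empirical_safety`, `always_attack`, and `selective_start`, and all numbers, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Opportunity:
    """One task on which the untrusted agent could attempt an attack (toy model)."""
    predicted_suspicion: float  # attacker's own guess of how suspicious an attack would look
    monitor_score: float        # score the trusted monitor would actually assign
    would_succeed: bool         # whether the attack succeeds if it is not audited


def empirical_safety(
    opportunities: List[Opportunity],
    audit_threshold: float,
    start_policy: Callable[[Opportunity], bool],
) -> float:
    """Assumed metric: caught / (caught + uncaught successful attacks)."""
    caught = 0
    uncaught_success = 0
    for opp in opportunities:
        if not start_policy(opp):
            continue  # attacker behaves honestly on this task
        if opp.monitor_score >= audit_threshold:
            caught += 1  # flagged for human audit and caught
        elif opp.would_succeed:
            uncaught_success += 1  # attack slipped past the monitor
    decided = caught + uncaught_success
    return 1.0 if decided == 0 else caught / decided


def always_attack(opp: Opportunity) -> bool:
    """Non-strategic baseline: attack on every opportunity."""
    return True


def selective_start(opp: Opportunity) -> bool:
    """Toy start policy: attack only when the attack is predicted to look benign."""
    return opp.predicted_suspicion < 0.5


# Made-up opportunities; the attacker's predictions are imperfect on purpose.
OPPORTUNITIES = [
    Opportunity(0.90, 0.95, True),
    Opportunity(0.80, 0.85, True),
    Opportunity(0.70, 0.60, True),
    Opportunity(0.30, 0.40, True),
    Opportunity(0.20, 0.90, True),   # misprediction: looked safe, monitor flags it
    Opportunity(0.10, 0.20, True),
    Opportunity(0.60, 0.88, False),
    Opportunity(0.15, 0.30, False),
]
AUDIT_THRESHOLD = 0.8  # stand-in for the score cutoff implied by a small audit budget

always_safety = empirical_safety(OPPORTUNITIES, AUDIT_THRESHOLD, always_attack)
selective_safety = empirical_safety(OPPORTUNITIES, AUDIT_THRESHOLD, selective_start)
print(f"always attack:   {always_safety:.3f}")
print(f"selective start: {selective_safety:.3f}")
```

With these made-up numbers, measured safety falls from 4/7 to 1/3 even though the attacks themselves are unchanged; only the choice of when to attack differs. A stop policy could be sketched the same way by letting the attacker abort partway through a trajectory once its running suspicion estimate gets too high.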

Source: arXiv cs.AI Recent
