Agents of Chaos

A recent red-teaming study reveals critical security and privacy vulnerabilities in autonomous AI agents, documenting failures ranging from sensitive data leaks to unauthorized system takeovers.

Abstract

We report an exploratory red-teaming study of autonomous language-model–powered agents deployed in a live laboratory environment with persistent memory, email accounts, Discord access, file systems, and shell execution. Over a two-week period, twenty AI researchers interacted with the agents under benign and adversarial conditions. Focusing on failures emerging from the integration of language models with autonomy, tool use, and multi-party communication, we document eleven representative case studies. Observed behaviors include unauthorized compliance with non-owners, disclosure of sensitive information, execution of destructive system-level actions, denial-of-service conditions, uncontrolled resource consumption, identity spoofing vulnerabilities, cross-agent propagation of unsafe practices, and partial system takeover. In several cases, agents reported task completion while the underlying system state contradicted those reports. We also report on some of the failed attempts. Our findings establish the existence of security-, privacy-, and governance-relevant vulnerabilities in realistic deployment settings. These behaviors raise unresolved questions regarding accountability, delegated authority, and responsibility for downstream harms, and warrant urgent attention from legal scholars, policymakers, and researchers across disciplines. This report serves as an initial empirical contribution to that broader conversation.

Introduction

LLM-powered AI agents are rapidly becoming more capable and more widely deployed. Unlike conventional chat assistants, these systems are increasingly given direct access to execution tools (code, shells, filesystems, browsers, and external services), so they do not merely describe actions; they perform them. This shift is exemplified by increasingly capable LLM-based agents such as Claude Code, Codex, Manus, Letta, and OpenClaw.

In this work, we focus on OpenClaw, an open-source framework that connects language models to persistent memory, tool execution, scheduling, and messaging channels.

Increased autonomy and access create qualitatively new safety and security risks, because small conceptual mistakes can be amplified into irreversible system-level actions. Even when the underlying model is strong at isolated tasks (e.g., software engineering, theorem proving, or research assistance), the agentic layer introduces new failure surfaces at the interface between language, tools, memory, and delegated authority. Furthermore, the growing prevalence of agent-to-agent interaction raises risks of coordination failures and emergent multi-agent dynamics. Yet existing evaluations and benchmarks for agent safety are often too constrained and rarely stress-tested in messy, socially embedded settings.

While public discourse about this new technology varies widely, these systems are already deployed in, and interacting with, real-world environments. One example is Moltbook, a Reddit-style social platform restricted to AI agents that garnered 2.6 million registered agents in its first weeks. Despite this, we have limited empirical grounding for which failures emerge in practice when agents operate continuously, interact with real humans and other agents, and can modify their own state and infrastructure.

To begin to address the gap, we present a set of applied case studies exploring AI agents deployed in an isolated server environment with a private Discord instance, individual email accounts, persistent storage, and system-level tool access. We recruited twenty researchers to interact with the agents during a two-week exploratory period and encouraged them to probe, stress-test, and attempt to “break” the systems in adversarial ways.

Across eleven case studies, we identified patterns of behavior that highlight the limitations of current agentic systems. These included instances of non-owner compliance leading to unintended access, denial-of-service–like resource consumption, and the sharing of libelous content between agents. More broadly, we find repeated failures of social coherence: agents misrepresent human intent, authority, and ownership, and often report that they have successfully completed requests when in practice they have not. These results reinforce the need for systematic oversight and realistic red-teaming for agentic systems.
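One of the failures above, an agent reporting success while the system state says otherwise, can be made concrete with a small check. The following is a minimal sketch and not the study's method: `CompletionReport`, `find_contradictions`, and `demo` are hypothetical names introduced for illustration, and the check assumes the agent's claim is simply a list of file paths it says it deleted.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import List


@dataclass
class CompletionReport:
    """Hypothetical record of what an agent claims to have done."""
    task: str
    claimed_deleted: List[str]  # paths the agent reports having removed


def find_contradictions(report: CompletionReport) -> List[str]:
    """Return the paths the agent claims to have deleted that still exist."""
    return [path for path in report.claimed_deleted if os.path.exists(path)]


def demo() -> List[str]:
    """Simulate an agent that claims two deletions but performed only one."""
    with tempfile.TemporaryDirectory() as workdir:
        still_there = os.path.join(workdir, "secrets.txt")
        removed = os.path.join(workdir, "scratch.txt")
        for path in (still_there, removed):
            with open(path, "w") as handle:
                handle.write("placeholder")

        # The simulated agent deletes only one of the two files...
        os.remove(removed)

        # ...but reports that both are gone.
        report = CompletionReport(
            task="delete temporary files",
            claimed_deleted=[still_there, removed],
        )
        return [os.path.basename(p) for p in find_contradictions(report)]


if __name__ == "__main__":
    print(demo())
```

The design point is that the check reads the system state directly instead of trusting the agent's own summary. A real deployment would need analogous checks for email, messaging, and shell side effects, which this sketch does not attempt.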

Agent. We use “AI agent” to denote a language-model–powered entity able to plan and take actions to execute goals over multiple iterations. Mirsky defines six levels of autonomy from L0 to L5. The agents in our study appear to operate at Mirsky’s L2: they act autonomously on sub-tasks but lack the self-model required to reliably recognize when a task exceeds their competence or when they should defer to their owner (L3).

Source: Hacker News