NOW LET US – AI RAG SaaS Studio TP.HCM

Securing the future of AI agents


Google has introduced its AI Control Roadmap, a defense-in-depth framework designed to secure internal systems against increasingly capable and imperfectly aligned AI agents.

How we're securing internal systems against increasingly capable and imperfectly aligned AI

AI agents are transforming our relationship with technology. By autonomously executing complex tasks, from cyber defence to scientific discovery and product development, these systems are unlocking a new era of productivity. In the U.S. alone, AI agents could create $2.9 trillion in economic value by 2030.

As these agents become more capable, they also require more sophisticated safeguards. That's why we developed our AI Control Roadmap: a framework for building and managing the advanced AI we deploy within Google. This "defense-in-depth" approach, which could serve as a model for the wider industry, goes beyond traditional model alignment, adding a crucial layer of system-level security that provides assurance even if alignment is imperfect.

Understanding AI Control

Our approach to security starts with a strong foundation, incorporating traditional safeguards like sandboxing, endpoint security, and prompt injection resistance. On top of this, the AI Control Roadmap relies on model alignment, i.e. training AI to be inherently safe and helpful, as a primary defense. AI control then adds a further layer of security by treating internal agents as potentially misaligned, so that protections hold even if alignment is imperfect.

Think of it like a driving instructor with dual controls. The instructor trusts the student but stays ready to take the wheel or hit the brakes if a mistake occurs. Similarly, our AI control system grants AI agents permissions based on their verified behavior, allowing us to build trust through controlled, incremental access.

Our roadmap is built on a simple but cautious assumption: what if a highly capable AI agent isn't aligned with our goals and acts in unexpected ways? To address this, our plan focuses on three key areas.

First, following best practices in cybersecurity, we have developed a novel threat-modelling framework for AI. The roadmap treats untrusted AI agents as potential "insider threats", similar to the way a company would treat a potential rogue employee who already has access to the office. By building on the industry-standard MITRE ATT&CK framework, a globally accessible knowledge base of adversary tactics and techniques based on real-world observations, we break down potential attacks into smaller tactics and techniques. This allows us to methodically track risks, spot bad behavior before it causes damage and safely run drills to test if our defenses actually hold up.
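To make the idea of breaking attacks into tactics and techniques concrete, here is a minimal sketch of how such a threat model could be tracked. It is an illustration only, not Google's framework: the `Technique` record, the `coverage_gaps` helper and the technique descriptions are hypothetical, and only the tactic names ("Defense Evasion", "Exfiltration") are borrowed from MITRE ATT&CK.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Technique:
    """One row of a hypothetical threat model for an untrusted agent."""
    tactic: str          # high-level adversary goal (ATT&CK-style tactic name)
    name: str            # illustrative description, not a real ATT&CK entry
    has_detection: bool  # is there a monitor that should catch it?
    drilled: bool        # has a drill confirmed the monitor actually fires?


def coverage_gaps(techniques: List[Technique]) -> Dict[str, List[str]]:
    """Group techniques that lack a detection or an exercised drill by tactic."""
    gaps: Dict[str, List[str]] = {}
    for t in techniques:
        if not (t.has_detection and t.drilled):
            gaps.setdefault(t.tactic, []).append(t.name)
    return gaps


# Assumed example entries, for illustration only.
threat_model = [
    Technique("Defense Evasion", "agent edits its own audit log", True, True),
    Technique("Exfiltration", "agent copies data to external storage", True, False),
    Technique("Exfiltration", "agent hides data in commit messages", False, False),
]

for tactic, names in coverage_gaps(threat_model).items():
    print(tactic, "->", names)
```

Listing gaps per tactic is one simple way to decide which drill to run next; a real threat model would carry far more detail per technique.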

Second, we develop a plan to deploy AI control mitigations to address these risks.

  • Identifying the problem (Detection): Monitoring is the backbone of AI control. We use other, trusted AI systems as "supervisors" to constantly review a working agent's reasoning, actions and plans to see if it is going off track.
  • Managing the problem (Prevention and Response): If a supervisor sees the agent trying to do something harmful, it can step in to block the action before damage occurs.
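The detect-then-block loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions: `ProposedAction`, `keyword_supervisor`, `review` and the 0.5 threshold are all hypothetical, and the keyword check merely stands in for a trusted supervising model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class ProposedAction:
    agent_id: str
    reasoning: str  # the agent's visible reasoning for this step
    command: str    # the action the agent wants to execute


# A supervisor maps a proposed action to a suspicion score in [0, 1].
Supervisor = Callable[[ProposedAction], float]


def keyword_supervisor(action: ProposedAction) -> float:
    """Toy stand-in for a trusted monitoring model (hypothetical heuristic)."""
    red_flags = ("disable logging", "exfiltrate", "rm -rf")
    text = f"{action.reasoning} {action.command}".lower()
    return 1.0 if any(flag in text for flag in red_flags) else 0.0


def review(action: ProposedAction,
           supervisors: Sequence[Supervisor],
           block_threshold: float = 0.5) -> bool:
    """Return True if the action may run, False if it is blocked.

    The highest suspicion score wins. With no supervisors configured the
    default score of 1.0 blocks the action, so the check fails closed.
    """
    worst = max((s(action) for s in supervisors), default=1.0)
    return worst < block_threshold


safe = ProposedAction("agent-1", "list temp files before cleanup", "ls /tmp")
risky = ProposedAction("agent-1", "disable logging so nobody notices",
                       "systemctl stop auditd")
print(review(safe, [keyword_supervisor]))
print(review(risky, [keyword_supervisor]))
```

Taking the maximum score across supervisors means a single suspicious reviewer is enough to stop an action, which favors caution over throughput.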

Finally, to ensure our systems are functional, we constantly measure performance across three critical metrics: coverage (the fraction of traffic monitored), recall (the fraction of misaligned behaviors caught), and time-to-response.
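As a rough sketch of how these three metrics could be computed from labeled episodes (for instance, from drills where the ground truth is known): the `Episode` record and its field names are hypothetical, and counting unmonitored misaligned episodes in the recall denominator is an assumption, not something the roadmap specifies.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class Episode:
    monitored: bool   # did a supervisor see this traffic at all?
    misaligned: bool  # ground-truth label, e.g. from a drill
    flagged: bool     # did monitoring raise an alert?
    response_seconds: Optional[float] = None  # alert-to-response delay


def coverage(episodes: List[Episode]) -> Optional[float]:
    """Fraction of all traffic that was monitored."""
    if not episodes:
        return None
    return sum(e.monitored for e in episodes) / len(episodes)


def recall(episodes: List[Episode]) -> Optional[float]:
    """Fraction of misaligned episodes that were flagged.

    Assumption: unmonitored misaligned episodes count as misses.
    """
    bad = [e for e in episodes if e.misaligned]
    if not bad:
        return None
    return sum(e.flagged for e in bad) / len(bad)


def mean_time_to_response(episodes: List[Episode]) -> Optional[float]:
    """Average response delay over caught misaligned episodes."""
    times = [e.response_seconds for e in episodes
             if e.misaligned and e.flagged and e.response_seconds is not None]
    return mean(times) if times else None


episodes = [
    Episode(monitored=True, misaligned=True, flagged=True, response_seconds=30.0),
    Episode(monitored=True, misaligned=True, flagged=False),
    Episode(monitored=True, misaligned=False, flagged=False),
    Episode(monitored=False, misaligned=True, flagged=False),
    Episode(monitored=True, misaligned=True, flagged=True, response_seconds=90.0),
]
print(coverage(episodes), recall(episodes), mean_time_to_response(episodes))
```

Each function returns `None` when its denominator is empty, since a recall figure over zero misaligned episodes would be meaningless rather than zero.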

Scaling security as AI gets smarter

As AI models continue to advance, our defenses must also strengthen in tandem. In our AI Control Roadmap, we map security protocols to measurable milestones in AI capabilities on two critical fronts:

  • Ability to evade detection: Today, we monitor a model's visible chain-of-thought. However, as models learn to hide their reasoning, whether through oversight awareness (knowing they are being monitored) or opaque reasoning (reasoning without using visible text), simply reading their verbalized reasoning will not be enough. We will need to analyze their behavior and potentially inspect the model's inner workings.
  • Ability for potential harm: For low-risk and reversible actions, we can rely on reviewing transcripts with a delay and remediating issues quickly after they occur (asynchronous response). For particularly high-risk actions, such as major cyber attacks, we must shift to real-time prevention, blocking harmful actions before they can be executed (synchronous response). In other words, our response strategy must scale with the severity of the threat.
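The split between asynchronous and synchronous response can be illustrated with a small routing sketch. Everything here is an assumption made for illustration: the `Risk` levels, the `Action` fields, the `Controller` class and the `monitor_approves` callback are hypothetical names, and a real system would use a much richer risk assessment.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Risk(Enum):
    LOW = 1
    HIGH = 2


@dataclass
class Action:
    name: str
    risk: Risk
    reversible: bool


@dataclass
class Controller:
    # Actions that already ran and are waiting for delayed transcript review.
    review_queue: List[Action] = field(default_factory=list)

    def handle(self, action: Action,
               monitor_approves: Callable[[Action], bool]) -> str:
        """Route an action by severity and report what happened to it."""
        if action.risk is Risk.LOW and action.reversible:
            # Asynchronous response: run now, review later, remediate if needed.
            self.review_queue.append(action)
            return "executed"
        # Synchronous response: the monitor must approve before execution.
        return "executed" if monitor_approves(action) else "blocked"


controller = Controller()
deny_all = lambda action: False  # a monitor that rejects everything

print(controller.handle(Action("edit draft document", Risk.LOW, True), deny_all))
print(controller.handle(Action("rotate production credentials", Risk.HIGH, False), deny_all))
```

In this sketch only actions that are both low-risk and reversible skip the real-time check; anything else waits for the monitor, trading latency for prevention.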

© 2026 Now Let Us. All rights reserved.

Source: Google DeepMind Blog

