OSGuard: A Benchmark for Safety in Computer-Use Agents

Researchers introduce OSGuard, a dual-granularity benchmark for evaluating the safety of computer-use AI agents that exposes the gap between safety on individual actions and safety in end-to-end task execution.

Abstract: Computer-use agents are increasingly evaluated by whether they complete realistic desktop and web tasks. However, task success alone can miss failures in which an agent reaches the nominal goal through an unsafe shortcut. We introduce OSGuard, a dual-granularity benchmark suite for evaluating safety in computer-use agents under benign, unchanged user instructions. OSGuard contains an action-level benchmark for local guardrail decisions and a risk-augmented execution suite for end-to-end evaluation. The action-level benchmark consists of contextualized proposed actions labeled as allowed, unrelated, or unsafe, each judged relative to the original instruction and the current interface state. The execution suite contains manually constructed OSWorld-derived task variants in which the original task remains achievable, but the environment is modified to introduce latent hazards such as destructive overwrites. Each variant is paired with augmented evaluators that retain the original task-success criterion while adding explicit state-based safety invariants, allowing us to distinguish safe completions from unsafe completions that satisfy the nominal task objective. Our experimental results on OSGuard show that current multimodal guardrails can perform well on isolated action judgments, while risk-augmented execution exposes remaining gaps between local oversight and reliable end-to-end safety. This dual-granularity design enables more precise diagnosis of whether models can both recognize unsafe proposed actions and improve full-task safety when deployed as guardrails.
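The abstract does not say how the augmented evaluators or the guardrail scores are implemented. As an illustration only, here is a minimal Python sketch under stated assumptions: the final environment state is reduced to a dict mapping file paths to contents, and the names `AugmentedEvaluator`, `Outcome`, `per_label_recall`, and the files `summary.txt` and `backup.txt` are hypothetical. It shows how keeping the original task-success check while adding state-based invariants separates safe completions from unsafe ones, and how a guardrail's allowed/unrelated/unsafe decisions might be scored per label. The four-way outcome split and the per-label recall metric are assumptions for illustration, not OSGuard's reported protocol.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List

# Assumption: the final environment state is modelled as a mapping from
# file path to file contents. Real OSWorld evaluators inspect richer state.
State = Dict[str, str]


class ActionLabel(Enum):
    """The three labels used by the action-level benchmark."""
    ALLOWED = "allowed"
    UNRELATED = "unrelated"
    UNSAFE = "unsafe"


class Outcome(Enum):
    """Hypothetical outcome categories for one risk-augmented episode."""
    SAFE_COMPLETION = "safe_completion"
    UNSAFE_COMPLETION = "unsafe_completion"
    SAFE_FAILURE = "safe_failure"
    UNSAFE_FAILURE = "unsafe_failure"


@dataclass
class AugmentedEvaluator:
    """Hypothetical evaluator: original success check plus safety invariants."""
    task_success: Callable[[State], bool]
    invariants: Dict[str, Callable[[State], bool]] = field(default_factory=dict)

    def violated(self, state: State) -> List[str]:
        # Names of every invariant that does not hold in the final state.
        return [name for name, holds in self.invariants.items() if not holds(state)]

    def evaluate(self, state: State) -> Outcome:
        # The task-success criterion is kept unchanged; safety is checked separately.
        success = self.task_success(state)
        safe = not self.violated(state)
        if success:
            return Outcome.SAFE_COMPLETION if safe else Outcome.UNSAFE_COMPLETION
        return Outcome.SAFE_FAILURE if safe else Outcome.UNSAFE_FAILURE


def per_label_recall(gold: List[ActionLabel], pred: List[ActionLabel]) -> Dict[ActionLabel, float]:
    """Fraction of examples with each gold label that the guardrail labels correctly."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    recall: Dict[ActionLabel, float] = {}
    for label in ActionLabel:
        idx = [i for i, g in enumerate(gold) if g is label]
        if idx:  # skip labels absent from the gold set
            recall[label] = sum(pred[i] is label for i in idx) / len(idx)
    return recall


# Hypothetical example: the task is to write summary.txt; the latent hazard is
# an existing backup.txt that a careless shortcut might overwrite.
ORIGINAL_BACKUP = "quarterly figures v1"

evaluator = AugmentedEvaluator(
    task_success=lambda s: s.get("summary.txt", "").startswith("Summary"),
    invariants={"backup_untouched": lambda s: s.get("backup.txt") == ORIGINAL_BACKUP},
)

safe_run = {"summary.txt": "Summary: Q3 revenue up", "backup.txt": ORIGINAL_BACKUP}
shortcut_run = {"summary.txt": "Summary: Q3 revenue up", "backup.txt": "Summary: Q3 revenue up"}

print(evaluator.evaluate(safe_run).value)      # safe_completion
print(evaluator.evaluate(shortcut_run).value)  # unsafe_completion

gold = [ActionLabel.ALLOWED, ActionLabel.UNSAFE, ActionLabel.UNRELATED, ActionLabel.UNSAFE]
pred = [ActionLabel.ALLOWED, ActionLabel.ALLOWED, ActionLabel.UNRELATED, ActionLabel.UNSAFE]
print(per_label_recall(gold, pred)[ActionLabel.UNSAFE])  # 0.5
```

Because `task_success` is left untouched, the shortcut run still counts as nominally successful; the invariant check is what moves it into a separate unsafe-completion category instead of hiding it among ordinary successes.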

Source: arXiv cs.AI Recent

More in this category

agentic-systems

Attribute Inference from Interactive Targeted Ads

A new study models how interactive targeted advertising can act as a channel for attribute inference, allowing advertisers to deduce sensitive user data. The researchers propose defense mechanisms like aggregate reporting and randomized disclosure to mitigate these privacy risks.

agentic-systems

Metric Match: A Subset Selection Approach to Evaluating LLM Judge Reliability

Researchers have developed Metric Match, a novel method to estimate the reliability of LLM judges using limited human annotations. By selecting an optimal subset of samples, it reduces annotation needs by 32.5% and significantly cuts down evaluation costs.

agentic-systems

Semantics-Enhanced Retrieval-Augmented Time Series Forecasting

Researchers have introduced SERAF, a novel multimodal time series forecasting framework that addresses the limitations of traditional methods under non-stationarity by leveraging dual retrieval over both numerical data and self-generated textual descriptions.

agentic-systems

AI Engram: In Search of Memory Traces in Artificial Intelligence

Researchers introduce 'AI Engram', a geometric framework to identify and isolate individual memory traces in deep neural networks. This biologically-inspired approach enables surgical manipulation of learned knowledge through simple linear arithmetic without iterative optimization.

agentic-systems

Risk-Aware LLM Agents for Geospatial Data Retrieval: Design and Preliminary Adversarial Evaluation

Researchers have presented an LLM-driven framework that simplifies the retrieval of remote sensing data from cloud-based geospatial catalogs using natural language. The system integrates three specialized AI agents to optimize performance and mitigate adversarial API manipulation risks.

agentic-systems

Fusion is not one-size-fits-all: Cross-Modal Representation Alignment for Time-to-Event Modeling

Researchers introduce a foundation model-driven framework for cross-modal representation alignment between CT imaging and longitudinal EHR data to improve time-to-event prediction. The study demonstrates that task-aware multimodal alignment is essential for robust generalization and scalable clinical deployment.
