AGENTIC-SYSTEMS | April 20, 2026 | 1 min read

SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems

SocialGrid is a new embodied multi-agent environment inspired by 'Among Us' that evaluates LLM agents on planning and social reasoning, revealing significant gaps in deception detection and complex task execution.

Abstract: As Large Language Models (LLMs) transition from text processors to autonomous agents, evaluating their social reasoning in embodied multi-agent settings becomes critical. We introduce SocialGrid, an embodied multi-agent environment inspired by Among Us that evaluates LLM agents on planning, task execution, and social reasoning. Our evaluations reveal that even the strongest open model (GPT-OSS-120B) achieves below 60% accuracy in task completion and planning, with agents getting stuck in repetitive behaviors or failing to navigate basic obstacles. Since poor navigation confounds evaluation of social intelligence, SocialGrid offers an optional Planning Oracle to isolate social reasoning from planning deficits. While planning assistance improves task completion, social reasoning remains a bottleneck: agents fail to detect deception at near-random chance regardless of scale, relying on shallow heuristics rather than accumulating behavioral evidence. SocialGrid provides automatic failure analysis and fine-grained metrics, enabling developers to diagnose and improve their agents. We also establish a competitive leaderboard using Elo ratings from adversarial league play.
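For readers unfamiliar with Elo ratings, the sketch below shows the textbook pairwise Elo update in Python. It is an illustration only: the abstract does not describe how SocialGrid computes its ratings, so the pairwise formulation, the function names `expected_score` and `update_elo`, the K-factor of 32, and the 1500 starting rating are all assumptions, not details from the paper.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (base 10, scale 400)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one match.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k = 32 is a common default, assumed here; the paper's value is not given.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two hypothetical agents start at 1500 (assumed); A wins the match.
    print(update_elo(1500.0, 1500.0, 1.0))  # (1516.0, 1484.0)
```

In a team game such as the one described, a league would also need a rule for turning a match outcome into pairwise or team-level scores; that mapping is not specified in the abstract and is left out of this sketch.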

Source: arXiv cs.AI Recent

More in this category

agentic-systems

A Critical Analysis of Trustworthy AI Tools, Mark Frameworks, and the Implementation Chasms

A critical analysis of OECD data reveals significant gaps in operationalizing Trustworthy AI (TAI) frameworks, showing that current efforts are fragmented and heavily focused on post-development stages while neglecting early design and sustainability.

agentic-systems

Do Coding Agents Need Executable World Models, Simplification, and Verification to Solve ARC-AGI-3?

A new study evaluates whether coding agents need executable world models, simplification, and verification to solve the challenging ARC-AGI-3 benchmark, revealing that a complete verification treatment combined with advanced GPT models can fully solve the public dataset.

agentic-systems

Precise but Uncoupled: Reviewer Precision Does Not Guarantee Critique Uptake in Multi-Agent Math Reasoning

A new study reveals that in multi-agent AI systems, a reviewer's ability to detect errors is empirically decoupled from the system's ability to adopt those critiques. Surprisingly, broadcast-style peer discussion outperforms hierarchical pipelines in complex math reasoning, despite the latter having more precise reviewers.

agentic-systems

AnovaX: A Local, Multi-Agent Voice Assistant with LLM Planning, Typed Executors, and Adaptive Recovery

AnovaX is a local-first desktop voice assistant that uses an LLM planner and a multi-agent orchestrator to execute complex tasks directly on the user's machine. It features adaptive recovery, specialized agent classes, and a remote control interface, demonstrating powerful local automation with minimal code.

agentic-systems

Cura 1T: Specialized Model for Agentic Healthcare

Researchers have introduced Cura 1T, a specialized large language model for healthcare trained through a human-gated self-evolution loop. The model excels at complex tasks ranging from patient consultation and multimodal clinical reasoning to utilizing electronic health records.

agentic-systems

GraphDx: A Cost-Aware Knowledge-Enhanced Multi-Agent Framework for Sequential Diagnosis

Researchers have introduced GraphDx, a novel multi-agent framework that balances diagnostic accuracy and resource costs in clinical settings. By combining knowledge graphs with LLMs, GraphDx improves diagnostic success rates up to 93% while cutting testing costs by up to 54%.