AGI Maze as a Benchmark Framework for World-Modeling Agents

A new research paper introduces AGI Maze, a benchmark framework designed to evaluate how AI agents build and manipulate internal world models. Initial evaluations show that even powerful LLMs struggle to solve simple mazes that humans can easily navigate.


Abstract: Large language models (LLMs) are powerful pattern-completion systems, but their default operating mode (predicting the next token from a static context) does not reliably produce persistent, manipulable representations of an external world. Many tasks that look like "reasoning" in text become substantially harder once the environment is partially observable, stateful, and requires memory and structured hypotheses about hidden state. AGI Maze is a lightweight framework for building such environments without requiring high-dimensional sensory inputs. It provides a family of grid-based maze tasks with a clean API and multiple difficulty regimes. The goal is to create benchmarks where agents must learn and use world state representations, not just infer a local rule over readily provided observations. We provide an initial evaluation of several vanilla LLMs on simple mazes showing that they fail to represent mazes internally at LLM inference time. We also introduce a baseline agent, which is allowed to use its message history as a working memory to construct descriptions of observations at agentic runtime. Although this can improve performance, it is still insufficient for an LLM agent to reliably solve even small mazes within a step budget that is more than enough for humans.
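To make the setting concrete, here is a minimal sketch of a partially observable grid maze and an explorer that builds an explicit map from local observations. It is not the AGI Maze API, which the abstract does not show: the names `GridMaze`, `observe`, `step`, `explore`, and the layout are all hypothetical, and the explorer is a plain deterministic search, not an LLM agent. It only illustrates why remembered world state matters when each observation covers just the neighbouring cells.

```python
from collections import deque

# Hypothetical layout (not from the paper): '#' wall, '.' free, 'S' start, 'G' goal.
LAYOUT = [
    "#######",
    "#S..#G#",
    "#.#.#.#",
    "#.#...#",
    "#######",
]
MOVES = {"N": (-1, 0), "S": (1, 0), "W": (0, -1), "E": (0, 1)}


class GridMaze:
    """Partially observable maze: the agent only sees its four neighbours."""

    def __init__(self, layout):
        self.grid = [list(row) for row in layout]
        self.pos = self._find("S")
        self.goal = self._find("G")

    def _find(self, ch):
        for r, row in enumerate(self.grid):
            for c, cell in enumerate(row):
                if cell == ch:
                    return (r, c)
        raise ValueError(f"{ch!r} not in layout")

    def observe(self):
        """Map each direction to True (open) or False (wall)."""
        r, c = self.pos
        return {d: self.grid[r + dr][c + dc] != "#" for d, (dr, dc) in MOVES.items()}

    def step(self, direction):
        """Move if the target cell is open; return True once the goal is reached."""
        dr, dc = MOVES[direction]
        r, c = self.pos[0] + dr, self.pos[1] + dc
        if self.grid[r][c] != "#":
            self.pos = (r, c)
        return self.pos == self.goal


def _path_to_unvisited(start, open_cells, visited):
    """Breadth-first search over the remembered map to the nearest unvisited cell."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        cell, path = queue.popleft()
        if cell not in visited:
            return path
        for d, (dr, dc) in MOVES.items():
            nxt = (cell[0] + dr, cell[1] + dc)
            if nxt in open_cells and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [d]))
    return None


def explore(env, step_budget=50):
    """Explore using an explicit map; return the steps used, or None on failure."""
    pos = (0, 0)  # the agent's own coordinate frame; it never sees absolute positions
    visited, open_cells, steps = {pos}, {pos}, 0
    while steps < step_budget:
        # Fold the current local observation into the remembered map.
        for d, is_open in env.observe().items():
            if is_open:
                open_cells.add((pos[0] + MOVES[d][0], pos[1] + MOVES[d][1]))
        path = _path_to_unvisited(pos, open_cells, visited)
        if path is None:
            return None  # everything reachable was explored without finding the goal
        for d in path:
            done = env.step(d)
            pos = (pos[0] + MOVES[d][0], pos[1] + MOVES[d][1])
            visited.add(pos)
            steps += 1
            if done:
                return steps
            if steps >= step_budget:
                return None
    return None


env = GridMaze(LAYOUT)
steps_used = explore(env)
print("reached goal:", env.pos == env.goal, "steps:", steps_used)
```

The explorer succeeds only because `open_cells` and `visited` persist across steps. An agent that reacts to each `observe()` result in isolation has no way to tell a dead end it has already tried from an unexplored corridor, which is the kind of failure the abstract attributes to vanilla LLMs.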


Source: arXiv cs.AI Recent
