NOW LET US – AI RAG SaaS Studio TP.HCM

PHREEQC-MCQ-200: A Diagnostic Benchmark for Tool-Augmented Scientific Simulator Agents

The paper introduces PHREEQC-MCQ-200, a benchmark for evaluating tool-augmented LLM agents on deterministic aqueous-geochemistry simulations. It finds that tool access improves overall accuracy but also causes regressions on items the agents answered correctly without tools, and that the output-access protocol affects both token cost and accuracy.

Abstract: Large language model agents are increasingly connected to scientific software, yet it remains unclear when tool access makes scientific computation more reliable rather than merely more complex. We introduce PHREEQC-MCQ-200, a benchmark for evaluating tool-augmented agents on deterministic aqueous-geochemistry simulations. The benchmark contains 200 multiple-choice questions derived from 21 validated PHREEQC scenarios, requiring agents to construct simulator inputs, execute PHREEQC, inspect structured outputs, and commit to final answers.

Across multiple frontier and mid-tier model families, simulator access substantially improves aggregate accuracy, confirming that grounded execution is necessary for many scientific-computation tasks. However, the gains are not monotonic: tool-augmented agents also lose items they answered correctly without tools, revealing regressions that average accuracy alone hides. We further show that output-access protocol matters. A table-of-contents interface can reduce token cost while preserving or improving accuracy for stronger models, but it degrades performance for mid-tier models that cannot reliably navigate structured simulator outputs.
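To make the output-access idea concrete, here is a minimal sketch of what a table-of-contents interface could look like. The abstract does not describe the benchmark's actual interface, so everything here is an assumption for illustration: the section names, the sample values, and the helpers `table_of_contents`, `read_section`, and `full_dump` are hypothetical and are not taken from the paper or from PHREEQC itself.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a parsed simulator output (section name -> text).
# The section names and numbers are made up for illustration only.
SAMPLE_OUTPUT: Dict[str, str] = {
    "solution_composition": "Ca 1.0e-03\nCl 2.0e-03",
    "species_distribution": "Ca+2 9.5e-04\nCaCl+ 5.0e-05",
    "saturation_indices": "Calcite -0.12\nGypsum -1.80",
}


def table_of_contents(sections: Dict[str, str]) -> List[Tuple[str, int]]:
    """Return (section name, line count) pairs instead of the full text."""
    return [(name, len(text.splitlines())) for name, text in sections.items()]


def read_section(sections: Dict[str, str], name: str) -> str:
    """Return one section on request; the agent must pick the right name."""
    if name not in sections:
        raise KeyError(f"unknown section: {name}")
    return sections[name]


def full_dump(sections: Dict[str, str]) -> str:
    """Baseline protocol: hand the agent every section at once."""
    return "\n".join(f"[{name}]\n{text}" for name, text in sections.items())


if __name__ == "__main__":
    print(table_of_contents(SAMPLE_OUTPUT))
    print(read_section(SAMPLE_OUTPUT, "saturation_indices"))
```

Under this sketch, the agent first sees only the short listing and then requests individual sections. That saves tokens when the agent picks the right section, but it adds a navigation step that can fail, which is one plausible reading of why the abstract reports degraded performance for mid-tier models.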

PHREEQC-MCQ-200 therefore frames scientific tool use as an end-to-end diagnostic problem rather than a simple tool-calling capability. We argue that evaluations of scientific agents should report not only accuracy, but also item-level retention, output-access sensitivity, trajectory failures, and where the computation chain breaks.
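As a rough illustration of item-level retention, the sketch below compares per-item correctness with and without tool access. The abstract does not give formal metric definitions, so the function name `retention_report`, the field names, and the definition of retention used here (the share of items correct without tools that stay correct with tools) are assumptions, and the toy data is invented.

```python
from typing import Dict, List, Union

Report = Dict[str, Union[float, List[str]]]


def retention_report(no_tool: Dict[str, bool], with_tool: Dict[str, bool]) -> Report:
    """Compare per-item correctness between two runs over the shared items."""
    items = sorted(no_tool.keys() & with_tool.keys())
    if not items:
        raise ValueError("the two runs share no items")

    # Items correct without tools but wrong with tools: the hidden regressions.
    regressed = [i for i in items if no_tool[i] and not with_tool[i]]
    # Items wrong without tools but correct with tools: the gains.
    gained = [i for i in items if not no_tool[i] and with_tool[i]]
    retained = [i for i in items if no_tool[i] and with_tool[i]]

    baseline_correct = len(retained) + len(regressed)
    return {
        "accuracy_no_tool": sum(no_tool[i] for i in items) / len(items),
        "accuracy_with_tool": sum(with_tool[i] for i in items) / len(items),
        # Assumed definition: fraction of baseline-correct items still correct.
        "retention": len(retained) / baseline_correct if baseline_correct else 1.0,
        "regressed": regressed,
        "gained": gained,
    }


if __name__ == "__main__":
    # Invented toy data: accuracy rises from 0.5 to 0.75, yet q2 regresses.
    no_tool = {"q1": True, "q2": True, "q3": False, "q4": False}
    with_tool = {"q1": True, "q2": False, "q3": True, "q4": True}
    print(retention_report(no_tool, with_tool))
```

In the toy data, aggregate accuracy improves while retention is only 0.5. Average accuracy alone would hide that pattern.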

Source: arXiv cs.AI Recent
