NOW LET US – AI RAG SaaS Studio TP.HCM

Agents' Last Exam


A new benchmark, "Agents' Last Exam" (ALE), evaluates AI agents on long-horizon, economically valuable, real-world tasks. On its hardest tier, the average full pass rate across mainstream harness and backbone configurations is just 2.6%.

Abstract: Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 subfields grouped into 13 industry clusters covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is 2.6%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP-relevant impact.

Source: arXiv cs.AI Recent
