NOW LET US – AI RAG SaaS Studio TP.HCM

MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis


While Large Language Models (LLMs) have made significant progress, they still struggle with advanced mathematics. The newly introduced MA-ProofBench benchmark is designed to challenge AI's theorem-proving capabilities in mathematical analysis, exposing a wide gap between natural language reasoning and formal logic.
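To make the informal/formal distinction concrete, here is a minimal sketch that assumes Lean 4 with the Mathlib library. Both statements are toy examples chosen for illustration and are not taken from the benchmark. The first is a complete formal proof of the informal claim "the function x ↦ x² is continuous on ℝ". The second states a claim but leaves the proof open with `sorry`, so it compiles with a warning and would not count as a proof.

```lean
import Mathlib

-- Informal claim: "x ↦ x² is continuous on ℝ."
-- Formal version, closed with the existing Mathlib lemma `continuous_pow`.
example : Continuous (fun x : ℝ => x ^ 2) :=
  continuous_pow 2

-- An incomplete proof: the statement type-checks, but `sorry` is a placeholder,
-- so nothing has actually been proved here.
example (f : ℝ → ℝ) (hf : Continuous f) : Continuous (fun x => f x + 1) := by
  sorry
```

A formal proof only passes if every lemma it cites really exists in the library with the expected statement, which is a much stricter bar than writing a convincing argument in prose.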


Abstract: Large Language Models (LLMs) have made notable progress in automated theorem proving, yet existing formal benchmarks remain limited in both mathematical coverage and difficulty. Most are concentrated in areas that are easier to formalize, such as algebra and elementary number theory, and provide limited coverage of subfields that require deeper reasoning, including mathematical analysis. To address this gap, we introduce MA-ProofBench, to the best of our knowledge, the first formal theorem-proving benchmark dedicated to Mathematical Analysis. The benchmark contains 200 formalized theorems covering 6 core topics and 27 subcategories, including measure and integration theory, complex analysis, and functional analysis. The problems are divided into two difficulty levels, an undergraduate level (Level I, 100 problems) and a Ph.D. qualifying level (Level II, 100 problems), to evaluate how well LLMs perform formal reasoning at different mathematical depths. Each problem is constructed through a human-led, LLM-assisted formalization pipeline followed by independent expert review, ensuring that the formal statements remain faithful to the original mathematics. We evaluate a range of recent general-purpose reasoning models and formal theorem provers on MA-ProofBench. However, most models perform poorly: even the best-performing model, GPT-5.5, achieves only 16% Pass@8 on Level I and 5% on Level II, while most models stay close to 0% on Level II. Further analysis identifies Mathlib hallucinations and incomplete proofs as the two dominant failure modes, while an evaluation on the natural-language version of the benchmark exposes a clear gap between informal and formal reasoning. MA-ProofBench is intended to serve as a reliable reference for tracking progress in formal mathematical reasoning in advanced domains.
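For readers unfamiliar with the Pass@8 figures quoted in the abstract, the following is a minimal sketch of how a pass@k score can be computed. The abstract does not say which estimator the authors use, so this assumes the commonly used unbiased estimator, which for n = k = 8 reduces to "at least one of the 8 attempts was verified". The function names and the per-problem counts are hypothetical; the counts are made up so that the total matches the reported 16% on 100 Level I problems, and they are not the paper's data.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of attempts sampled, c: attempts that were verified correct,
    k: attempt budget being scored (1 <= k <= n).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a success.
        return 1.0
    # 1 - P(all k drawn attempts are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem pass@k estimates over a benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical data: 100 problems, 8 attempts each,
# 16 problems with exactly one verified proof and 84 with none.
counts = [1] * 16 + [0] * 84
score = benchmark_pass_at_k(counts, n=8, k=8)
print(f"Pass@8 = {score:.0%}")  # prints "Pass@8 = 16%"
```

Under this reading, 16% Pass@8 on Level I means that 16 of the 100 problems received at least one machine-checked proof within 8 attempts.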


© 2026 Now Let Us. All rights reserved.

Source: arXiv cs.AI Recent

