Mask-Proof: An LLM-based Automated Data Curation Pipeline on Mathematical Proofs

Researchers have introduced Mask-Proof, an automated pipeline for evaluating the step-level mathematical reasoning of large language models (LLMs). It converts real proofs into automatically checkable masked-step tasks, and its LLM-based evaluator reaches 96.8% agreement with expert annotators.



Abstract: Large language models (LLMs) are increasingly capable of mathematical problem solving and can even assist with research-level proofs, yet we still lack a scalable and reproducible way to measure step-level reasoning in long proofs across diverse sources. This evaluation gap limits trustworthy AI assistance in proof-certified scientific progress. Existing evaluations often emphasize final answers or rely on costly expert grading, while end-to-end proof generation remains open-ended and hard to verify automatically. We introduce Mask-Proof, a pipeline that turns real proofs into automatically checkable masked-step tasks. It masks key formula steps, provides the necessary surrounding context, and evaluates model reconstructions with an LLM-based equivalence judge using repeated votes for stability. The resulting Mask-ProofBench contains 292 curated problems across diverse research areas. Experiments with 17 models show that reasoning-enhanced models outperform standard models by 12% to 27%. Our evaluator achieves 96.8% agreement with expert annotators, enabling faithful, reproducible, and comparable measurement of step-level mathematical reasoning. Benchmark, annotations, and code are available publicly.
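The masking and repeated-vote judging described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `mask_step` and `judge_by_majority`, the `[MASK]` token, the default of five votes, and the simple majority rule are all assumptions made for illustration, and the `judge` callable stands in for whatever LLM-based equivalence check the pipeline actually uses.

```python
from collections import Counter
from typing import Callable


def mask_step(steps: list[str], index: int, mask_token: str = "[MASK]") -> tuple[list[str], str]:
    """Hide one proof step; return the masked proof and the hidden reference step.

    Hypothetical helper: the paper masks "key formula steps", but how those
    steps are selected is not specified here.
    """
    masked = list(steps)          # copy so the original proof is untouched
    hidden = masked[index]
    masked[index] = mask_token
    return masked, hidden


def judge_by_majority(
    judge: Callable[[str, str], bool],
    reference: str,
    candidate: str,
    votes: int = 5,
) -> bool:
    """Query an equivalence judge several times and return the majority verdict.

    `judge` is an assumed interface (reference, candidate) -> bool; in practice
    it would wrap a call to an LLM. An odd vote count avoids ties.
    """
    if votes < 1 or votes % 2 == 0:
        raise ValueError("votes must be a positive odd number")
    tally = Counter(judge(reference, candidate) for _ in range(votes))
    return tally[True] > tally[False]


if __name__ == "__main__":
    proof = ["a = b", "b = c", "a = c"]
    masked, hidden = mask_step(proof, 1)

    # Stub judge for demonstration only: compares strings ignoring whitespace.
    def stub_judge(ref: str, cand: str) -> bool:
        return ref.replace(" ", "") == cand.replace(" ", "")

    print(masked)
    print(judge_by_majority(stub_judge, hidden, "b=c", votes=3))
```

Repeating the vote matters because a single LLM verdict can vary between calls; aggregating several verdicts is one simple way to stabilise the score, which is the motivation the abstract gives for repeated votes.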

Source: arXiv cs.AI Recent
