NOW LET US – AI RAG SaaS Studio TP.HCM

Mask-Proof: An LLM-based Automated Data Curation Pipeline on Mathematical Proofs


Researchers have introduced Mask-Proof, an automated pipeline for evaluating the step-level mathematical reasoning of Large Language Models (LLMs). It converts real proofs into automatically checkable masked-step tasks, and its LLM-based judge reportedly agrees with expert annotators 96.8% of the time.


Abstract: Large language models (LLMs) are increasingly capable of mathematical problem solving and can even assist with research-level proofs, yet we still lack a scalable and reproducible way to measure step-level reasoning in long proofs across diverse sources. This evaluation gap limits trustworthy AI assistance in proof-certified scientific progress. Existing evaluations often emphasize final answers or rely on costly expert grading, while end-to-end proof generation remains open-ended and hard to verify automatically. We introduce Mask-Proof, a pipeline that turns real proofs into automatically checkable masked-step tasks. It masks key formula steps, provides the necessary surrounding context, and evaluates model reconstructions with an LLM-based equivalence judge using repeated votes for stability. The resulting Mask-ProofBench contains 292 curated problems across diverse research areas. Experiments with 17 models show that reasoning-enhanced models outperform standard models by 12% to 27%. Our evaluator achieves 96.8% agreement with expert annotators, enabling faithful, reproducible, and comparable measurement of step-level mathematical reasoning. Benchmark, annotations, and code are available publicly.
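
The masking and repeated-vote judging described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the names `MaskedTask`, `make_masked_task`, `judge_by_vote`, and `toy_judge` are hypothetical, the vote count of 5 is an assumed default, and the judge is a plain string comparison standing in for an LLM call. How the paper selects which steps to mask is not specified here, so the index is supplied by hand.

```python
from dataclasses import dataclass
from typing import Callable, List

MASK_TOKEN = "[MASK]"  # assumed placeholder; the paper's actual marker is not given here


@dataclass
class MaskedTask:
    context: List[str]  # proof steps with one step replaced by MASK_TOKEN
    answer: str         # the hidden reference step


def make_masked_task(steps: List[str], index: int) -> MaskedTask:
    """Hide the step at `index`, keeping the surrounding steps as context."""
    if not 0 <= index < len(steps):
        raise IndexError("mask index out of range")
    context = list(steps)
    answer = context[index]
    context[index] = MASK_TOKEN
    return MaskedTask(context=context, answer=answer)


def judge_by_vote(
    judge: Callable[[str, str], bool],
    reference: str,
    candidate: str,
    votes: int = 5,
) -> bool:
    """Call the judge `votes` times and return the majority verdict."""
    if votes < 1 or votes % 2 == 0:
        raise ValueError("votes must be a positive odd number to avoid ties")
    yes = sum(1 for _ in range(votes) if judge(reference, candidate))
    return 2 * yes > votes


def toy_judge(reference: str, candidate: str) -> bool:
    """Stand-in for an LLM equivalence judge: whitespace-insensitive match."""
    return "".join(reference.split()) == "".join(candidate.split())


if __name__ == "__main__":
    proof = [
        "Let x > 0.",
        "x^2 + 2x + 1 = (x + 1)^2",
        "Hence x^2 + 2x + 1 > 1.",
    ]
    task = make_masked_task(proof, 1)
    print(task.context)
    print(judge_by_vote(toy_judge, task.answer, "x^2+2x+1 = (x+1)^2"))
```

With a real LLM judge, individual calls may disagree from run to run. Taking the majority over an odd number of votes is one simple way to stabilize the verdict, which is the role the abstract assigns to repeated voting.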

© 2026 Now Let Us. All rights reserved.

Source: arXiv cs.AI Recent
