Beyond Compilation: Evaluating Faithful Natural-Language-to-Lean Statement Formalization

A new study reveals that successfully compiling AI-generated Lean statements does not guarantee semantic accuracy, exposing a significant gap and proposing a more rigorous evaluation framework.

Computer Science > Artificial Intelligence

Title:Beyond Compilation: Evaluating Faithful Natural-Language-to-Lean Statement Formalization

View PDFAbstract:Theorem-proving benchmarks evaluate proof search against fixed formal statements, but natural-language-to-Lean formalization must generate the formal statement itself. In this setting, compilation is only a validity check: a Lean declaration may type-check while omitting hypotheses, changing domains, or expressing a vacuous claim. We study faithful statement formalization as both an evaluation problem and a bottleneck-attribution problem. On a 400-entry graduate-level benchmark spanning real analysis, complex analysis, topology, and algebra, our protocol combines Lean compilation, cross-model semantic judging, and human expert calibration. The resulting picture is different from compile-rate evaluation: a full tool-augmented agent reaches 89.5% compilation but only 60.5% consensus faithfulness, exposing a 29.0-point compile-pass but consensus-unfaithful gap. Targeted human audits support the metric as a conservative decision boundary: across available case-level audits, 96.0% of consensus-positive outputs are human-confirmed faithful, while 82.4% of compile-pass consensus-negative outputs are human-confirmed semantic failures. Under this metric, existing one-shot formalizer models and prover-oriented Lean models remain low, suggesting that formal validity, proof-oriented Lean competence, and faithful statement generation should be reported separately. We then use a full $2^3$ factorial design to decompose three recurring interventions in formalization pipelines: parametric expert drafting, Mathlib/context search, and Lean elaboration feedback. Elaboration feedback is the largest validity intervention, but it also exposes a larger compile-pass semantic-failure bucket; search mainly improves grounding and selectivity; and fine-tuned drafting is largely substitutable in this tool stack once feedback and grounding are available.

Current browse context:

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.

Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.

Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs.

Source: arXiv cs.AI Recent

Beyond Compilation: Evaluating Faithful Natural-Language-to-Lean Statement Formalization

Computer Science > Artificial Intelligence

Title:Beyond Compilation: Evaluating Faithful Natural-Language-to-Lean Statement Formalization

Current browse context:

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

More in this category

Investigating Multi-Agent Deliberation in Law

HyPOLE: Hyperproperty-Guided Multi-Agent Reinforcement Learning under Partial Observation

What Drives Interactive Improvement from Feedback?

MultiUAV-Plat: An LLM-Oriented Platform, Benchmark and Framework for Multi-UAV Collaborative Task Planning

AgRefactor: Self-Evolving Agentic Workflow for HLS Compatibility and Performance

Contrastive Reflection for Iterative Prompt Optimization

Discover All Categories