Stepwise: Neuro-Symbolic Proof Search for Automated Systems Verification

Stepwise is a neuro-symbolic framework that combines fine-tuned LLMs with symbolic interactive theorem proving (ITP) tools to automate proof search for systems verification, proving up to 77.6% of the theorems in the FVEL seL4 benchmark.

Abstract: Formal verification via interactive theorem proving is increasingly used to ensure the correctness of critical systems, yet constructing large proof scripts remains highly manual and limits scalability. Advances in large language models (LLMs), especially in mathematical reasoning, make their integration into software verification increasingly promising. This paper introduces a neuro-symbolic proof generation framework designed to automate proof search for systems-level verification projects. The framework performs a best-first tree search over proof states, repeatedly querying an LLM for the next candidate proof step. On the neural side, we fine-tune LLMs using datasets of proof state-step pairs; on the symbolic side, we incorporate a range of ITP tools to repair rejected steps, filter and rank proof states, and automatically discharge subgoals when search progress stalls. This synergy enables data-efficient LLM adaptation and semantics-informed pruning of the search space. We implement the framework on a new Isabelle REPL that exposes fine-grained proof states and automation tools, and evaluate it on the FVEL seL4 benchmark and additional Isabelle developments. On seL4, the system proves up to 77.6% of the theorems, substantially surpassing previous LLM-based approaches and standalone Sledgehammer, while solving significantly more multi-step proofs. Results across further benchmarks demonstrate strong generalization, indicating a viable path toward scalable automated software verification.
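The best-first search over proof states described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`best_first_search`, `toy_propose`, `toy_apply`) and the integer "proof state" are hypothetical stand-ins, the LLM is replaced by a fixed list of candidate steps, and the theorem prover is replaced by a toy checker that rejects invalid steps by returning `None`. Step repair and automatic subgoal discharge are omitted.

```python
import heapq
from itertools import count
from typing import Callable, Hashable, Iterable, List, Optional


def best_first_search(
    initial: Hashable,
    propose: Callable[[Hashable], Iterable[str]],
    apply_step: Callable[[Hashable, str], Optional[Hashable]],
    is_proved: Callable[[Hashable], bool],
    score: Callable[[Hashable], float],
    max_expansions: int = 1000,
) -> Optional[List[str]]:
    """Return a list of steps that proves `initial`, or None if the budget runs out.

    Lower `score` means a more promising state. `apply_step` returns None
    when the (hypothetical) prover rejects a step.
    """
    tie = count()  # unique tie-breaker so the heap never compares states
    frontier = [(score(initial), next(tie), initial, [])]
    seen = {initial}
    expansions = 0
    while frontier and expansions < max_expansions:
        _, _, state, script = heapq.heappop(frontier)
        if is_proved(state):
            return script
        expansions += 1
        for step in propose(state):  # stand-in for querying an LLM
            nxt = apply_step(state, step)  # stand-in for the prover check
            if nxt is None or nxt in seen:
                continue  # drop rejected steps and duplicate states
            seen.add(nxt)
            heapq.heappush(frontier, (score(nxt), next(tie), nxt, script + [step]))
    return None


# Toy domain (assumption, for illustration only): a state is the number of
# remaining obligations, and the goal is proved when it reaches 0.
def toy_propose(state: int) -> List[str]:
    return ["bad", "half", "sub1"]


def toy_apply(state: int, step: str) -> Optional[int]:
    if step == "sub1" and state > 0:
        return state - 1
    if step == "half" and state > 0 and state % 2 == 0:
        return state // 2
    return None  # step rejected


if __name__ == "__main__":
    proof = best_first_search(6, toy_propose, toy_apply, lambda n: n == 0, lambda n: n)
    print(proof)  # ['half', 'sub1', 'half', 'sub1']
```

In this sketch the score function plays the role of the ranking of proof states, and the `None` return from `toy_apply` plays the role of the prover rejecting a candidate step. A real system would replace both with calls into the theorem prover.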

Source: arXiv cs.AI Recent
