Procedural Memory Distillation: Online Reflection for Self-Improving Language Models

New research introduces Procedural Memory Distillation (PMD), which lets large language models learn from their own past rollouts to self-improve. PMD improves over the SDPO baseline on scientific and coding benchmarks (SCIKNOWEVAL and LIVECODEBENCH), and because the memory is distilled into the policy's weights during training, the resulting model needs no memory at inference.

Computer Science > Artificial Intelligence

Abstract: Reinforcement learning with verifiable rewards (RLVR), along with recent self-distillation variants such as SDPO, evaluates each rollout against a verifier and updates the policy from that episode-level signal. However, the richer procedural information in the rollout is rarely retained or reused. Across episodes and epochs, the model repeatedly encounters related problems under a changing policy, producing cross-episode signals that episode-local updates cannot capture: which strategies consistently pass verification, which failure modes persist, which patterns recur. We propose Procedural Memory Distillation (PMD), which converts these cross-episode signals into reusable procedural memory and distills it into the policy's weights during training. This memory functions as a training scaffold, absorbed into the policy itself, yielding a memory-free model at inference. PMD organizes the memory at three levels of abstraction: raw trajectories, self-reflected strategies and lessons, and higher-level behavioral patterns that recur across problems, all extracted online from the model's own trajectories. A memory-conditioned self-teacher draws on the accumulated experience to supervise the student on its own rollouts, enabling the student to progressively internalize procedural knowledge within its parameters. The central design principle is co-evolution: the policy generates rollouts that update the memory, and the memory shapes the supervision that updates the policy. Empirically, across Qwen3-8B and OLMo3-Instruct-7B, PMD improves over SDPO by 3.8-5.5% on SCIKNOWEVAL and 7.9-13.6% on LIVECODEBENCH. Co-evolution powers these gains: freezing either the memory or the policy trails PMD by more than 10% across SCIKNOWEVAL domains.
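The co-evolution loop described in the abstract can be pictured with a small sketch. This is an illustration only, not the authors' implementation: the class `ProceduralMemory`, the function `co_evolve`, and every callable passed to it (`policy`, `verifier`, `reflect`, `teacher`, `distill`) are hypothetical names, and the choice to update the memory before querying the teacher is an assumption the abstract does not confirm. In a real system each callable would wrap a language model call or a gradient step.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ProceduralMemory:
    """Three abstraction levels named in the abstract; field names are hypothetical."""
    trajectories: list = field(default_factory=list)    # level 1: raw trajectories
    lessons: list = field(default_factory=list)         # level 2: self-reflected strategies and lessons
    patterns: Counter = field(default_factory=Counter)  # level 3: recurring behavioral patterns

    def update(self, problem, rollout, passed, lesson, pattern):
        # Record one episode at all three levels.
        self.trajectories.append((problem, rollout, passed))
        self.lessons.append((problem, lesson))
        self.patterns[pattern] += 1

    def context(self, top_k=2):
        # Assumed retrieval rule: hand the teacher the most frequent patterns.
        return [name for name, _ in self.patterns.most_common(top_k)]


def co_evolve(problems, policy, verifier, reflect, teacher, distill, memory, epochs=1):
    """Alternate memory updates and policy updates (ordering is an assumption)."""
    for _ in range(epochs):
        for problem in problems:
            rollout = policy(problem)                 # student generates a rollout
            passed = verifier(problem, rollout)       # verifiable reward
            lesson, pattern = reflect(problem, rollout, passed)  # online self-reflection
            memory.update(problem, rollout, passed, lesson, pattern)
            # Memory-conditioned self-teacher supervises the student's own rollout.
            target = teacher(problem, rollout, memory.context())
            distill(problem, rollout, target)         # would be a weight update in practice
    return memory
```

After training, only the policy is kept; the `memory` object is discarded, which mirrors the abstract's claim of a memory-free model at inference.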

Source: arXiv cs.AI Recent
