NOW LET US – AI RAG SaaS Studio TP.HCM

Procedural Memory Distillation: Online Reflection for Self-Improving Language Models


New research introduces Procedural Memory Distillation (PMD), a training method that lets large language models learn from their own past rollouts. PMD outperforms the SDPO baseline on coding and scientific benchmarks, and because the memory is absorbed into the model's weights during training, the resulting model needs no memory at inference.


Abstract: Reinforcement learning with verifiable rewards (RLVR), along with recent self-distillation variants such as SDPO, evaluates each rollout against a verifier and updates the policy from that episode-level signal. However, the richer procedural information in the rollout is rarely retained or reused. Across episodes and epochs, the model repeatedly encounters related problems under a changing policy, producing cross-episode signals that episode-local updates cannot capture: which strategies consistently pass verification, which failure modes persist, which patterns recur. We propose Procedural Memory Distillation (PMD), which converts these cross-episode signals into reusable procedural memory and distills it into the policy's weights during training. This memory functions as a training scaffold, absorbed into the policy itself, yielding a memory-free model at inference. PMD organizes the memory at three levels of abstraction: raw trajectories, self-reflected strategies and lessons, and higher-level behavioral patterns that recur across problems, all extracted online from the model's own trajectories. A memory-conditioned self-teacher draws on the accumulated experience to supervise the student on its own rollouts, enabling the student to progressively internalize procedural knowledge within its parameters. The central design principle is co-evolution: the policy generates rollouts that update the memory, and the memory shapes the supervision that updates the policy. Empirically, across Qwen3-8B and OLMo3-Instruct-7B, PMD improves over SDPO by 3.8-5.5% on SCIKNOWEVAL and 7.9-13.6% on LIVECODEBENCH. Co-evolution powers these gains: freezing either the memory or the policy trails PMD by more than 10% across SCIKNOWEVAL domains.
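The co-evolution loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class and function names (`ProceduralMemory`, `co_evolve`, `reflect`, `distill`), the rule that promotes a repeated lesson to a pattern, and the order of the memory update relative to the teacher call are all assumptions made for illustration. The policy, verifier, and distillation step are passed in as plain callables, so no real model or training library is involved.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ProceduralMemory:
    """Three memory levels named in the abstract (field names are hypothetical)."""
    trajectories: List[dict] = field(default_factory=list)  # raw rollouts plus verifier outcome
    lessons: List[str] = field(default_factory=list)        # self-reflected strategies and lessons
    patterns: List[str] = field(default_factory=list)       # behaviors that recur across problems

    def update(self, problem: str, rollout: str, passed: bool,
               reflect: Callable[[str, bool], str]) -> None:
        self.trajectories.append(
            {"problem": problem, "rollout": rollout, "passed": passed}
        )
        lesson = reflect(rollout, passed)
        self.lessons.append(lesson)
        # Toy promotion rule (assumption): a lesson seen twice becomes a pattern.
        if self.lessons.count(lesson) >= 2 and lesson not in self.patterns:
            self.patterns.append(lesson)

    def as_context(self, k: int = 3) -> str:
        """Render the patterns and the k most recent lessons as teacher context."""
        return "\n".join(self.patterns + self.lessons[-k:])


def co_evolve(problems, policy, verifier, reflect, distill, memory, epochs=1):
    """Policy rollouts update the memory; the memory shapes the supervision."""
    for _ in range(epochs):
        for problem in problems:
            rollout = policy(problem, "")              # student acts without memory
            passed = verifier(problem, rollout)
            memory.update(problem, rollout, passed, reflect)  # policy -> memory
            # Self-teacher: the same policy, conditioned on accumulated memory.
            teacher_target = policy(problem, memory.as_context())
            distill(problem, rollout, teacher_target)  # memory -> policy
    return memory
```

In this sketch the teacher produces a target output, whereas a self-distillation method would more plausibly have the teacher supervise the student's own rollout token by token; the simplification keeps the example free of any model dependency. Because `policy` is called with an empty context for the student, nothing in the student path depends on the memory, which mirrors the memory-free inference the abstract describes.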


© 2026 Now Let Us. All rights reserved.

Source: arXiv cs.AI Recent


More in this category

agentic-systems

Auto-FL-Research: Agentic Search for Federated Learning Algorithms

Researchers introduce Auto-FL-Research (AFR), a constrained coding-agent workflow designed to automate the search and optimization of Federated Learning algorithms. This approach addresses the costly manual trial-and-error process, paving the way for more efficient decentralized AI development.

agentic-systems

PACE: A Neuro-Symbolic Framework for Plausible and Actionable Counterfactual Explanations

Researchers have introduced PACE, a novel neuro-symbolic framework designed to generate plausible and actionable counterfactual explanations for machine learning models. By separating neural prediction from symbolic reasoning, PACE successfully addresses the limitations of traditional methods that often produce unrealistic recommendations.

agentic-systems

CreativityNeuro: Steering Language Model Weights to Improve Divergent Thinking and Reduce Mode Collapse

Researchers introduce CreativityNeuro, a data-free method that enhances divergent thinking in LLMs via contrastive weight steering, significantly reducing mode collapse and improving creative performance without re-training.

agentic-systems

The Wiola Architecture for Efficient Small Language Models

Researchers have introduced Wiola, a novel Small Language Model (SLM) architecture built from first principles without inheriting from GPT or LLaMA. Featuring five core technological innovations, Wiola promises superior performance and optimized computational efficiency for small-scale AI models.

agentic-systems

Making Failure Safe: A Constrained, Verifiable Agent Framework for Open-Web Data Collection

Researchers have proposed a new constrained, verifiable agent framework that shifts LLM output from free-form code to typed JSON configurations, addressing common web scraping errors. This approach minimizes operational costs by using zero LLM tokens during execution while ensuring high reusability.

agentic-systems

Mnemosyne: Agentic Transaction Processing for Validating and Repairing AI-generated Workflows

Researchers introduce Mnemosyne, an open-source runtime utilizing Agentic Transaction Processing (ATP) to validate and repair AI-generated workflows, ensuring system correctness and safety against untrusted proposals from Large Language Models (LLMs).
