NOW LET US – AI RAG SaaS Studio TP.HCM

Reasoning or Memorization? Direction-Aware Diversity Exploration in LLM Reinforcement Learning


Exploration methods in reinforcement learning for LLMs typically reward novel trajectories without distinguishing genuine reasoning from variations on memorized shortcuts. To address this, researchers propose DiRL, a direction-aware framework that shapes rewards to steer exploration toward reasoning.


Abstract: Reinforcement learning has become a key paradigm for eliciting reasoning abilities in large language models, where exploration is crucial for discovering effective solution trajectories. Existing exploration methods typically encourage diversity in semantic or gradient spaces, without distinguishing what drives this diversity. A trajectory may appear novel because it follows a new reasoning process, or because it varies memorized patterns and shortcuts. Rewarding both cases equally may steer exploration toward memorization rather than genuine reasoning improvement. In this paper, we propose DiRL, a Direction-Aware Reinforcement Learning framework that anchors exploration to an internal reasoning-memorization direction of the policy. Specifically, DiRL extracts this direction from model representations, constructs direction-weighted gradient features to characterize rollout updates, and shapes rewards to amplify reasoning-aligned exploration while suppressing memorization-aligned variations. DiRL integrates seamlessly into standard Group Relative Policy Optimization (GRPO). Extensive experiments on mathematical and general reasoning benchmarks demonstrate the effectiveness of DiRL, showing significant improvements over various existing exploration methods.
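The abstract does not give DiRL's formulas, so the following is only a toy sketch of the general idea, not the authors' method. It assumes three things that are not in the source: the direction is taken as a difference of mean hidden states between reasoning-like and memorization-like examples, each rollout is summarized by a small feature vector standing in for the paper's direction-weighted gradient features, and the shaping term is a cosine alignment scaled by a hypothetical coefficient `beta`. Only the last step, normalizing rewards within a group, follows the standard GRPO advantage (here with the population standard deviation).

```python
import math
from statistics import mean, pstdev


def difference_of_means_direction(reasoning_states, memorization_states):
    """Assumed extraction rule: unit vector from the memorization centroid
    to the reasoning centroid. The paper may extract the direction differently."""
    dim = len(reasoning_states[0])
    mu_r = [mean(v[k] for v in reasoning_states) for k in range(dim)]
    mu_m = [mean(v[k] for v in memorization_states) for k in range(dim)]
    diff = [a - b for a, b in zip(mu_r, mu_m)]
    norm = math.sqrt(sum(x * x for x in diff)) or 1.0
    return [x / norm for x in diff]


def alignment(feature, direction):
    """Cosine similarity between a rollout feature and the direction."""
    dot = sum(f * d for f, d in zip(feature, direction))
    nf = math.sqrt(sum(f * f for f in feature))
    nd = math.sqrt(sum(d * d for d in direction))
    if nf == 0.0 or nd == 0.0:
        return 0.0
    return dot / (nf * nd)


def shaped_rewards(task_rewards, features, direction, beta=0.1):
    """Hypothetical shaping: add a bonus for reasoning-aligned rollouts and a
    penalty for memorization-aligned ones (negative alignment)."""
    return [r + beta * alignment(f, direction)
            for r, f in zip(task_rewards, features)]


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: standardize rewards within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of two rollouts with identical task reward but opposite alignment.
direction = difference_of_means_direction(
    [[1.0, 0.0], [1.0, 0.0]],   # made-up "reasoning" hidden states
    [[0.0, 0.0], [0.0, 0.0]],   # made-up "memorization" hidden states
)
rewards = shaped_rewards([1.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]], direction, beta=0.5)
advantages = grpo_advantages(rewards)
```

In this toy group both rollouts earn the same task reward, so plain GRPO would assign them equal (zero) advantage. With the assumed shaping term, the rollout aligned with the reasoning direction ends up with the higher advantage, which is the qualitative behavior the abstract describes.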


© 2026 Now Let Us. All rights reserved.

Source: arXiv cs.AI Recent

