AGENTIC-SYSTEMS | March 20, 2026 | 1 min read

ProRL Agent: Rollout-as-a-Service for RL Training of Multi-Turn LLM Agents

ProRL Agent introduces a scalable Rollout-as-a-Service infrastructure that decouples rollout orchestration from the RL training loop, supporting complex multi-turn LLM agents across diverse tasks.

Abstract: Multi-turn LLM agents are increasingly important for solving complex, interactive tasks, and reinforcement learning (RL) is a key ingredient for improving their long-horizon behavior. However, RL training requires generating large numbers of sandboxed rollout trajectories, and existing infrastructures often couple rollout orchestration with the training loop, making systems hard to migrate and maintain. Under the rollout-as-a-service philosophy, we present ProRL Agent, a scalable infrastructure that serves the full agentic rollout lifecycle through an API service. ProRL Agent also provides standardized and extensible sandbox environments that support diverse agentic tasks in rootless HPC settings. We validate ProRL Agent through RL training on software engineering, math, STEM, and coding tasks. ProRL Agent is open-sourced and integrated as part of NVIDIA NeMo Gym.
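
To make the decoupling idea concrete, here is a minimal Python sketch of a trainer that only talks to a rollout service through a submit/result interface. It is not ProRL Agent's actual API: `RolloutService`, `Trajectory`, `toy_policy`, and `collect_batch` are hypothetical names invented for illustration, the "service" runs in-process on a thread pool rather than behind a network API, and the reward rule is a toy stand-in for a sandboxed environment.

```python
from __future__ import annotations

import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Trajectory:
    """One multi-turn rollout: the actions taken and a scalar reward."""
    task_id: str
    turns: list[str] = field(default_factory=list)
    reward: float = 0.0


# Hypothetical policy type: maps the history so far to the next action.
Policy = Callable[[list[str]], str]


class RolloutService:
    """Hypothetical in-process stand-in for a rollout API service."""

    def __init__(self, policy: Policy, max_turns: int = 3, workers: int = 4) -> None:
        self._policy = policy
        self._max_turns = max_turns
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._jobs: dict[str, Future[Trajectory]] = {}

    def submit(self, task_id: str, prompt: str) -> str:
        """Start a rollout in the background and return a job id at once."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = self._pool.submit(self._run, task_id, prompt)
        return job_id

    def result(self, job_id: str) -> Trajectory:
        """Block until the rollout finishes, then return its trajectory."""
        return self._jobs.pop(job_id).result()

    def close(self) -> None:
        self._pool.shutdown()

    def _run(self, task_id: str, prompt: str) -> Trajectory:
        history = [prompt]
        for _ in range(self._max_turns):
            action = self._policy(history)
            history.append(action)
            if action == "DONE":
                break
        # Toy reward (assumption): 1.0 if the agent terminated on its own.
        reward = 1.0 if history[-1] == "DONE" else 0.0
        return Trajectory(task_id=task_id, turns=history[1:], reward=reward)


def toy_policy(history: list[str]) -> str:
    """Placeholder for an LLM call: acts twice, then stops."""
    return "DONE" if len(history) >= 3 else f"step-{len(history)}"


def collect_batch(service: RolloutService, tasks: dict[str, str]) -> list[Trajectory]:
    """Trainer-side code: knows the service interface, not how rollouts run."""
    job_ids = [service.submit(task_id, prompt) for task_id, prompt in tasks.items()]
    return [service.result(job_id) for job_id in job_ids]


if __name__ == "__main__":
    service = RolloutService(toy_policy)
    for traj in collect_batch(service, {"t1": "fix the bug", "t2": "solve x+1=2"}):
        print(traj.task_id, traj.turns, traj.reward)
    service.close()
```

Because `collect_batch` depends only on `submit` and `result`, the rollout backend could in principle be swapped (for example, for a remote service) without touching the training loop, which is the maintainability argument the abstract makes.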

Source: arXiv cs.AI Recent
