An FAQ on Reinforcement Learning Environments

An exploration of how reinforcement learning environments have become a critical bottleneck and a billion-dollar market in the training of frontier AI models.

This post is a collaboration between guest author Chris Barber and JS Denain from Epoch AI.

Reinforcement learning (RL) environments have become central to how frontier AI labs train their models. In September 2025, The Information reported that Anthropic had discussed spending over $1 billion on RL environments over the following year. As Andrej Karpathy put it in his 2025 year-in-review: by training LLMs on a wide range of verifiable tasks across different environments, “the LLMs spontaneously develop strategies that look like ‘reasoning’ to humans.”

This wave of RL for capabilities started with OpenAI’s o1, which was trained on math and coding problems with verifiable answers. Since then, labs have expanded the range of tasks they train on, all the while scaling the amount of compute spent on RL training.

Without diverse, high-quality environments and tasks to train on, throwing more compute at RL risks wasting much of it. As a result, creating those tasks and environments has become a key bottleneck for scaling capabilities, and a growing market that remains largely behind closed doors.

To understand the emerging industry of building environments and tasks that labs use to RL-train their models, we interviewed 18 people across RL environment startups, neolabs, and frontier labs. We asked them what RL environments and tasks look like, how labs use them, what makes a good one, and where the field is headed.

Main takeaways:

Enterprise workflows are a major growth area. Math and coding tasks came first, but we’re now seeing significant growth in enterprise workflows: tasks like navigating Salesforce, filing reports, or manipulating spreadsheets.
Reward hacking is a top concern. Interviewees consistently cited robustness against reward hacking as a key quality criterion. Models find ways to game graders, and preventing this requires extensive iteration on both environments and tasks.
Scaling without sacrificing quality is hard. The main difficulties in growing the number of environments and tasks are management (coordinating a growing number of task builders) and maintaining good quality assessment processes.

What are RL environments and tasks?

In modern reinforcement learning for language models, the model is given a task to accomplish and a set of actions it can take. The model attempts the task, and a grader (typically automated, such as a unit test or an LLM judging against a rubric) assigns a score to its attempts. These scored attempts are then used to update the model’s weights, reinforcing successful strategies.

The RL environment is defined by the set of actions the model can take (running code, thinking out loud, clicking buttons, searching documents) and the surrounding context that determines the effect of these actions (environment variables, file systems, the state of a simulated application). In practice, the environment is often delivered as a Docker container.
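To make the definition concrete, here is a minimal sketch of an environment as an action set plus state. The class name `FileSystemEnv`, its three actions, and the in-memory file system are illustrative assumptions, not an interface any lab or vendor is known to use; a real environment would typically wrap a container rather than a Python dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class FileSystemEnv:
    """Toy environment (hypothetical): the state is an in-memory file system."""

    files: dict[str, str] = field(default_factory=dict)

    def actions(self) -> list[str]:
        # The set of available actions is part of what defines the environment.
        return ["read_file", "write_file", "list_files"]

    def step(self, action: str, **kwargs: str) -> str:
        # The surrounding state determines the effect of each action.
        if action == "list_files":
            return "\n".join(sorted(self.files))
        if action == "read_file":
            return self.files.get(kwargs["path"], "ERROR: no such file")
        if action == "write_file":
            self.files[kwargs["path"]] = kwargs["content"]
            return "ok"
        return f"ERROR: unknown action {action!r}"
```

The same `read_file` action returns different observations depending on what earlier `write_file` calls did, which is the sense in which the surrounding context determines the effect of an action.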

Each task consists of a prompt instructing the model to achieve an objective, and a grader that determines whether (or to what extent) the objective was met. Terminology in this space isn’t fully standardized, and the boundary between “environment” and “task” is somewhat fuzzy. In this piece we discuss both environments and tasks, since they’re often built and sold together.
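The prompt-plus-grader pairing can be sketched in a few lines. The `Task` class, the `exact_match_grader` helper, and the arithmetic prompt are hypothetical names and content chosen for illustration; real graders are usually unit tests or rubric-based LLM judges rather than string comparison.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Task:
    """A task (hypothetical structure): an instruction plus a way to score the result."""

    prompt: str
    grader: Callable[[str], float]  # maps the model's final answer to a score in [0, 1]


def exact_match_grader(expected: str) -> Callable[[str], float]:
    """Build a binary grader that accepts only the expected answer."""

    def grade(answer: str) -> float:
        # Ignore surrounding whitespace, but nothing else.
        return 1.0 if answer.strip() == expected else 0.0

    return grade


task = Task(
    prompt="What is 17 * 23? Reply with the number only.",
    grader=exact_match_grader("391"),
)
```

A grader returning a float rather than a boolean leaves room for the "to what extent" case, such as partial credit on a rubric.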

Here are some examples of environments and the kinds of tasks they could support:

A git repository: With tasks like fixing a bug so that unit tests pass, similar to benchmarks like SWE-bench Verified. The task specifies a git repository at a specific commit with a failing test suite; the environment provides the operating system and tools to interact with the repo; the grader runs the tests and checks that they pass.
An Airbnb clone: With tasks like finding the cheapest two-bedroom listing in a given city for specific dates. The environment is a simulated website with realistic listings, prices, and filters; the agent sees a structured representation of the page (like a DOM) and outputs actions like clicking elements or typing into fields. The grader verifies the final answer.
A Bloomberg terminal clone: With tasks like finding the 5-year compound annual growth rate for a list of companies. The environment simulates the terminal’s interface and data; the grader checks whether the returned figures match the correct values.
An Excel clone: With tasks like creating a pivot table showing revenue by region from a raw dataset. The environment provides a spreadsheet application with realistic functionality; the grader compares the output against a reference solution.
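As an illustration of the Bloomberg-style example, a grader for the growth-rate task might look like the sketch below. It uses the standard compound annual growth rate formula, (end / start)^(1 / years) - 1. The function names, the tolerance value, and the partial-credit scoring are assumptions made for this example, not a description of any vendor's grader.

```python
import math


def cagr(start_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate as a fraction (0.10 means 10% per year)."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


def grade_cagr_answers(
    submitted: dict[str, float],
    reference: dict[str, float],
    abs_tol: float = 5e-4,  # assumed tolerance: 0.05 percentage points
) -> float:
    """Return the fraction of companies whose submitted figure matches the reference."""
    if not reference:
        return 0.0
    correct = sum(
        1
        for name, expected in reference.items()
        if name in submitted
        and math.isclose(submitted[name], expected, rel_tol=0.0, abs_tol=abs_tol)
    )
    return correct / len(reference)
```

The tolerance matters in practice: too tight and correctly rounded answers are rejected, too loose and wrong answers pass.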

For computer use environments like the Excel clone, a single environment might support hundreds of different tasks. For coding environments, it’s more common for each environment to contain just one task, since setting up a repo state is relatively cheap.

How are RL environments used by labs?

Each environment and task can be used in three main ways: for reinforcement learning, for benchmarking, or for supervised fine-tuning on trajectories that solve the task.

Reinforcement learning remains the primary use case. As one RL environment startup employee put it: “RL is the main use. We have some requests for creating envs for benchmarking. I’d say perhaps 10-20x more the former vs the latter.” One difference is that benchmarks are typically built for single-turn evaluation, whereas there’s growing interest in RL environments that capture multi-turn interactions between agent and user.

Environments are also used to generate data for supervised learning, by using successful RL trajectories as training examples during midtraining. One interviewee noted: “While it might not drive purchasing today, a well-designed environment can be used as an effective mechanism for synthetic data generation. I feel this will be increasingly important as env development matures and designers target this use-case.”

One interviewee noted that the use of supervised fine-tuning (SFT) has been growing, especially for interleaved thinking and tool calling. With SFT, you can choose a single good trajectory and train on that, whereas RL requires multiple trajectories with enough differentiation between them to provide a learning signal. This makes SFT more practical when it's relatively easy to produce good trajectories but hard to get a reliable grader or enough variation between attempts.
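The point about differentiation can be illustrated with a small sketch. It assumes a mean-baseline advantage, where each attempt is scored relative to the average of its group; this is one common choice, not something the interviewees specified. The function names are hypothetical.

```python
from statistics import mean


def group_advantages(rewards: list[float]) -> list[float]:
    """Score each attempt relative to the average of its group (assumed baseline)."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def pick_sft_example(trajectories: list[str], rewards: list[float]) -> str:
    """For SFT, a single high-scoring trajectory is enough to train on."""
    best = max(range(len(rewards)), key=rewards.__getitem__)
    return trajectories[best]


# Every attempt succeeds: the advantages are all zero, so RL has nothing to reinforce.
print(group_advantages([1.0, 1.0, 1.0, 1.0]))

# Mixed outcomes: successes get positive advantages and failures negative ones.
print(group_advantages([1.0, 0.0, 0.0, 1.0]))
```

Under this assumption, a task the model always solves (or never solves) contributes no gradient signal, while SFT can still use any one good trajectory from it.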

Which companies build RL environments?

A growing number of specialized startups focus specifically on building RL environments. They cover a range of domains, from software engineering tasks to computer use to math and finance.

Traditional data providers like Mercor, Surge, Handshake, and Turing, which used to primarily provide human-labeled data, now also sell RL environments. Part of what you pay for is their QA processes and supervision infrastructure, but as one founder put it, the main value add is “they have the guys”: if you need to scale up task creation quickly, they can staff a project faster than you could hire in-house.

In-house teams at model developers are also building environments. This includes both frontier labs...
