A case study in testing with 100+ Claude agents in parallel

A detailed look at how the mngr tool uses parallel Claude agents to run, debug, and improve its own codebase through automated end-to-end testing.

In our previous blog post, we introduced mngr and how you can use it to usefully launch hundreds of parallel agents. Here are all the details of how we actually use mngr to run and improve itself, by testing its own demo script.

High-level architecture

This is how the entire setup works:

We start from a tutorial script, tutorial.sh, containing blocks of commands. A block is simply a sequence of consecutive non-empty lines.
For each block, we derive one or more pytest functions.
For each pytest function, we launch an agent to run, debug, fix and improve it.
Finally, we integrate the outcome of all the agents together.
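The block definition in the first step (a sequence of consecutive non-empty lines) is simple enough to sketch directly. The following is a minimal illustration, not mngr's actual parser: the function name split_into_blocks is hypothetical, and treating whitespace-only lines as empty is an assumption.

```python
def split_into_blocks(script_text: str) -> list[str]:
    """Split a script into blocks of consecutive non-empty lines.

    Assumption: a line containing only whitespace counts as empty.
    """
    blocks: list[str] = []
    current: list[str] = []
    for line in script_text.splitlines():
        if line.strip():
            # Still inside a block: keep accumulating lines.
            current.append(line)
        elif current:
            # An empty line closes the block we were building.
            blocks.append("\n".join(current))
            current = []
    if current:
        # The file may end without a trailing blank line.
        blocks.append("\n".join(current))
    return blocks
```

Several blank lines in a row do not produce empty blocks, so extra spacing in the tutorial script is harmless.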

Let’s dive into how each step works.

Writing the tutorial script

This script is seeded with a lot of content we wrote ourselves, but writing examples by hand gets tiring after 50 or so. So we simply:

Write some comments in the file, like # Managing snapshots
Ask a coding agent to fill in the blank.
Review and keep the ones that we like.

Since we already have a lot of documentation elsewhere in the codebase—in particular, we have auto-generated man pages committed into the git repo, like this one—the agent does a good job of generating examples!

In fact, even when it doesn’t, it’s still useful: that means our interface is too confusing or was not properly documented, and a human may have problems figuring out how to use it too. We then used that signal to refine mngr’s interface to be as simple as possible.

Asking agents to generate examples turned out to be a win-win situation: we either get good examples, or we get bad examples and use them to improve mngr itself!

Converting tutorial blocks to pytest functions

Now that we have a healthy amount of tutorial commands, we can ask a coding agent to convert them into pytest functions. There are a few details worth mentioning.

Tutorial blocks tend to be concise and sometimes a bit contrived, but we want tests to be more exhaustive, covering both happy and unhappy paths. So this is a 1:N correspondence: for the same tutorial block, we can expect slight variations of the command itself or the environment to result in different outcomes, and they often deserve separate test cases.
In order to preserve the correspondence between tutorial blocks and test functions, we also ask the agent to “declare” which tutorial block it corresponds to, by citing the tutorial block in the function it generates, using a specific API in the test fixture.
In order to keep the agent honest, we also use a simple script to check that it’s indeed the case - for each tutorial block, there’s at least one pytest function that cites the tutorial block.

All of these are packaged into a slash command sync-tutorial-to-e2e-tests.

The coding agent usually can’t do a very good job of writing end-to-end tests in this step, and that’s totally expected. End-to-end tests are difficult for humans to write because of fundamental tensions in all three stages of a test, and the same reasons hold for coding agents:

Arrange: You want to set up as little as possible to reflect real-world usage scenarios, but you need an appropriate amount of setup to make tests appropriately isolated.
Act: You want to be as faithful to real-world commands as possible (in this case, the commands from the tutorial), but you often need some variation for the commands to be suitable for testing.
Assert: You want to test the effect of the commands as closely as possible, but testing e.g. file contents too literally can end up with fragile or flaky tests.

But it’s okay if the coding agent doesn’t do a good job at this stage! We’ll solve this in the next step, but let’s mention a few things about our test framework.

The test framework

The great thing about running CLI commands is that Python (and any other programming language, really) already has an API for it: the subprocess module. Give it a command, and you can get the stdout, stderr and exit code.
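For readers who have not used it, this is roughly what that looks like with the standard library. The helper name run_command is ours for illustration; subprocess.run and its capture_output, text and check parameters are standard.

```python
import subprocess
import sys


def run_command(command: list[str]) -> tuple[int, str, str]:
    """Run a command and return (exit code, stdout, stderr)."""
    completed = subprocess.run(
        command,
        capture_output=True,  # collect stdout and stderr instead of inheriting them
        text=True,            # decode bytes to str
        check=False,          # a non-zero exit code is data for the test, not an exception
    )
    return completed.returncode, completed.stdout, completed.stderr


# Usage: run the current Python interpreter as a stand-in for any CLI.
exit_code, stdout, stderr = run_command([sys.executable, "-c", "print('hello')"])
```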

Still, we built some utilities (really just a thin layer on top of subprocess) so that the test functions can be a little bit more concise and carry a little bit more information. A test function looks like this:

def test_help_succeeds(e2e: E2eSession) -> None:
    e2e.write_tutorial_block("""# or see the other commands--list, destroy, message, connect, push, pull, clone, and more! These other commands are covered in their own sections below.
    mngr --help""")
    result = e2e.run(
        "mngr --help",
        comment="or see the other commands--list, destroy, message, connect, push, pull, clone, and more!",
    )
    expect(result).to_succeed()
    expect(result.stdout).to_contain("Usage")
    expect(result.stdout).to_contain("create")
    expect(result.stdout).to_contain("list")

Building this extra layer also allows us to generate transcripts for the commands, which look like:

or see the other commands--list, destroy, message, connect, push, pull, clone, and more!

$ mngr --help
any output to stdout!
any errors to stderr
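A thin layer like this is easy to picture as a wrapper around subprocess that remembers what it ran. The sketch below is an illustration only: the class name TranscriptSession and the exact transcript layout (comment line, then the command, then stdout and stderr) are assumptions, not the real E2eSession implementation.

```python
import subprocess
from dataclasses import dataclass, field


@dataclass
class TranscriptSession:
    """Runs shell commands and records a human-readable transcript."""

    entries: list[str] = field(default_factory=list)

    def run(self, command: str, comment: str = "") -> subprocess.CompletedProcess:
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, check=False
        )
        lines: list[str] = []
        if comment:
            lines.append(f"# {comment}")
        lines.append(f"$ {command}")
        # Only record streams that actually produced output.
        if result.stdout:
            lines.append(result.stdout.rstrip("\n"))
        if result.stderr:
            lines.append(result.stderr.rstrip("\n"))
        self.entries.append("\n".join(lines))
        return result

    def transcript(self) -> str:
        """Join all recorded commands, separated by blank lines."""
        return "\n\n".join(self.entries)
```

Because run returns the completed process, a test can still assert on the exit code and output while the transcript accumulates as a side effect.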

Finally, remember that mngr runs your agent in a tmux session, and tmux can’t be so easily captured as simple CLI transcripts. But fear not: mngr allows us to define a custom “connect command”, and in our test setup, we redirect it to a script by writing the following config:

[commands.create]
connect_command = "mngr-e2e-connect"

The mngr-e2e-connect script, in turn, uses asciinema to attach to the agent, and saves the recording in the test output directory.

We also built a combined view of all the artifacts: a web page where you can see the CLI transcripts and TUI recordings together.

Orchestrating the tests

Now that we have all those tests, let’s run them! Here’s the plan:

Collect all the test names using pytest --collect-only
For each test, launch an agent to work on it. This means several things:
  a. If the test is failing, either fix the test code or the code it’s testing.
  b. If the test is passing, think about how to improve the test itself: make it more faithful to the original tutorial block, make the assertions more realistic, create additional tests, etc.
  c. In any case, we instruct the agent to write a result JSON file.
Wait for them to finish, pulling their result JSON files and test artifacts.
Collect all the code changes they made, and use an agent to merge them all together into one mega PR.
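Step 1 of the plan above can be sketched as follows. The only flag the plan names is --collect-only; adding -q (which makes pytest print one test ID per line) is our assumption about a convenient way to parse the result, and both function names are hypothetical.

```python
import subprocess
import sys


def parse_collected_tests(output: str) -> list[str]:
    """Extract test IDs such as 'tests/test_help.py::test_help_succeeds'.

    Assumption: every line containing '::' is a test ID. The trailing
    summary line ('N tests collected in ...') has no '::' and is skipped.
    """
    return [line.strip() for line in output.splitlines() if "::" in line]


def collect_test_ids(test_dir: str) -> list[str]:
    """Ask pytest which tests exist under test_dir, without running them."""
    completed = subprocess.run(
        [sys.executable, "-m", "pytest", "--collect-only", "-q", test_dir],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_collected_tests(completed.stdout)
```

Each returned ID can then be handed to one agent as the test it is responsible for.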

Aside from step 1, every other step is implemented using mngr’s primitives:

Use the mngr create primitive to launch testing agents, which also allows us to send an initial prompt.
Use the mngr list primitive to poll the state of the agents.
When an agent is done, use the mngr pull primitive to pull down the result files and test artifacts, and then the mngr stop primitive to stop it.
Finally, use the mngr create primitive again to create the “integrator” agent that merges all the changes together.
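The shape of that loop can be sketched without committing to any particular command-line flags. Everything here is hypothetical: AgentBackend stands in for thin wrappers around mngr create, list, pull and stop, and the agent naming scheme and polling interval are made up for illustration.

```python
import time
from typing import Callable, Protocol


class AgentBackend(Protocol):
    """Hypothetical wrappers around the mngr primitives named above."""

    def create(self, name: str, prompt: str) -> None: ...
    def finished(self) -> set[str]: ...  # names of agents that are done
    def pull(self, name: str) -> None: ...
    def stop(self, name: str) -> None: ...


def run_all(
    backend: AgentBackend,
    test_ids: list[str],
    make_prompt: Callable[[str], str],
    poll_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Launch one agent per test, then pull and stop each one as it finishes."""
    pending: dict[str, str] = {}
    for index, test_id in enumerate(test_ids):
        name = f"e2e-{index}"
        backend.create(name, make_prompt(test_id))
        pending[name] = test_id

    completed: list[str] = []
    while pending:
        # Only consider agents we launched and have not yet handled.
        for name in sorted(backend.finished().intersection(pending)):
            backend.pull(name)
            backend.stop(name)
            completed.append(pending.pop(name))
        if pending:
            sleep(poll_seconds)
    return completed
```

Passing sleep in as a parameter keeps the loop testable with a fake backend and no real waiting.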

Integrating changes from all the agents

Integrating changes from many agents is not a trivial task, even for an agent. We spent a lot of time thinking about and iterating on this, and this is what we ended up with:

Each testing agent divides its commits into implementation fixes and non-implementation fixes (fixing the test or just making it better, fixing some docs, etc.).
When the integrator sees the results from all the testing agents, it just merges all the non-implementation fixes together - since these are usually uncontroversial.
For all the implementation fixes, we instruct the integrator to rank them by importance, and keep them as distinct commits, but merge them into a single linear branch, resolving conflicts along the way.

This results in a single PR that can be easily reviewed by a human: the non-implementation fixes usually can be merged as is, and the implementation fixes can be reviewed one by one, with undesirable ones reverted.
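In our setup the integrator is an agent following instructions, not a script, but the ordering it is asked to produce can be illustrated with a small data-structure sketch. The AgentCommit type, its fields, and the numeric importance score are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentCommit:
    sha: str
    is_implementation_fix: bool
    importance: int = 0  # hypothetical score; higher means more important


def plan_integration(
    commits: list[AgentCommit],
) -> tuple[list[AgentCommit], list[AgentCommit]]:
    """Split commits into (non-implementation, ranked implementation fixes)."""
    non_implementation = [c for c in commits if not c.is_implementation_fix]
    # sorted() is stable, so equally important fixes keep their original order.
    implementation = sorted(
        (c for c in commits if c.is_implementation_fix),
        key=lambda c: c.importance,
        reverse=True,
    )
    return non_implementation, implementation
```

The first list corresponds to the changes that are merged together, and the second to the distinct commits a reviewer walks through one by one.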

Source: Hacker News