How we build evals for Deep Agents

The best agent evals directly measure an agent behavior we care about. This article explores how to source data, create metrics, and run targeted experiments to make agents more accurate and reliable.

Evals shape agent behavior

We’ve been curating evaluations to measure and improve Deep Agents. Deep Agents is an open source, model agnostic agent harness that powers products like Fleet and Open SWE. Evals define and shape agent behavior, which is why it’s so important to design them thoughtfully.

Every eval is a vector that shifts the behavior of your agentic system. For example, if an eval for efficient file reading fails, you’ll likely tweak the system prompt or the read_file tool description to nudge behavior until it passes. Every eval you keep applies pressure on the overall system over time.

It is crucial to be thoughtful when adding evals. It can be tempting to blindly add hundreds (or thousands) of tests. This leads to an illusion of “improving your agent” by scoring well on an eval suite that may not accurately reflect behaviors you care about in production.

More evals ≠ better agents. Instead, build targeted evals that reflect desired behaviors in production.

When building Deep Agents, we catalog the behaviors that matter in production, such as retrieving content across multiple files in the filesystem or accurately composing 5+ tool calls in sequence. Rather than using benchmark tasks in aggregate, we take the following approach to eval curation:

Decide which behaviors we want our agent to follow. Then research and curate targeted evals that measure those behaviors in a verifiable way.
For each eval, add a docstring that explains how it measures an agent capability. This ensures each eval is self-documenting. We also tag each eval with categories like tool_use to enable grouped runs.
Review output traces to understand failure modes and update eval coverage.
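
The steps above can be sketched with a small registry. This is a minimal illustration, not the Deep Agents eval harness: `eval_case`, `REGISTRY`, `select`, and the stubbed `run_agent` are hypothetical names introduced here. Only the `tool_use` tag and the docstring-per-eval idea come from the text.

```python
from typing import Callable, Dict, List

# Hypothetical registry: maps eval name -> metadata. Not part of any real SDK.
REGISTRY: Dict[str, dict] = {}


def eval_case(*tags: str) -> Callable:
    """Register an eval under one or more category tags; require a docstring."""
    def decorator(fn: Callable) -> Callable:
        if not (fn.__doc__ or "").strip():
            raise ValueError(f"{fn.__name__} needs a docstring describing what it measures")
        REGISTRY[fn.__name__] = {"fn": fn, "tags": set(tags), "doc": fn.__doc__.strip()}
        return fn
    return decorator


def run_agent(prompt: str) -> List[str]:
    """Stand-in for a real agent run; returns the names of the tools it called."""
    return ["ls", "read_file", "read_file"]


@eval_case("tool_use", "file_operations")
def eval_lists_before_reading() -> bool:
    """Measures whether the agent lists a directory before reading files in it."""
    calls = run_agent("Summarize the notes in ./docs")
    return (
        "ls" in calls
        and "read_file" in calls
        and calls.index("ls") < calls.index("read_file")
    )


def select(tag: str) -> List[str]:
    """Return the names of all evals carrying the given tag, for grouped runs."""
    return sorted(name for name, meta in REGISTRY.items() if tag in meta["tags"])


if __name__ == "__main__":
    for name in select("tool_use"):
        meta = REGISTRY[name]
        print(name, "PASS" if meta["fn"]() else "FAIL", "-", meta["doc"])
```

Rejecting evals without a docstring at registration time is one way to keep the suite self-documenting; a test framework's own marker mechanism could play the same role as the tags here.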

Because we trace every eval run to a shared LangSmith project, anyone on the team can jump in to analyze issues, make fixes, and reassess the value of a given eval. This creates shared responsibility for adding and maintaining good evals. Running many models across many evals can also get expensive, so targeted evals save money while improving your agent.

In this blog we cover:

How we curate data
How we define metrics
How we run the evals

How we curate data

There are a few ways we source evals:

Using feedback from dogfooding our agents
Pulling selected evals from external benchmarks (like Terminal Bench 2.0 or BFCL) and often adapting them for a particular agent
Writing our own (artisanal) evals and unit tests by hand for behaviors we think are important

Note: We separate SDK unit and integration tests (system prompt passthrough, interrupt config, subagent routing) from model capability evals. Any model passes those tests, so including them in scoring adds no signal. You should absolutely write unit and integration tests, but this blog focuses solely on model capability evals.

Dogfooding agents & reading traces are great sources of evals

Traces give us data to understand agent behavior. Our goal is to understand each failure mode, propose a fix, rerun the agent, and track progress and regressions over time. For example, every Open SWE interaction is traced, so a trace that shows a mistake can easily become an eval that makes sure the mistake doesn’t happen again.
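
One way to picture turning a reviewed trace into a regression eval is sketched below. The `Trace` and `EvalCase` shapes and the `trace_to_eval` helper are assumptions made for illustration; they are not the LangSmith or Open SWE data model, and the example trace is invented.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trace:
    """Hypothetical, simplified record of one agent run."""
    user_input: str
    tool_calls: List[str]
    failure_note: str = ""  # filled in by whoever reviewed the trace


@dataclass
class EvalCase:
    """A replayable input plus a check that encodes the reviewed failure."""
    name: str
    user_input: str
    check: Callable[[List[str]], bool]
    tags: List[str] = field(default_factory=list)


def trace_to_eval(trace: Trace, name: str, forbidden_tool: str) -> EvalCase:
    """Build a regression eval asserting the agent no longer calls `forbidden_tool`."""
    return EvalCase(
        name=name,
        user_input=trace.user_input,
        check=lambda calls: forbidden_tool not in calls,
        tags=["regression"],
    )


# Invented example: the agent rewrote a file it was only asked to read.
bad = Trace(
    user_input="What does config.yaml set the timeout to?",
    tool_calls=["read_file", "write_file"],
    failure_note="Modified a file during a read-only question",
)
case = trace_to_eval(bad, name="no_write_on_read_only_question", forbidden_tool="write_file")

print(case.check(bad.tool_calls))  # the original trace fails the new check
print(case.check(["read_file"]))   # a run without the write passes
```

Keeping the original input on the eval case means the same prompt can be rerun after each fix, which is what makes progress and regressions trackable over time.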

We group evals by what they test

| Category | What It Tests |
|---|---|
| file_operations | File tools (read, write, edit, ls, grep, glob), parallel invocation, pagination |
| retrieval | Finding information across files, search strategies, multi-hop document synthesis |
| tool_use | Selecting the right tool, chaining multi-step calls, tracking state across turns |
| memory | Recalling seeded context, extracting implicit preferences, persisting durable info |
| conversation | Asking clarifying questions for vague requests, sustaining multi-turn dialogue |
| summarization | Handling context overflow, triggering summarization, recovering info |
| unit_tests | SDK plumbing - system prompt passthrough, subagent routing, etc. |

How we define metrics

When choosing a model, we start with correctness. Measuring correctness depends on what's being tested. Most internal evals use custom assertions. For evals where correctness is semantic, we use LLM-as-a-judge.
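
The two styles of correctness check might look like the following sketch. The tool-call log format, `assert_efficient_read`, and `judge_correct` are hypothetical names, not the checks used in Deep Agents. The judge accepts any callable that maps a prompt to text, so no specific model API is assumed, and `stub_llm` merely stands in for a real model.

```python
from typing import Callable, Dict, List


def assert_efficient_read(tool_calls: List[Dict], max_reads: int = 2) -> bool:
    """Custom assertion: the agent called read_file at least once and at most `max_reads` times."""
    reads = [c for c in tool_calls if c["name"] == "read_file"]
    return 0 < len(reads) <= max_reads


def judge_correct(question: str, reference: str, answer: str,
                  llm: Callable[[str], str]) -> bool:
    """LLM-as-a-judge: ask a model whether `answer` matches `reference` in meaning."""
    prompt = (
        "Reply with exactly YES or NO.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {answer}\n"
        "Does the candidate convey the same facts as the reference?"
    )
    return llm(prompt).strip().upper().startswith("YES")


def stub_llm(prompt: str) -> str:
    """Placeholder judge for local runs; swap in a real model call."""
    candidate_part = prompt.split("Candidate answer:")[1]
    return "YES" if "30 seconds" in candidate_part else "NO"


calls = [{"name": "ls"}, {"name": "read_file"}]
print(assert_efficient_read(calls))
print(judge_correct("What is the timeout?", "30 seconds",
                    "The timeout is 30 seconds.", stub_llm))
```

The custom assertion is deterministic and cheap, which suits behaviors that can be read directly off the tool-call log. The judge is reserved for answers whose correctness depends on meaning rather than exact form.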

Once several models clear that bar, we move to efficiency. Metrics we measure include: