Better Harness: A Recipe for Harness Hill-Climbing with Evals

LangChain introduces Better-Harness, a system for improving AI agents by optimizing their harness layer. By using evals as a learning signal, this approach enables agents to achieve higher performance while maintaining generalization.

By Vivek Trivedy, Product Manager @ LangChain

We can build better agents by building better harnesses. But to autonomously build a “better” harness, we need a strong learning signal to “hill-climb” on. We share how we use evals as that signal, plus design decisions that help our agent generalize instead of overfit. Better-Harness is a system for iteratively sourcing and improving your harness with evals.

TL;DR:## Evals are training data for Agents

In classical machine learning, training data guides the model’s learning process. Each training example contributes a gradient that updates the model’s weights toward “correctness.” We have a similar learning loop for agents.

Evals encode the behavior we want our agent to exhibit in production. They’re the "training data" for harness engineering. Each eval case contributes a signal like “did the agent take the right action” or “produce the right outcome?” That signal guides the next proposed edit to the harness.

The same rigor and care we put into data quality and curation for model training should also go into eval design. We discuss the importance of data quality in a previous post, how we build evals for Deep Agents.

There’s some great recent work that formalize the steps to optimize harnesses including Meta-Harness from Stanford and Auto-Harness from DeepMind. We also previously shared a Harness Improvement Loop to hill-climb Terminal Bench 2.0 by just tweaking the harness layer. We think there’s great future work to be done around the update algorithm itself, but harness improvement is a compound system that goes beyond the update algorithm which is what we talk about here.

Better-Harness is a take on compound systems engineering.

data sourcing → experiment design —> optimization —> review & acceptance

So we include practical details that go alongside the update loop such as how we source evals in the first place, how we design against overfitting, store Traces over time, and manually review updates to sanity check anything we ship to production.

Sourcing good evals

Evals are the foundation that power the harness hill-climbing process. Here are the practical ways we source, curate, and use them.

Hand-curated. For any given task, the team manually writes examples that capture what we think the agent should do in production. These are often high value, but difficult to generate at scale.

Production traces. Every agent interaction generates a trace where failures become eval cases. Mining traces for eval material is the leverage, high-throughput way to improve evals over time. Even before running an agent over evals, often a team dogfooding our agent will report errors directly in Slack with a Trace link. We recommend dogfooding agents and directly sharing feedback for everyone to see, it helps build shared knowledge of agent behavior.

External datasets. These datasets are useful but need to be manually curated to make sure the test cases used to improve the agent reflect desired behaviors. Often each task is adjusted to make sure they measure the important behavior.

Tag everything. Every eval gets tagged to behavioral categories: "tool selection," "multi-step reasoning," etc. Tags enable meaningful holdout sets and targeted experiments. It also saves a lot of money because we can run subsets of evals.

Building learning systems that generalize

The ideal outcome for any learning system is generalization. We give an input signal that captures the distribution of behaviors we want in the wild. The system fits to it and then “just works” on new inputs it's never seen.

The obvious problem: We don't have unlimited data.

The fix: Encode important behaviors into curated evals. Quality > quantity, a small set of well-tagged evals covering the behaviors you care about beats thousands of noisy but high-coverage evals.

The subtle problem → agents are famous cheaters: Any learning system is prone to reward hacking where the agent overfits its structure to make the existing evals pass that it can see. This makes sense because the loop just wants to “make number go up” and doesn't know about generalization. We prompt to avoid overfitting but it isn’t perfect.

The fix: Holdout sets become a proxy for true generalization. We’ve seen approaches that We pair with human review as a second signal and we get semi-automated systems can improve scores while avoiding behaviors we don’t want in prod.

Better-Harness: a recipe for hill climbing your harness

We created a scaffold for autonomously improving our harness using evals as a signal in each step. A research version is open sourced here, here are the main steps:

**Source and tag evals.**This is a mix of hand-writing evals, mining them from production traces, and using/adapting external datasets. We tag each eval to behavioral categories (like multi-step retrieval) and regularly remove evals that are saturated or we longer feel are useful for the agent + current generation of models.**Split data per category.**Create Optimization and Holdout sets. This is very important! We find that autonomouos hill-climbing has a tendency to overfit to tasks so holdout sets ensure that learned optimizations work on previously unseen data, though the general distirbution should match existing evals. This mirrors what production will look like.**Run a Baseline.**Run a baseline experiment on the Optimization & Holdout sets before any edits. This grounds all updates in the update steps.**Optimize.**Each iteration runs autonomously with optional human review:Diagnosefrom traces. Scores aggregate performance over categories and then Traces show the details of what went wrong and why.Experimenta targeted harness change. We scope to one change at a time to avoid confounding but that may mean updating a prompt and tool simultaneously so the system works well together.

**Validate:**In each step, the loop checks to make sure that the proposed change helped pass new evals while avoiding regressions on existing passing cases. It’s common that some change results in a net overall score gain with some regressions. The agent gets context of these regressions so it can try to fix them in the next update without losing the gains from the existing update.**Human review.**We manually review changes and edge cases metrics miss. This often includes instructions that are overfit to the optimization set and although they don’t hurt generalization, they end up being a waste of tokens. This gives us another sanity check and gate against overfitting.

Examples of harness changes

Here are the kinds of changes the optimization loop can discover and validate:

Prompt and instruction updates. The most common change. The agent keeps misinterpreting a tool's output format, or it's too aggressive about calling a tool when it should ask a clarifying question first. The fix is a targeted instruction update addition like "when querying multiple files that have dependent information, offload information to the filesystem and re-aggregate before giving a final answer."

Adding or updating a tool or tool description. The agent may fail contextualizing when to use a new tool. Edits include examples on of how to use, how to chain this tool, an updated tool description, and editing the overall tool suite to disambiguate similar tools

Results from the Better-Harness loop

We tested this approach with Claude Sonnet 4.6 and Z.ai’s GLM-5 on a subset of our evals. Note: We have other work underway generalizing Better-Harness across many models in deepagents using a bigger eval suite. The goal is to publish a series of model profiles that capture the nuances of each model tuned for our evals as a public artifact.

Source: LangChain Blog