Is it agentic enough? Benchmarking open models on your own tooling

This post explores how to benchmark and optimize software libraries for AI agents, using Hugging Face's transformers as a case study to measure token usage, latency, and success rates.

Benchmarking transformers revisions across different metrics

This is a human-made, agent-focused blogpost.

Coding agents increasingly work with our software instead of us: describe a task, and the agent picks the library, writes the calls, runs them, and debugs its own mistakes. When the library gets in the way, it will happily bypass it and rewrite the logic from scratch. This introduces a new concept in library development: the code should not only be correct and fast, but should be designed so that an agent can drive it effectively. A clunky API or stale docs annoy us developers, but they now also send the agent down a longer, more expensive path.

Most benchmarks just look at the final answer. We wanted the whole process instead: not just whether the agent got it right, but how much work it took to get there, and how that shifts across models, library revisions, and tasks. We measured exactly that, using transformers as our case study.

Here, we introduce a tool-specific benchmark that focuses on how the answer was found, and provide a simple implementation of one such harness. It runs entirely on open models driven by the pi coding agent, with the full sweep of models × revisions × tasks fanned out across Hugging Face Jobs so every run sees identical hardware.

But how do you optimize software for agents?

We're strong believers in the following two software principles:

If it isn't tested, then it doesn't work
If it isn't documented, then it doesn't exist

This remains true for agent-optimized tooling and, for once, the two are directly tied to each other.

If you want your tool to exist for an agent, it needs to be discoverable: the API needs to be clear, the docs need to be extensive, and both need to be structured so that the agent has rapid access to the useful files and examples. If you want your tool to work for an agent, then you should test it for agentic use.

We'll use transformers as an example throughout this blogpost: agents using it to solve ML tasks (classifying text, captioning images, transcribing audio), not contributing code to it. The harness, though, was designed to work with any tool that can be operated from the command line.

Our intuition on transformers was that usage could be dramatically simplified with a few changes: a CLI, a Skill, and self-contained, task-specific examples. This is the same recipe recently applied to the hf CLI, redesigned to be agent-optimized, where agents used 1.3–1.8× (and up to 6×) fewer tokens. We wanted to know whether that kind of win generalizes, and whether it could be useful for transformers as well.

Intuition is a powerful tool, but we wanted more evidence before we opened PRs that add several thousand lines of code to a codebase as widely used as transformers. We set out to measure what success looks like.

Two agents can both produce the correct label for a sentiment-classification task, but one writes a 40-line Python script, imports transformers, debugs a shape error, re-runs twice, and finally prints the answer, while the other types transformers classify --model ... --text "..." and is done in one call.

Both reach POSITIVE (0.9999). Here are the two paths an agent actually took on this exact task:

# Task: classify the sentiment of "I absolutely loved the movie, it was fantastic!"
- # one agent: pipe a script into python and parse the output
- python - <<'PY'
- from transformers import AutoTokenizer, AutoModelForSequenceClassification
- import torch
- import torch.nn.functional as F
-
- model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased-finetuned-sst-2-english")
- tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased-finetuned-sst-2-english")
- inputs = tokenizer("I absolutely loved the movie, it was fantastic!", return_tensors="pt")
- with torch.no_grad():
-     logits = model(**inputs).logits
- probs = F.softmax(logits, dim=1)
- idx = torch.argmax(probs, dim=1).item()
- print(model.config.id2label[idx], probs[0][idx].item())
- PY
+ # the other agent: one command
+ transformers classify \
+   --model distilbert/distilbert-base-uncased-finetuned-sst-2-english \
+   --text "I absolutely loved the movie, it was fantastic!"

Both methods reach the same result. But they have very different profiles in cost, latency, token usage, and failures.

If your evaluation only checks the final string, you're blind to all of this, and you can't tell whether a change you shipped to the library (a CLI improvement, better error messages, a Skill) actually helped agents.

Our goal with this harness is to evaluate how much work an agent has to do to perform a given task, and whether changes to the library improve performance.

A few words on how we'll evaluate agents here.

We run every task under three variants (or "tiers"), three different ways an agent can come at transformers:

bare: pip install transformers, and nothing else
clone: the full transformers source, checked out in the working directory
skill: a packaged Skill (the CLI's docs + task examples), loaded in context

These aren't nested: skill doesn't contain clone (it ships curated docs, not the source tree), and neither strictly contains the other. Each gives the agent a different kind of help. As we'll see, a model can sometimes do better on clone than on skill.
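One way to picture "not nested" is to model each tier as the set of extra help it gives the agent and check that neither set contains the other. This is only a sketch under our own assumptions: the Tier class and the labels source_tree, cli_docs, and task_examples are hypothetical names, not the harness's actual configuration, and we assume the library itself is installed in every tier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """Hypothetical description of one tier (names are illustrative)."""
    name: str
    extras: frozenset[str]  # help available beyond the installed package


TIERS = {
    "bare": Tier("bare", frozenset()),
    "clone": Tier("clone", frozenset({"source_tree"})),
    "skill": Tier("skill", frozenset({"cli_docs", "task_examples"})),
}


def is_nested(inner: Tier, outer: Tier) -> bool:
    """True if everything `inner` offers is also offered by `outer`."""
    return inner.extras <= outer.extras


clone, skill = TIERS["clone"], TIERS["skill"]
# clone and skill each offer something the other lacks.
neither_contains_other = not is_nested(clone, skill) and not is_nested(skill, clone)
```

Under this model, bare is nested in both other tiers, while clone and skill are simply different, which is why one can outperform the other in either direction.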

A few more choices:

For now we focus only on deterministic tasks whose answers can be matched exactly, as they provide solid ground for experimentation. Model-as-a-judge and other schemes are the obvious next steps for other tasks.
Every run is its own Hugging Face Job: one per (model × revision × task), so the whole sweep runs in parallel on identical hardware, which keeps the comparison fair at scale.
Results and traces land in a Hugging Face Bucket: fast, no versioning needed, and handles very high write concurrency.
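The "one job per (model × revision × task)" layout is a Cartesian product, which could be enumerated as in the sketch below. The model, revision, and task names are placeholders, and the job_id format and build_sweep helper are our own illustration rather than the harness's code. Submitting each spec to Hugging Face Jobs is deliberately left out, since this post does not show that call.

```python
from itertools import product

# Placeholder names; the real sweep's models, revisions, and tasks differ.
MODELS = ["model-a", "model-b"]
REVISIONS = ["rev-old", "rev-new"]
TASKS = ["task-1", "task-2", "task-3"]


def build_sweep(models, revisions, tasks):
    """Return one job spec per (model, revision, task) combination."""
    return [
        {"model": m, "revision": r, "task": t, "job_id": f"{m}__{r}__{t}"}
        for m, r, t in product(models, revisions, tasks)
    ]


jobs = build_sweep(MODELS, REVISIONS, TASKS)
print(len(jobs), "independent jobs")
```

Because every spec is independent, the jobs can run in parallel, and a failed run can be retried without touching the rest of the sweep.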

Not all models driving agents are equal, and the differences between them change what you should look at when running them.

Large open models

At one end, you have the largest, most capable open models. On reasonably common tasks, these should get the right answer, eventually. For them, task completion saturates near 100% and stops telling you much about your tool; a more relevant benchmark is the effort it took the agent to get there: how many turns, tokens, and seconds it took, and whether it walked a clean path or used deprecated APIs.

Local

Local models vary widely in size, and so do their abilities. Metrics such as "match %" are more relevant than for their larger counterparts, as you can see how model sizes/capabilities affect results on your specific tool.

This harness not only gives library maintainers guidance on how to improve a repository for agent interactions, but also helps assess how different agents and models perform on the tasks users care about.

The harness scores every run on several axes, so that you can ask what actually matters for each class of model:

match %: did the final answer contain the expected result (per-task, case-insensitive substring / regex / exact, all explicit in the report);
median time and median tokens (new vs. cached vs. generated);
runs with error %: including a guard that flags runs which produced nothing (0 output tokens, no tool calls, no answer) so silent failures don't masquerade as "0";
marker adoption: tool-defined behavior markers; see below for an explanation of what this is.
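To make these axes concrete, here is a minimal sketch of how such scoring could be computed. It is not the harness's implementation: the run dictionary keys (answer, output_tokens, tool_calls, seconds, error) and the function names are hypothetical, we read "case-insensitive" as applying to the substring mode only, and the sample runs are placeholders rather than measurements.

```python
import re
from statistics import median


def is_match(answer: str, expected: str, mode: str) -> bool:
    """Check a final answer against the expected result under one match mode."""
    if mode == "exact":
        return answer.strip() == expected
    if mode == "substring":
        return expected.lower() in answer.lower()  # case-insensitive
    if mode == "regex":
        return re.search(expected, answer) is not None
    raise ValueError(f"unknown match mode: {mode}")


def produced_nothing(run: dict) -> bool:
    """Guard: flag runs with 0 output tokens, no tool calls, and no answer."""
    return run["output_tokens"] == 0 and run["tool_calls"] == 0 and not run["answer"]


def summarize(runs: list[dict], expected: str, mode: str) -> dict:
    """Aggregate match %, error %, and medians over a group of runs."""
    errored = [r for r in runs if r.get("error") or produced_nothing(r)]
    matched = [r for r in runs if is_match(r["answer"], expected, mode)]
    return {
        "match_pct": 100 * len(matched) / len(runs),
        "error_pct": 100 * len(errored) / len(runs),
        "median_seconds": median(r["seconds"] for r in runs),
        "median_output_tokens": median(r["output_tokens"] for r in runs),
    }


# Placeholder runs, not measurements.
runs = [
    {"answer": "POSITIVE (0.9999)", "output_tokens": 120, "tool_calls": 1, "seconds": 4.0, "error": False},
    {"answer": "", "output_tokens": 0, "tool_calls": 0, "seconds": 1.0, "error": False},
    {"answer": "negative", "output_tokens": 300, "tool_calls": 3, "seconds": 10.0, "error": False},
]
report = summarize(runs, expected="positive", mode="substring")
```

The second run reports no error of its own, yet the guard still counts it as a failure. Whether medians should include such runs is a design choice; this sketch includes them.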

All of it lands in a report you can directly examine:

The live report: Overview, Coverage, and Results, all client-side.

And because it captures the native agent trace of every run, numbers are just the beginning: you can read exactly what the agent did, command by command. The traces are shareable through the Hub's agent-traces viewer:

A run rendered in the Hub's agent-traces viewer: MiniMax-M2.7 on the answer-question task.

Source: Hugging Face Blog