Are the costs of AI agents also rising exponentially? (2025)

While the capabilities of AI agents are growing exponentially, the cost of having them perform tasks is often overlooked. This article explores whether AI agents are becoming less cost-competitive with human labor.

There is an extremely important question about the near-future of AI that almost no-one is asking.

We’ve all seen the graphs from METR showing that the length of tasks AI agents can perform has been growing exponentially over the last 7 years. While GPT-2 could only do software engineering tasks that would take someone a few seconds, the latest models can (50% of the time) do tasks that would take a human a few hours.

As this trend shows no signs of stopping, people have naturally taken to extrapolating it out, to forecast when we might expect AI to be able to do tasks that take an engineer a full work-day; or week; or year.

But we are missing a key piece of information — the cost of performing this work.

Over those 7 years AI systems have grown exponentially. The size of the models (parameter count) has grown by 4,000x and the number of times they are run in each task (tokens generated) has grown by about 100,000x. AI researchers have also found massive efficiencies, but it is eminently plausible that the cost for the peak performance measured by METR has been growing — and growing exponentially.

This might not be so bad. For example, if the best AI agents are able to complete tasks that are 3x longer each year and the costs to do so are also increasing by 3x each year, then the cost to have an AI agent perform tasks would remain the same multiple of what it costs a human to do those tasks. Or if the costs have a longer doubling time than the time-horizons, then the AI-systems would be getting cheaper compared with humans.
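To make that arithmetic concrete, here is a minimal sketch. The starting values (a 1-hour horizon, a $100 agent cost per task and a $100 human hourly wage) are made-up numbers chosen for illustration, not figures from METR, and the function name is hypothetical.

```python
def cost_ratio_over_time(horizon_growth, cost_growth, years,
                         start_horizon_hours=1.0,
                         start_task_cost=100.0,     # hypothetical agent cost per task, USD
                         human_hourly_wage=100.0):  # hypothetical human wage, USD/hour
    """Return the agent-to-human cost ratio for a task at the time horizon, year by year."""
    ratios = []
    for year in range(years + 1):
        horizon = start_horizon_hours * horizon_growth ** year
        agent_cost = start_task_cost * cost_growth ** year
        human_cost = human_hourly_wage * horizon
        ratios.append(agent_cost / human_cost)
    return ratios


# Horizons and costs both grow 3x per year: the ratio stays flat.
same_pace = cost_ratio_over_time(horizon_growth=3.0, cost_growth=3.0, years=3)

# Costs grow only 2x per year while horizons grow 3x: the agent gets relatively cheaper.
slower_costs = cost_ratio_over_time(horizon_growth=3.0, cost_growth=2.0, years=3)

print(same_pace)
print(slower_costs)
```

Only the ratio of the two growth rates matters here; the starting values shift the level of the ratio but not its direction.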

But what if the costs are growing more quickly than the time horizons? In that case, these cutting-edge AI systems would be getting less cost-competitive with humans over time. If so, the METR time-horizon trend could be misleading. It would be showing how the state of the art is improving, but part of this progress would be due to more and more lavish expenditure on compute, so it would be diverging from what is economical. It would be becoming more like the Formula 1 of AI performance: showing what is possible, but not what is practical.

So in my view, a key question we need to ask is:

**How is the ‘hourly’ cost of AI agents changing over time?**

By ‘hourly’ cost I mean the financial cost of using an LLM to complete a task right at the model’s 50% time horizon divided by the length of that time horizon. So as with the METR time horizons themselves, the durations are measured not by how long it takes the model, but how long it typically takes humans to do that task. For example, Claude 4.1 Opus’s 50% time horizon is 2 hours: it can succeed in 50% of tasks that take human software engineers 2 hours. So we can look at how much it costs for it to perform such a task and divide by 2, to find its hourly rate for this work.
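As a minimal sketch of this definition: the 2-hour horizon below is the Claude 4.1 Opus example from the text, while the $30 task cost is a hypothetical placeholder, not a measured figure. The function name is also just an illustration.

```python
def hourly_cost(task_cost_usd, time_horizon_hours):
    """Cost per human-equivalent hour: the cost of a task at the 50% time horizon,
    divided by how long that task typically takes a human."""
    if time_horizon_hours <= 0:
        raise ValueError("time horizon must be positive")
    return task_cost_usd / time_horizon_hours


# Horizon of 2 hours (from the text); task cost of $30 is an assumed value.
example_rate = hourly_cost(task_cost_usd=30.0, time_horizon_hours=2.0)
print(f"${example_rate:.2f} per hour")  # prints "$15.00 per hour"
```

Note that the denominator is human time, not the wall-clock time the model spends on the task.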

I’ve found that very few people are asking this question. And when I ask people what they think is happening to these costs over time, their opinions vary wildly. Some assume the total cost of a task is staying the same, even as the task length increases exponentially. That would imply an exponentially declining hourly rate. Others assume the total cost is also growing exponentially; after all, we’ve seen dramatic increases in the costs to access cutting-edge models. And most people (myself included) had little idea of how much it currently costs for AI agents to do an hour’s software engineering work. Are we talking cents? Dollars? Hundreds of dollars? An AI agent can’t cost more per hour than a human to complete these tasks, can it? Can it?

⁂

A couple of months ago I asked METR if they could share the cost data for their benchmarking. I figured it would be easy — just take the cost of running their benchmark for each model, plot it against release date and see how it is growing. Or plot the cost of each model vs its time horizon and see the relationship.

But they helpfully pointed out that it isn’t so easy at all. Their headline time-horizon numbers are meant to show the best possible performance that can be attained with a model (regardless of cost). So they run their models inside an agent scaffold until the performance has plateaued. Since they really want to make sure it has plateaued, they use a lot of compute on this and don’t worry too much about whether they’ve used too much. After all, if you are just trying to find the eventual height of a plateau, there is no problem in going far into the flat part of the graph.

But if you are trying to find out when the plateau begins, there is a problem with this strategy. Their total spend for each model is sometimes just enough to get onto the plateau and sometimes many times more than is needed. So total spend can’t be used as a direct estimate of the costs of achieving that performance.

Fortunately, they released a chart that can be used to shed some light on the key question of how hourly costs of LLM agents are changing over time:

This chart (from METR’s page for GPT-5) shows how performance increases with cost. The cost in question is the cost of using more and more tokens to complete the task (and thus more and more compute).

The yellow curve is the best human performance for each task. It steadily marches onwards and upwards, transforming more wages into longer tasks. Since it is human performance that is used to define the vertical axis for METR’s time horizon work, it isn’t surprising that this curve is fairly linear — it costs about 8 times as much to get a human software engineer to perform an 8-hour task as a 1-hour task.

The other colours are the curves for a selection of LLM-based agents. Unlike the humans, they all show diminishing returns, with the time horizon each one can achieve eventually stalling out and plateauing as more and more compute is added.

The short upticks at the end of some of these curves are an artefact of some models not being prepared to give an answer until the last available moment. This suggests that the model must have been still making progress during the apparent flatline before the uptick (just not showing it). Indeed, this chart was originally displayed on METR’s page for GPT-5 to show that they may have stopped its run before its performance had truly plateaued. These upticks do make analysis harder and hopefully future versions of this chart will be able to avoid these glitches.

⁂

So what can this chart tell us about our key question concerning the hourly cost of AI agents?

To tease out the lessons that lie hidden in the chart, we’ll need to add a number of annotations. The first step is to add lines of constant hourly cost. On a log-log plot like this, every constant hourly cost will be a straight line with slope 1. Lower hourly costs will appear as lines that are located further to the left.
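The slope-1 claim follows from the definition: at a constant hourly rate r, the horizon bought for a cost c is c / r, so log(horizon) = log(cost) - log(r). The sketch below checks this numerically, assuming cost on the horizontal axis and time horizon on the vertical axis. The $1 and $100 rates are arbitrary illustrative values and the function names are hypothetical.

```python
import math


def horizon_at_constant_rate(cost_usd, hourly_rate_usd):
    """Time horizon (in hours) that a given spend buys at a fixed hourly rate."""
    return cost_usd / hourly_rate_usd


def loglog_slope(hourly_rate_usd, cost_a, cost_b):
    """Slope of the constant-rate line between two costs on log-log axes."""
    h_a = horizon_at_constant_rate(cost_a, hourly_rate_usd)
    h_b = horizon_at_constant_rate(cost_b, hourly_rate_usd)
    return (math.log10(h_b) - math.log10(h_a)) / (math.log10(cost_b) - math.log10(cost_a))


# Two arbitrary rates: both lines have slope 1, whatever the rate.
slope_cheap = loglog_slope(1.0, cost_a=0.5, cost_b=50.0)
slope_dear = loglog_slope(100.0, cost_a=0.5, cost_b=50.0)

# The cost needed to reach a 1-hour horizon equals the hourly rate itself,
# so the cheaper line crosses the 1-hour level further to the left.
cost_for_one_hour_cheap = 1.0 * 1.0      # rate * hours
cost_for_one_hour_dear = 100.0 * 1.0

print(slope_cheap, slope_dear)
```

Changing the rate only shifts the line sideways; it never changes the slope.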

For each curve I’ve added a line of constant hourly cost that just grazes it. That is the cheapest hourly cost the model achieves. We can call the point where the line touches the curve the sweet spot for that model. Before a model’s sweet spot, its time horizon is growing super-linearly in cost — it is getting increasing marginal returns. The sweet spot is exactly the point at which diminishing marginal returns set in (which would correspond to the point of inflection if this was replotted on linear axes). It is thus a key point on any model’s performance curve.
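In code, the grazing line corresponds to the point on a curve with the lowest ratio of cost to time horizon. The sketch below finds it for a made-up curve. The (cost, horizon) pairs are invented to have the rise-then-plateau shape described above and are not read off METR's chart. The function name is hypothetical.

```python
def sweet_spot(curve):
    """Return (cost_usd, horizon_hours, hourly_rate) for the point on the curve
    with the lowest cost per human-equivalent hour."""
    cost, horizon = min(curve, key=lambda point: point[0] / point[1])
    return cost, horizon, cost / horizon


# Hypothetical (cost in USD, time horizon in hours) pairs for one model.
made_up_curve = [
    (0.1, 0.02),
    (0.3, 0.10),
    (1.0, 0.60),
    (3.0, 1.50),
    (10.0, 2.00),
    (30.0, 2.10),  # far onto the plateau: much more spend, barely more horizon
]

best_cost, best_horizon, best_rate = sweet_spot(made_up_curve)
plateau_rate = made_up_curve[-1][0] / made_up_curve[-1][1]

print(f"sweet spot: ${best_cost:.2f} for {best_horizon:.2f} h = ${best_rate:.2f}/h")
print(f"at the plateau: ${plateau_rate:.2f}/h")
```

With real data the curve would be sampled more finely, and the upticks mentioned earlier would need handling before taking the minimum.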

We can see that the human software engineer is at best $120 per hour, while the sweet spots for the AI agents range from $40 per hour for o3, all the way down to 40 cents per hour for Grok 4 and Sonnet 3.5. That’s quite a range of costs. While horizon lengths vary between these models by about a factor of 15 (judged at either the end-points or the sweet spots), their sweet-spot costs vary by a factor of 100.

And these are the best hourly rates for these models. On many task lengths (including those near their plateau) they cost 10 to 100 times as much per hour. For instance, Grok 4 is at $0.40 per hour at its sweet s

Source: Hacker News