Rising Cost of Evaluations, Falling Cost of Intelligence

Summary

Ben Cottier explores the tension between the rising cost of thoroughly evaluating AI systems and the rapidly falling cost of accessing advanced AI capabilities.

Session Transcript

I'm going to talk about an important tension between two recent trends in AI, which I think poses a challenge for evaluating and preparing for AI capabilities that are around the corner.
So, less than a year ago we saw reasoning models emerge. These are models like OpenAI o3 and, famously, DeepSeek R1. They are trained to perform longer chains of reasoning to solve more challenging tasks, especially in coding, mathematics, and science. And with reasoning models we see this behavior where the models perform better and better the more you spend running them.
Basically, if you get them to think for longer, then the accuracy improves on many tasks. At the same time, it quickly gets cheaper to run a model at a given level of performance. What evidence do we have for this? Well, we can look at the price of running a model that achieves some score on a benchmark, for example GPT-4-level performance or better on a set of PhD-level science questions.
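As a rough illustration of the kind of analysis this refers to, here is a minimal sketch: at each point in time, find the cheapest model whose benchmark score clears a fixed threshold. The model names, scores, and prices below are made-up placeholders, not figures from the talk.

```python
# Minimal sketch: track the cheapest model whose score clears a fixed threshold
# over time. All records below are illustrative placeholders, not real results.

records = [
    # (release_date, model_name, benchmark_score, price_usd_per_million_tokens)
    ("2023-03", "model-a", 0.36, 30.0),
    ("2024-05", "model-b", 0.48, 5.0),
    ("2024-12", "model-c", 0.55, 1.1),
]

THRESHOLD = 0.35  # e.g. "GPT-4-level or better" on the chosen question set

cheapest_so_far = float("inf")
for date, name, score, price in sorted(records):
    if score >= THRESHOLD:
        cheapest_so_far = min(cheapest_so_far, price)
        print(f"{date}: cheapest model at or above threshold costs ${cheapest_so_far}/M tokens ({name})")
```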
And what we see is a very rapid decrease in these prices over time. This is one of the rare plots in AI where the line is going down. Now, the rate of decrease we find spans a very wide range: it could be anywhere from a nine-fold decrease per year to a 900-fold decrease per year.
And it's also unclear whether this trend will continue at such a rapid rate in the coming years. But based on this, we can be fairly confident that once a new capability is first established at the frontier, it will quickly become much cheaper to access and use that capability.
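To make that range concrete, here is a small worked example of what a constant nine-fold versus 900-fold annual price decline implies; the $10-per-task starting price is a hypothetical placeholder, not a figure from the talk.

```python
# Illustrative arithmetic: how a price falls under a constant annual decline factor.
# The 9x and 900x rates come from the range quoted above; the starting price is made up.

def price_after(initial_price_usd: float, annual_decline_factor: float, years: float) -> float:
    """Price of a fixed capability level after `years` of constant annual decline."""
    return initial_price_usd / (annual_decline_factor ** years)

initial = 10.0  # hypothetical cost per task when the capability first appears
horizons = (0.5, 1.0, 2.0)
for factor in (9, 900):
    costs = ", ".join(f"${price_after(initial, factor, t):,.4f} after {t}y" for t in horizons)
    print(f"{factor}x/year decline: {costs}")
```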
So there are two trends going on simultaneously, and both are important to keep in mind: scaling and efficiency. With scaling, the more you spend running a model, the more capable it becomes. With efficiency, over time it gets cheaper to run a model at that same level of capability.
So this is important to keep in mind if we're trying to prepare for new AI capabilities. With reasoning models, we're likely to see big gains in capabilities in the next couple of years. And evaluating these capabilities in a thorough and timely way is going to be relatively expensive, perhaps on the order of millions of dollars per year.
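As a very rough sketch of how such a figure could arise, consider a back-of-envelope estimate like the one below; every number in it is an assumption chosen for illustration, not a figure from the talk.

```python
# Back-of-envelope sketch of annual evaluation spend.
# Every figure below is an illustrative assumption, not a number from the talk.

models_per_year = 10        # frontier models evaluated as they are released
benchmarks_per_model = 20   # evaluation suites run on each model
tasks_per_benchmark = 500   # tasks in each suite
runs_per_task = 5           # repeated runs to reduce variance
cost_per_run_usd = 2.0      # inference cost per run; long reasoning chains are token-heavy

annual_cost = (models_per_year * benchmarks_per_model * tasks_per_benchmark
               * runs_per_task * cost_per_run_usd)
print(f"Estimated annual evaluation spend: ${annual_cost:,.0f}")  # -> $1,000,000
```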
But meanwhile it will likely get much cheaper to run models with those same capabilities soon after they're first established. And that will make those capabilities more accessible. So if we don't run these expensive evaluations to get an accurate picture of what's possible with AI today, it might be hard to foresee potentially large and broad impacts from AI that are around the corner.
Thank you.