Book: The Emerging Science of Machine Learning Benchmarks

An exploration of the paradoxical role of benchmarks in AI development, explaining why they remain the primary driver of progress despite significant statistical and ethical flaws.

Machine learning turns on one simple trick: Split your data into training and test sets. Anything goes on the training set; rank the models on the test set. Let the model builders compete. Call this a benchmark.

Machine learning researchers cherish a good tradition of lamenting the shortcomings of machine learning benchmarks. Critics argue that static test sets and metrics promote narrow research objectives, stifling more creative scientific pursuits. Benchmarks also incentivize gaming the metrics, leading to inflated scores. Goodhart’s law cautions against competing over statistical measurements, but benchmarking ignores the warning. Over time, critics say, researchers overfit to benchmark datasets, building models that exploit artifacts. As a result, test set performance draws a skewed picture of model capabilities, deceiving us especially when comparing humans and machines. Add to this a slew of reasons why things don’t transfer from benchmarks to the real world.

These scorching critiques go hand in hand with ethical objections. Benchmarks reinforce and perpetuate biases in our representation of people, social relationships, culture, and society. Worse, the creation of massive human-annotated datasets extracts labor from a marginalized workforce excluded from the economic gains it enables.

All of this is true. Many have said it well. The critics have argued it convincingly. I’m particularly drawn to the claim that benchmarks serve industry objectives, giving big tech labs a structural advantage. The case against benchmarks is clear, in my view. What’s far less clear is the scientific case for benchmarks.

It’s undeniable that benchmarks have been successful as a driver of progress in the field. ImageNet was inseparable from the deep learning revolution of the 2010s, with companies competing fiercely over the best dog breed classifiers. A decade later, language model benchmarks reached geopolitical significance in the global competition over artificial intelligence. Tech CEOs recite the company’s number on MMLU—a set of college-level multiple-choice questions—in presentations to shareholders. News that DeepSeek’s R1 beat OpenAI’s o1 on some challenging reasoning benchmarks launched a frenzy that shook global stock markets.

Benchmarks come and go, but their centrality hasn’t changed. Competitive leaderboard climbing has been the main way machine learning advances. If we accept that progress in artificial intelligence is real, we must also accept that benchmarks have, in some sense, worked. But the fact that benchmarks worked is more of a hindsight observation than a scientific lesson. Benchmarks emerged in the early days of pattern recognition. They followed no scientific principles. To the extent that benchmarks had any theoretical support, that theory was readily invalidated by how people used benchmarks in practice. Statistics prescribed locking test sets in a vault, but machine learning practitioners did the opposite. They put them on the internet for everyone to use freely. Popular benchmarks draw millions of downloads and evaluations as model builders incrementally compete over better numbers.

Benchmarks are the mistake that made machine learning. They shouldn’t have worked and, yet, they did. In this book, my goal is to shed light on why benchmarks work and what for.

The first part of this book covers foundations, some mathematical, some empirical. The first few chapters after the introduction add just enough standard background material to make the book self-contained. Here, I stick closely to the canon. The next few chapters cover the train/test split, called the holdout method. I start with the classical guarantees for the holdout method and related tools in the family of cross-validation methods. These guarantees, however, don’t apply to how people use the holdout method in practice. The problem is adaptivity: Repeated use creates a feedback loop between the model and the data that invalidates traditional analysis. This problem of adaptivity is a cousin of Freedman’s paradox, a conundrum that has vexed statisticians since the 1980s. Freedman noticed how easily data-dependent statistical analyses can go wrong.
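To make the adaptivity problem concrete, here is a minimal simulation sketch. It is a toy illustration rather than anything taken from the book, and every name and parameter in it (`adaptive_holdout_demo`, `n`, `d`, the majority-vote rule) is an assumption chosen for the example. Labels are fair coin flips, so no predictor can truly beat 50% accuracy. The analyst first queries the holdout set once per feature, then uses that feedback to build a combined predictor.

```python
import numpy as np


def adaptive_holdout_demo(n=2000, d=500, seed=0):
    """Toy illustration of adaptive overfitting to a holdout set.

    Assumption: labels are fair coin flips, independent of all features,
    so no classifier can truly beat 50% accuracy.
    """
    rng = np.random.default_rng(seed)

    # Holdout set: n points, d binary features, labels unrelated to features.
    X = rng.choice([-1, 1], size=(n, d))
    y = rng.choice([-1, 1], size=n)

    # Round 1: query the holdout once per feature, as if each single-feature
    # predictor were a separate leaderboard submission.
    acc = (X == y[:, None]).mean(axis=0)

    # Round 2: use that feedback. Keep features that scored at least 50%,
    # flip those that scored below, and combine them by majority vote.
    signs = np.where(acc >= 0.5, 1, -1)

    def predict(features):
        return np.where(features @ signs >= 0, 1, -1)

    holdout_acc = (predict(X) == y).mean()

    # Fresh data from the same distribution, never used for feedback.
    X_new = rng.choice([-1, 1], size=(n, d))
    y_new = rng.choice([-1, 1], size=n)
    fresh_acc = (predict(X_new) == y_new).mean()

    return holdout_acc, fresh_acc


if __name__ == "__main__":
    holdout_acc, fresh_acc = adaptive_holdout_demo()
    print(f"accuracy on the reused holdout: {holdout_acc:.3f}")
    print(f"accuracy on fresh data:         {fresh_acc:.3f}")
```

Under these assumptions, the combined predictor scores noticeably above 50% on the reused holdout set while staying near chance on fresh data. The gap comes entirely from the feedback loop, since the data contain no signal at all.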

Freedman’s observation foreshadowed an ongoing scientific crisis in the statistical sciences. Evidently, successful replication is limited and false discovery common when researchers compete on the basis of statistics, such as p-values. But p-values aren’t the main culprit. Researcher degrees of freedom always seem to outwit statistical measurement. Indeed, Goodhart’s law predicts that statistical measurement breaks down under competitive pressure. What does that say about the benchmarking ecosystem, where researchers compete over statistics computed on a fixed test set?

The preconditions for crisis exist in machine learning, too. For one, it shares the Achilles’ heel of statistical measurement with other empirical sciences. In addition, machine learning operates in an ecosystem of maximal researcher degrees of freedom, rapid publication, and weak peer review. It might come as no surprise that absolute accuracy numbers—thought of as measurements of some capability—are woefully unreliable, failing to replicate even under similar conditions. Nevertheless, the situation in machine learning is markedly different. Model rankings replicate to a surprising degree. More specifically, three empirical facts emerge from the ImageNet era:

If machine learning appears to have thwarted scientific crisis, the question is why. I argue that the social norms and practices of the community rather than statistical methodology alone are key to understanding the function of benchmarks. A fundamental result shows that if the community only cares about identifying the best performing model at any point in time, the holdout method enjoys surprisingly strong theoretical guarantees.
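One published mechanism in this spirit is the Ladder algorithm of Blum and Hardt (2015). Whether it is the exact result referred to above is an assumption on my part. The sketch below shows the idea for classification error: the leaderboard releases only the best error so far, and updates it only when a submission improves on it by more than a step size. The class name `LadderLeaderboard` and the default `step=0.05` are illustrative choices, not part of any library.

```python
import numpy as np


class LadderLeaderboard:
    """Sketch of a Ladder-style leaderboard (after Blum and Hardt, 2015).

    Only the best error so far is released, and it only changes when a
    submission improves on it by more than `step`.
    """

    def __init__(self, holdout_labels, step=0.05):
        self.labels = np.asarray(holdout_labels)
        self.step = step            # illustrative value, not a recommendation
        self.best = float("inf")    # no score released yet

    def submit(self, predictions):
        # Empirical 0/1 error of this submission on the holdout set.
        error = float(np.mean(np.asarray(predictions) != self.labels))
        if error < self.best - self.step:
            # Significant improvement: release the error, rounded to the
            # nearest multiple of the step size.
            self.best = round(error / self.step) * self.step
        # Otherwise the submitter learns nothing new about the holdout set.
        return self.best
```

The design choice is that a submission which fails to beat the current best by a clear margin gets back the old number. Small fluctuations in holdout error never reach the submitter, which limits how much the feedback loop can leak about the test set.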

Summarizing these lessons, model rankings—rather than model evaluations—are the primary scientific export of machine learning benchmarks.

The first part of the book draws on lessons primarily from the ImageNet era, that is, roughly the decade following 2012. The ImageNet era was marked by a single central benchmark that featured both a training set and a test set. Its creators took care to clean labels thoroughly through aggregation. A chapter on data labeling and annotation shows why some common practices of label cleaning are inefficient when the primary goal is model ranking.

The second part of this book (starting with Chapter 10) is about recent developments around generative models, in particular, large language models. I cover the basics of large language models, scaling laws, emergent abilities, and post-training methods, all of which are necessary to appreciate the challenges of benchmarking in this day and age.

The new era departs from the old in some significant ways.

First, models train on the internet, or at least on massive, minimally curated web crawls. At the point of evaluation, we therefore don’t know and can’t control what training data the model saw. This turns out to have profound implications for benchmarking. The extent to which a model has encountered data similar to the test task during training skews model comparisons and threatens the validity of model rankings. A worse model may have simply crammed better for the test. Would you prefer a worse student who came better prepared for the exam, or the better student who was less prepared? If you prefer the latter, then you’ll need to adjust for the difference in test preparation. Thankfully, this can be done by fine-tuning each model on the same task-specific data before evaluation, without the need to train from scratch.

Second, models no longer solve a single task, but can be prompted to tackle pretty much any task. In response, multi-task benchmarks have emerged as the de facto standard to provide a hol

Source: Hacker News