Are we any good at measuring how intelligent AI is?

This is a far more complex question than it seems. Most of us who use AI even cursorily have been impressed by some of the things it can do, even considering occasional oopsies.

But those are obviously subjective reactions, and an entire industry is dedicated to ‘measuring’ how ‘intelligent’ a particular AI is. The measuring sticks are called benchmarks, and the benchmarking industry is in something of a crisis.

The first time most of us heard about this was when ChatGPT, then running GPT-3.5, passed the US Medical Licensing Exam (USMLE) late in 2022. It was something of a shock, judging by the breathless news coverage. In short order, OpenAI’s models were put through their paces in US bar exam simulations (2023), where GPT-4 scored in the top 10%; the SATs, where GPT-4 also landed in the top 10%; and the MMLU (Massive Multitask Language Understanding, a comprehensive test of language understanding), where GPT-4 did better than the average human. Scores have improved consistently since.

It might have continued in this vein, but ChatGPT very quickly attracted competition – Claude, Gemini, Copilot, DeepSeek, Mistral and others. The benchmarking industry then became the tail wagging the dog, because every new AI model wanted to show that it was the best. They wanted to top their competitors, and what better way to do this than by scoring higher on existing benchmarks? It was the ultimate product brag – “We beat our competitors in the latest benchmark tests! Our score was number one! Use our product! Theirs is not very smart!”

And so benchmarks, and what they purported to test, proliferated promiscuously.

Acronym-itis

How many tests are there and what do they test? Sadly, many suffer from acronym-itis, and the others have similarly inscrutable names.

Here is a partial list, stretching back 25 years – MNIST, GLUE, ImageNet, SQuAD 1.0, SQuAD 2.0, Switchboard, BBH, SuperGLUE, GSM8K, HellaSwag, HumanEval, LiveBench, ARC-AGI, TruthfulQA. They test image classification, visual reasoning, language understanding, language inference, reading comprehension, coding, mathematics, predictive reasoning, common-sense reasoning, ethics and bias. There are numerous others in specific domains of expertise, a thicket of exams and cognitive capabilities (there is even a Pokémon benchmark testing how quickly an AI can finish the game, for goodness’ sake).

Here’s the problem: AI benchmarks no longer behave like good tests. The moment they’re released, the fast-improving big ‘frontier’ AI models swallow them whole, ace the exam, and post their scores on social media like smug honour students. GLUE was meant to keep models busy for years; it lasted nine months. SuperGLUE was supposed to be tougher; gone in 18 months. Even MMLU, a massive suite of test questions, is now teetering on the brink of irrelevance as newer models breeze past.

Why does this happen? Because benchmarks leak. Training data is scraped from the internet, and benchmarks live on the internet. The result is a suspiciously high chance that your AI has already ‘seen’ the test. Imagine giving a final exam only to discover half your students downloaded it from Reddit last year. That’s where we are.
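
For the technically curious, here is roughly what checking for a leak can look like. This is a minimal sketch, not any lab's actual method: it assumes a benchmark question counts as suspect if long runs of its words also appear verbatim in a chunk of training text. The function names `ngrams` and `overlap_ratio` and the default run length of eight words are illustrative assumptions.

```python
def ngrams(text, n=8):
    """Return the set of word n-grams in text (lowercased, split on whitespace)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(benchmark_item, training_text, n=8):
    """Fraction of the benchmark item's n-grams that also occur in training_text.

    A value near 1.0 suggests the item may have been seen during training;
    a value near 0.0 suggests no verbatim overlap at this n-gram length.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        # The item is shorter than n words, so there is nothing to compare.
        return 0.0
    shared = item_grams & ngrams(training_text, n)
    return len(shared) / len(item_grams)
```

A check like this only catches word-for-word copies. A paraphrased or translated question would slip straight through, which is part of why leakage is so hard to rule out.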

And even when models haven’t memorised the answers, they’re trained to optimise for benchmarks themselves. Labs literally tune their models to squeeze out a few extra percentage points, a process affectionately known as ‘benchmarketing’.

Cheating spectacularly

The result: benchmarks stop telling us how intelligent a system is, and start telling us how well it can regurgitate exam prep. This leaves the human examiners increasingly unsure whether AI is truly learning – or just cheating spectacularly.

In the face of this, the answer to the question posed in this article’s headline is an awkward ‘we don’t know’. Every time we try to measure it, our yardsticks soon dissolve. This raises the question of why we want to measure it in the first place. Perhaps it’s because exams are the only testing ritual we understand, a way to convince ourselves that we’re still in charge of the classroom.

This brings us to Humanity’s Last Exam, a proposal dreamed up by Dan Hendrycks of the Center for AI Safety in 2024. The idea: create a benchmark so hard, so rich in reasoning, so resistant to leakage, that only truly expert-level AI could pass. It’s a brilliant name, partly because it frames the stakes with theatrical flair. It was administered for the first time in January of this year. All the major frontier AI models scored 10% or less. GPT-5, released in the last few weeks, just hit 25%. How long do you think it will take for this to become obsolete too?

Can we learn to evaluate intelligence without reducing it to a leaderboard? Can we distinguish between passing the bar and giving good legal advice? Can we build new ways of judging these systems that go beyond metrics and address the wider and messier complexity of our species’ smarts?

Because if we can’t, then we can never know whether AI is as intelligent as, or even more intelligent than, us. And then it will be too late, because one day we will be the student and it will be the teacher and we won’t even notice.

[Image: Nahrizul Kadri on Unsplash]

The views of the writer are not necessarily the views of the Daily Friend or the IRR.

Steven Boykey Sidley