GPQA, SWE-bench & Arena Elo

Meta just launched Muse Spark. The announcement says it beats GPT-5.4 on health tasks, ranks top-five globally on the Artificial Analysis Intelligence Index, and scores 89.5% on something called GPQA Diamond.

Eleven months ago, Meta said almost identical things about Llama 4, before people actually used it and the numbers collapsed.

So what are these benchmarks? How do the scores get calculated? And why does a model that tops every leaderboard sometimes feel mediocre the moment you use it?

This guide explains what the biggest AI benchmarks actually measure, including MMLU, GPQA Diamond, HumanEval, SWE-bench, HealthBench, Humanity’s Last Exam, and Chatbot Arena. It also explains how benchmark scores are calculated, why some tests matter more than others, and how AI labs can inflate benchmark results without improving real-world performance.

What Is an AI Benchmark?

A benchmark is just a standardized test: a fixed set of questions or tasks, given to every AI model in the same way and scored the same way. The idea is that if everyone takes the same test, you can compare the results fairly. But there is a practice the AI community has started calling benchmaxxxing: squeezing every possible point out of a benchmark through evaluation choices, cherry-picked settings, and training techniques that improve the score without necessarily improving the model.

We’ll get into the specifics of how this works as we go through each benchmark.

MMLU and MMLU-Pro: The Knowledge Test

What it is: Over 15,000 multiple-choice questions across 57 subjects. Law, medicine, chemistry, history, economics, computer science. Four answer choices per question.

What an actual question looks like:

A 60-year-old man presents with progressive weakness, hyporeflexia, and fasciculations in both legs. MRI shows anterior horn cell degeneration. Which of the following is the most likely diagnosis? (A) Multiple sclerosis (B) Amyotrophic lateral sclerosis (C) Guillain-Barré syndrome (D) Myasthenia gravis

The model outputs a letter. The test runner checks if it matches the answer key.

How the score is calculated: Before each question, the model is shown five example questions with correct answers; this is called 5-shot prompting. Then comes the real question. Score = correct answers ÷ total questions, expressed as a percentage.
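The scoring loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the official MMLU harness: the helper names (`format_question`, `build_five_shot_prompt`, `accuracy`) and the exact prompt layout are assumptions made for the example.

```python
from typing import Sequence, Tuple

LETTERS = "ABCD"

# Hypothetical example format: (question text, four choices, answer letter).
Example = Tuple[str, Sequence[str], str]


def format_question(question: str, choices: Sequence[str]) -> str:
    """Render one multiple-choice question with lettered options."""
    lines = [question]
    lines += [f"({letter}) {choice}" for letter, choice in zip(LETTERS, choices)]
    return "\n".join(lines)


def build_five_shot_prompt(
    examples: Sequence[Example], question: str, choices: Sequence[str]
) -> str:
    """Prepend five solved examples, then leave the real question unanswered."""
    if len(examples) != 5:
        raise ValueError("5-shot prompting expects exactly five examples")
    blocks = [f"{format_question(q, c)}\nAnswer: {a}" for q, c, a in examples]
    blocks.append(f"{format_question(question, choices)}\nAnswer:")
    return "\n\n".join(blocks)


def accuracy(predictions: Sequence[str], answer_key: Sequence[str]) -> float:
    """Score = correct answers / total questions, expressed as a percentage."""
    if len(predictions) != len(answer_key):
        raise ValueError("one prediction is required per question")
    correct = sum(p == k for p, k in zip(predictions, answer_key))
    return 100.0 * correct / len(answer_key)
```

Because the score is a plain ratio, a model that gets three of four letters right scores 75%, with no partial credit for a well-reasoned wrong answer.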

Why it’s nearly useless in 2026: Top models now score above 88% on MMLU. GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro are all bunched together above 87%. The test can no longer separate them; it’s like using a bathroom scale to measure the weight difference between two people of similar build. Technically possible, practically meaningless.

Researchers responded by building MMLU-Pro: same subjects, harder questions, ten answer choices instead of four, with options designed to look plausible even to knowledgeable people. On MMLU-Pro, the gaps between models start showing up again.

→ When you see MMLU in a press release in 2026, it’s mostly padding. It’s also the benchmark most likely to be inflated by training data contamination: models have had three years of internet data that overlaps heavily with MMLU-style questions.

GPQA Diamond: The Scientific Reasoning Test

This is the most credible academic benchmark in use today. The way it was built is what makes it trustworthy.

How the questions were made: Researchers hired PhD scientists in biology, physics, and chemistry. Each scientist wrote a question in their own field. Then a second PhD scientist in the same field tried to answer it. If that second expert got it wrong, the question passed the filter. Then three more people, smart non-domain experts given unlimited internet access and 30 minutes, tried to answer it. If they also failed, the question made it into the Diamond subset.

The result: 198 questions that require you to actually reason through hard science. You cannot Google them. The answers aren’t in Wikipedia.

What an actual question looks like:

Two quantum states with energies E1 and E2 have a lifetime of 10⁻⁹ sec and 10⁻⁸ sec, respectively. We want to clearly distinguish these two energy levels. Which of the following could be their energy difference so that they can be clearly resolved? (A) 10⁻⁸ eV (B) 10⁻⁹ eV (C) 10⁻⁴ eV (D) 10⁻¹¹ eV

To answer this, you need to know the energy-time uncertainty principle from quantum mechanics, calculate the natural linewidths of the energy levels, and check which energy difference is large enough to resolve them. The answer is (C), but you can’t find that by searching. You have to derive it.
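The derivation can be checked numerically. The sketch below assumes the usual order-of-magnitude form of the energy-time uncertainty relation, ΔE ≈ ħ/τ, and treats a gap as "clearly resolved" when it exceeds the wider of the two linewidths; the variable names are illustrative.

```python
HBAR_EV_S = 6.582e-16  # reduced Planck constant in eV*s


def natural_linewidth_ev(lifetime_s: float) -> float:
    """Energy uncertainty of a state from its lifetime: dE ~ hbar / tau."""
    return HBAR_EV_S / lifetime_s


lifetimes_s = [1e-9, 1e-8]
options_ev = [1e-8, 1e-9, 1e-4, 1e-11]

# The shorter-lived (broader) level sets the scale the gap has to beat.
widest_ev = max(natural_linewidth_ev(t) for t in lifetimes_s)

# Keep only the candidate gaps that exceed the widest linewidth.
resolvable = [gap for gap in options_ev if gap > widest_ev]
```

The widest linewidth comes out near 6.6 × 10⁻⁷ eV, so of the four options only the 10⁻⁴ eV gap is large enough to separate the two levels.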

How the score is calculated: Same letter-pick system as MMLU. The model is told to reason step-by-step and must end its response with “ANSWER: LETTER”, in capital letters only. If the model doesn’t follow that exact format, it gets zero for that question regardless of whether the reasoning was correct. This strict formatting rule is intentional: it forces models to commit to a specific answer rather than hedging.
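A strict format check of this kind might look like the sketch below. The regular expression and function names are assumptions for illustration; real evaluation harnesses may parse the final line differently.

```python
import re
from typing import Optional

# Case-sensitive and anchored at the end: the response must finish with
# "ANSWER: " followed by exactly one capital letter from A to D.
ANSWER_PATTERN = re.compile(r"ANSWER: ([A-D])$")


def extract_answer(response: str) -> Optional[str]:
    """Return the committed letter, or None if the format was not followed."""
    match = ANSWER_PATTERN.search(response.strip())
    return match.group(1) if match else None


def score_response(response: str, correct_letter: str) -> int:
    """1 if the model committed to the right letter in the right format, else 0."""
    return int(extract_answer(response) == correct_letter)
```

Under this check, a response ending in "answer: c" or "ANSWER: C or maybe D" scores zero even if the reasoning before it was sound.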

The benchmark in numbers:

Random guessing: 25% (four choices)
Smart non-experts with internet access: 34%
PhD-level domain experts: 65%
GPT-4 when it launched (2023): 39%
Muse Spark today: 89.5%
Gemini 3.1 Pro: 94.3%
Claude Opus 4.6: 92.8%

That jump from 39% to 89% in three years is real. These models have genuinely gotten better at scientific reasoning. But Muse Spark is still about 5 points behind Gemini on this test, across 198 questions. That’s roughly 10 questions. Meta calls this “competitive”, which is technically accurate.

HumanEval: The Basic Coding Test

What it is: 164 Python programming problems. Each problem is a function signature with a docstring explaining what the function should do.

What an actual question looks like:
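A HumanEval-style prompt looks roughly like the stub below: a signature plus a docstring, with the body left for the model to write. This is an illustrative reconstruction based on the longest-common-prefix behavior described in this section; the function name and docstring wording are assumptions, not a quotation from the benchmark.

```python
from typing import List


def longest_common_prefix(strs: List[str]) -> str:
    """Return the longest prefix shared by every string in strs.

    Return "" if there is no common prefix or the list is empty.

    >>> longest_common_prefix(["flower", "flow", "flight"])
    'fl'
    >>> longest_common_prefix(["dog", "racecar", "car"])
    ''
    """
    # The model is asked to generate everything below the docstring.
    raise NotImplementedError
```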

The model writes the function body. An automated test runner then executes the code against 10-15 hidden test cases, inputs with known correct outputs. Either every test case passes, or the problem fails.

How the score is calculated: The main metric is pass@1: did the model’s first attempt pass all the hidden tests? Score = number of problems where the code worked ÷ 164 total problems.

Example of pass vs. fail:

A correct solution to a longest-common-prefix problem returns "fl" for ["flower", "flow", "flight"] and "" for ["dog", "racecar", "car"], and handles edge cases like an empty list. A model that hardcodes the visible examples but fails on an edge case like a single-element list gets zero for that problem.
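The all-or-nothing scoring can be sketched as follows. The hidden tests here are stand-ins invented for the example, and `pass_at_1` assumes exactly one attempt per problem, as described above; it is not the benchmark's own harness.

```python
from typing import Callable, List, Sequence, Tuple

TestCase = Tuple[List[str], str]  # (input list, expected prefix)

# Stand-ins for the hidden tests; the real ones are not shown to the model.
HIDDEN_TESTS: List[TestCase] = [
    (["flower", "flow", "flight"], "fl"),
    (["dog", "racecar", "car"], ""),
    ([], ""),
    (["alone"], "alone"),
]


def correct_solution(strs: List[str]) -> str:
    """Shrink the first string until every other string starts with it."""
    if not strs:
        return ""
    prefix = strs[0]
    for s in strs[1:]:
        while not s.startswith(prefix):
            prefix = prefix[:-1]
    return prefix


def hardcoded_solution(strs: List[str]) -> str:
    """Memorizes the visible docstring examples and nothing else."""
    if strs == ["flower", "flow", "flight"]:
        return "fl"
    return ""


def passes_all(candidate: Callable[[List[str]], str], tests: Sequence[TestCase]) -> bool:
    """A problem counts as solved only if every hidden test passes."""
    for args, expected in tests:
        try:
            if candidate(list(args)) != expected:
                return False
        except Exception:
            return False
    return True


def pass_at_1(first_attempts: Sequence[bool]) -> float:
    """Percentage of problems whose first attempt passed all hidden tests."""
    return 100.0 * sum(first_attempts) / len(first_attempts)
```

Here `hardcoded_solution` passes three of the four stand-in tests but fails the single-element list, so `passes_all` returns False and the problem contributes nothing to the score.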

Why it’s outdated: Top models now solve 90%+ of these 164 problems. They’ve had years to train on HumanEval-style tasks. Researchers openly question how many models may have seen these exact problems in training. Leading with HumanEval in 2026 is like a car company leading its safety pitch with a test from 2015.


SWE-bench: The Real Software Engineering Test

What it is: Real GitHub issues from real open-source repositories. The model is given the issue description and the full codebase and must produce a code patch (a diff) that fixes the bug.

What an actual task looks like:

A developer files a GitHub issue in the sympy math library: “The simplify() function returns the wrong result when called on expressions containing nested Piecewise objects under certain conditions.”

The model gets the issue text, navigates a codebase with thousands of files, identifies the source of the bug, and writes a patch. That patch is automatically applied to the codebase, and the existing test suite runs to check that the fix works and didn’t break anything else.

How the score is calculated: Pass/fail at the issue level. Score = percentage of issues where the model’s patch passed all tests.
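Issue-level pass/fail can be sketched like this. The `PatchResult` record and its field names are hypothetical, chosen to mirror the two checks described above (the fix works, and nothing else broke); they are not SWE-bench's actual data format.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PatchResult:
    """Outcome of applying one model-written patch (hypothetical schema)."""

    issue_id: str
    patch_applied: bool            # did the diff apply cleanly?
    fix_tests_passed: bool         # do the tests for the reported bug pass?
    regression_tests_passed: bool  # does the rest of the suite still pass?

    @property
    def resolved(self) -> bool:
        # An issue counts only if every check succeeds; there is no partial credit.
        return (
            self.patch_applied
            and self.fix_tests_passed
            and self.regression_tests_passed
        )


def resolved_rate(results: Iterable[PatchResult]) -> float:
    """Percentage of issues where the patch passed all tests."""
    results = list(results)
    if not results:
        raise ValueError("at least one result is required")
    return 100.0 * sum(r.resolved for r in results) / len(results)
```

A patch that fixes the reported bug but breaks an unrelated test is scored exactly like a patch that never applied.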

Why this benchmark matters more than HumanEval: Because there’s no memorization shortcut. The repositories are real, the bugs are real, and the evaluation environment is strictly controlled. You either fixed the bug or you didn’t.

Where Muse Spark stands here: Meta’s own blog post acknowledges “current performance gaps, especially in coding workflows.” SWE-bench is almost certainly where that shows up. Claude Opus 4.6 currently leads most coding evaluations.

Humanity’s Last Exam: The Frontier Reasoning Test

What it is: Around 2,500 questions written by researchers and specifically designed to exceed what current AI can answer: PhD-level and beyond, across math, science, history, and law.

Why Muse Spark highlights it: In its “Contemplating” mode, which launches multiple sub-agents working in parallel on different parts of a problem, Muse Spark scored 50.2%. GPT-5.4 in its highest-effort mode scored 43.9%. Gemini’s Deep Think mode scored 48.4%.

This is Muse Spark’s most legitimate lead across any benchmark. The gap is real (6+ points over GPT-5.4) and the benchmark is genuinely hard. One caveat: Contemplating mode uses significantly more compute than a standard response. You’re paying, in time and in API cost, for that performance.

HealthBench: The Clinical Reasoning Test

What it is: Clinical and medical reasoning tasks evaluated by physicians. Questions cover patient symptom interpretation, drug interactions, treatment decisions, and health information accuracy.

How the score is calculated: Unlike automated benchmarks, HealthBench answers are graded against physician-defined standards. The score represents the percentage of answers that met clinical accuracy requirements.

The numbers: Muse Spark 42.8%. GPT-5.4 40.1%. Gemini 3.1 Pro 20.6%.

This is Muse Spark’s most defensible lead in any benchmark. A 22-point gap over Gemini on a physician-graded test is significant.

[Table: Muse Spark vs GPT-5.4 vs Gemini summary]

Chatbot Arena: The Human Preference Test

This one is different from every other benchmark, and understanding how it works explains the Llama 4 scandal.

What it tests: Whether a human user prefers one model’s response over another.

How it works: Two anonymous models are shown the same prompt. A real user reads both responses and picks which one they prefer. Millions of these pairwise comparisons are run. The results feed into a statistical model called Bradley-Terry, which converts win/loss records into Elo scores: the same system used to rank chess players.

If Model A beats Model B in 60% of comparisons, Model A gets more points. Over time, after enough comparisons, the rankings stabilize into a leaderboard.
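To make the conversion concrete, here is a minimal sketch of a Bradley-Terry fit using a simple iterative update, followed by a mapping onto the 400-point logarithmic Elo scale. It is an illustration under those assumptions, not LMArena's actual pipeline, and the function names are made up for the example.

```python
import math
from typing import Dict, Tuple


def fit_bradley_terry(
    wins: Dict[Tuple[str, str], int], iterations: int = 200
) -> Dict[str, float]:
    """Estimate a strength per model from pairwise results.

    wins[(a, b)] is the number of comparisons in which a beat b.
    Assumes every model has at least one win; a model with zero wins
    would be driven to zero strength.
    """
    models = sorted({m for pair in wins for m in pair})
    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for i in models:
            total_wins = sum(w for (winner, _), w in wins.items() if winner == i)
            denom = 0.0
            for j in models:
                if j == i:
                    continue
                games = wins.get((i, j), 0) + wins.get((j, i), 0)
                denom += games / (strength[i] + strength[j])
            updated[i] = total_wins / denom if denom else strength[i]
        # Normalize so strengths sum to 1; only their ratios matter.
        norm = sum(updated.values())
        strength = {m: s / norm for m, s in updated.items()}
    return strength


def elo_gap(strength_a: float, strength_b: float) -> float:
    """Rating difference implied by two strengths on the 400-point Elo scale."""
    return 400.0 * math.log10(strength_a / strength_b)


# Model A beats Model B in 60 of 100 comparisons.
strengths = fit_bradley_terry({("A", "B"): 60, ("B", "A"): 40})
gap = elo_gap(strengths["A"], strengths["B"])
```

With these inputs the fitted strengths are 0.6 and 0.4, and the implied gap is about 70 rating points. A 60% win rate is therefore a fairly small lead on an Elo-style leaderboard.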

Why this benchmark is gameable: Human users tend to prefer responses that are long, confident-sounding, and well-formatted, even when a shorter, more accurate answer would serve them better. A model that adds enthusiasm, uses bold text, and gives elaborately structured responses will score better on LMArena than a model that gives a direct, correct answer in two sentences.

And that’s what happened with Llama 4.

The Llama 4 Incident

When Meta launched Llama 4 in April 2025, its announcement said the model ranked #2 on LMArena, just behind Gemini 2.5 Pro, with an Elo score of 1417. That number was technically accurate, but the model that earned that score was not the one being released to the public.

The model Meta submitted to LMArena was called “Llama-4-Maverick-03-26-Experimental.” Researchers who later compared it against the publicly downloadable version found consistent behavioral differences:

The experimental version (LMArena): verbose responses, heavy use of emojis, elaborate formatting, dramatic structure, long elaborations even for simple questions.

The public version (what you’d actually use): concise, plain, direct, no emojis.

LMArena’s voting system reliably preferred the first version. Real users in real use cases preferred the second. When the actual public model was separately added to the leaderboard, it ranked 32nd.

There’s another number worth knowing: when LMArena turned on Style Control, removing the formatting and length advantage, Llama 4 Maverick dropped from 2nd place to 5th. The model’s content quality, stripped of its presentational packaging, was much less impressive.

LMArena stated publicly: “Meta’s interpretation of our policy did not match what we expect from model providers. Meta should have made it clearer that ‘Llama-4-Maverick-03-26-Experimental’ was a customized model to optimize for human preference.” They updated their submission rules afterward.

And on ARC-AGI, a benchmark designed to test genuine novel reasoning rather than pattern matching, Llama 4 Maverick scored 4.38% on ARC-AGI-1 and 0.00% on ARC-AGI-2. This was never in the press release.


How AI Labs Game Benchmark Scores: Goodhart’s Law and Benchmaxxxing

There’s a principle from economics called Goodhart’s Law: when a measure becomes a target, it stops being a good measure.

In plain English: the moment everyone agrees that GPQA Diamond is the number that matters, labs start optimizing specifically for GPQA Diamond. Scores go up but the real-world capability may not move at all.

This has a name in the AI community now: benchmaxxxing. It’s the practice of squeezing every possible point out of a benchmark through techniques that improve the score without necessarily improving the model. Some of these techniques are legitimate engineering and some are closer to the gaming Meta did with LMArena. The line is genuinely blurry, which is part of what makes this hard to call out.

Here’s what benchmaxxxing actually looks like in practice:

Cherry-picking which benchmarks to publish. Every model gets evaluated on dozens of benchmarks internally. The ones that appear in the press release are the ones the model did well on. The rest disappear. This is universal; every lab does it. Llama 4’s ARC-AGI score of 0.00% was not in the announcement.

Choosing favorable evaluation settings. Many benchmarks can be run in different ways: different prompting styles, different numbers of example questions shown beforehand, different temperatures. Labs run all the variants internally and publish the best result. This is technically allowed but rarely disclosed.

Training on benchmark-adjacent data. If you know a benchmark tests quantum mechanics reasoning, you can make sure your training set is heavy on quantum mechanics. The questions themselves aren’t in the training data, but the knowledge required to answer them is saturated. This is nearly impossible to distinguish from genuine capability improvement from the outside.

Benchmark contamination, the extreme version. Sometimes actual benchmark questions, or near-identical variants, end up in training data. This can happen accidentally when training on internet scrapes. It can also happen less accidentally. Susan Zhang, a former Meta AI researcher who later moved to Google DeepMind, shared research earlier in 2025 documenting how benchmark datasets can be contaminated through training corpus overlap. When a model sees the question and answer during training, it has essentially memorized the test. And the score reflects memory, not reasoning.

Majority voting and repeated sampling. Some labs run each benchmark question multiple times and take the most common answer. A model that scores 80% on one attempt might score 88% across 32 attempts. Meta specifically disclosed that it doesn’t do this for Muse Spark’s reported numbers; it uses zero temperature and single attempts.
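Two of the tactics above are easy to sketch in code: majority voting over repeated samples, and a crude contamination check based on shared word n-grams. Both functions are simplified illustrations; the 8-word window is an arbitrary assumption, and real contamination audits are considerably more involved.

```python
from collections import Counter
from typing import List, Set, Tuple


def majority_vote(samples: List[str]) -> str:
    """Return the most common answer across repeated samples of one question.

    Ties go to the answer that appeared first.
    """
    return Counter(samples).most_common(1)[0][0]


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """All runs of n consecutive lowercase words in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def ngram_overlap(benchmark_item: str, corpus_chunk: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the chunk.

    A value near 1.0 suggests the item appears almost verbatim in the corpus.
    """
    item = word_ngrams(benchmark_item, n)
    if not item:
        return 0.0
    return len(item & word_ngrams(corpus_chunk, n)) / len(item)
```

For instance, `majority_vote(["A", "C", "A", "B", "A"])` returns "A" even though two of the five samples were wrong, which is how repeated sampling can lift a reported score above what a single attempt would earn.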

The deepest problem with Goodhart’s Law in AI is that it creates a ratchet effect. Each new model needs to beat the previous one’s benchmark scores, or it’s declared a failure. So every release gets more optimized for the benchmarks that exist, which makes those benchmarks less informative over time, which drives the creation of harder benchmarks, which then also get optimized for. MMLU was the gold standard in 2022 but it’s saturated now. GPQA Diamond replaced it.

What Benchmarks Still Can’t Tell You

Speed. GPQA Diamond says nothing about whether the model responds in 1 second or 10.

Cost. A model scoring 92% at $15 per million tokens and one scoring 89% at $1 per million tokens are different choices depending on how much volume you’re running.

Consistency. A model averaging 90% on a benchmark but producing catastrophically wrong answers 2% of the time is a different risk profile from one that scores 85% uniformly. Benchmarks report averages. Averages hide tails.

Your specific task. None of these benchmarks were designed for your documents, your prompts, or your users. A model that dominates GPQA Diamond might handle an insurance form extraction task worse than a smaller, cheaper model trained on domain-specific data.
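The cost and consistency points are simple arithmetic, sketched below. The prices and the 2% failure rate come from the examples above; the monthly volume of 500 million tokens and 100,000 requests are assumptions picked to make the numbers concrete.

```python
def monthly_cost_usd(tokens_per_month: int, price_per_million_usd: float) -> float:
    """Spend for a given token volume at a per-million-token price."""
    return tokens_per_month / 1_000_000 * price_per_million_usd


def expected_catastrophic_answers(requests: int, failure_rate: float) -> float:
    """How many badly wrong answers to expect at a given tail failure rate."""
    return requests * failure_rate


TOKENS_PER_MONTH = 500_000_000  # assumed workload
REQUESTS_PER_MONTH = 100_000    # assumed workload

premium_cost = monthly_cost_usd(TOKENS_PER_MONTH, 15.0)  # the 92% model
budget_cost = monthly_cost_usd(TOKENS_PER_MONTH, 1.0)    # the 89% model
bad_answers = expected_catastrophic_answers(REQUESTS_PER_MONTH, 0.02)
```

At this volume the three-point accuracy difference costs $7,500 a month instead of $500, and a 2% tail works out to roughly 2,000 catastrophically wrong answers a month. Neither figure appears on a leaderboard.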

Evaluate AI Models for Your Own Use Case

You can actually work out which model is best for you, yourself.

Take your ten or twenty most representative tasks: the exact prompts, documents, or questions you’d send to the model in practice. Run every model you’re considering on those exact inputs. Score the outputs yourself (or have someone with domain expertise do it).

That single custom test will tell you more than any benchmark table in a press release. Because benchmarks tell you where a model claims to stand. Your test set tells you where it actually has to show up.
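A custom evaluation like this needs very little code. The sketch below assumes each model is wrapped in a plain function that takes a prompt and returns text, and that scoring is done by a grader function you supply (a human reviewer's judgment, or a simple check); all names here are hypothetical and no particular vendor API is implied.

```python
from typing import Callable, Dict, List

Model = Callable[[str], str]          # takes a prompt, returns a response
Grader = Callable[[str, str], float]  # (prompt, response) -> score in [0, 1]


def evaluate(
    models: Dict[str, Model], prompts: List[str], grader: Grader
) -> Dict[str, float]:
    """Run every model on the same prompts and return its mean score."""
    if not prompts:
        raise ValueError("at least one prompt is required")
    scores = {}
    for name, model in models.items():
        total = sum(grader(prompt, model(prompt)) for prompt in prompts)
        scores[name] = total / len(prompts)
    return scores


def rank(scores: Dict[str, float]) -> List[str]:
    """Model names ordered from best to worst mean score."""
    return sorted(scores, key=scores.get, reverse=True)
```

In practice the `models` dictionary would wrap whatever client calls you already use, and `prompts` would be your ten or twenty representative tasks. Keeping the inputs and the grader fixed across models is what makes the comparison fair.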

