HLE Explained: Humanity's Last Exam for AI Models
Humanity's Last Exam is a 3,000-question benchmark designed to outlast frontier AI models. Here's what HLE actually tests and how to read the score.
By early 2025, frontier models had effectively saturated MMLU and were closing in on the human-expert ceiling on GPQA. The reasoning-benchmark community needed something harder — a test that couldn’t be aced for years, not months. That’s the gap Humanity’s Last Exam (HLE) was built to fill. It’s now a fixture on every serious leaderboard, including the AI Models Benchmark table, and the score is one of the cleanest single numbers we have for “how close is this model to expert-level knowledge across everything.” This post explains what HLE actually is, how it’s graded, and what the headline number does and doesn’t mean.
What HLE actually is
Humanity’s Last Exam is a roughly 3,000-question multimodal benchmark released in January 2025 by the Center for AI Safety and Scale AI, with contributions from nearly 1,000 subject-matter experts across more than 500 institutions in over 50 countries. The name is deliberately provocative — the project’s premise is that this should be the last academic-knowledge benchmark the field needs, because anything a frontier model can ace on HLE is something it has genuinely mastered.
A few design decisions distinguish it from earlier benchmarks:
- Breadth. Over 100 subjects, spanning math, physics, chemistry, biology, computer science, classics, law, linguistics, humanities, and the long tail of academic specializations.
- Multimodal. Roughly 14% of questions include images — diagrams, figures, manuscript scans — that the model must actually read to answer correctly.
- Crowdsourced from experts only. Every question was written by someone with credentialed expertise in the subfield it covers, then validated by other experts. Contributors were paid, with substantial bonuses for the questions that survived review.
- Closed-ended but not pure multiple choice. A mix of multiple-choice and short-form exact-answer questions, all designed so an automated grader can score them without subjective judgment.
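If you want to poke at the questions yourself, the public split is distributed through Hugging Face. Here is a minimal loading-and-inspection sketch; the `cais/hle` repo id, split name, and field names are assumptions based on typical dataset layouts, so check the dataset card for the real schema (the set is gated, so you’ll need an access token):

```python
# Sketch: inspecting the public HLE split. Repo id, split name, and field names
# are assumptions -- verify against the dataset card. The dataset is gated, so
# authenticate with a Hugging Face token first (huggingface-cli login).
from datasets import load_dataset

ds = load_dataset("cais/hle", split="test")  # assumed repo id and split name

for row in ds.select(range(3)):
    print((row.get("question") or "")[:120])     # question text (assumed field)
    print("type:", row.get("answer_type"))       # multiple choice vs. exact match (assumed field)
    print("gold:", row.get("answer"))            # accepted answer string (assumed field)
    print("has image:", bool(row.get("image")))  # ~14% of questions are multimodal
    print("---")
```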
The headline result at launch: the best frontier models scored in the single digits. That’s the whole point: HLE was built with enough headroom that the saturation curve would take years, not months.
How HLE is actually graded
Grading is mechanical but more nuanced than AIME’s integer-match or GPQA’s letter-pick:
- Multiple-choice questions are graded by exact letter match, like GPQA.
- Short-form questions are graded by checking the model’s final answer against a curated set of acceptable answer strings.
- Image questions are graded the same way as text — what changes is the input modality, not the output.
The short-form grading is where HLE gets interesting. A correct answer phrased slightly differently from the gold reference can be marked wrong, which means the benchmark slightly under-rewards models that produce verbose or non-standard outputs. Frontier labs address this explicitly: vendor numbers usually come from a normalized-string grader plus a fallback LLM-as-judge step to catch obvious paraphrases. A 25% score under one grader can be 27% under another. Treat the methodology footnote as load-bearing.
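To make the grader differences concrete, here is a minimal sketch of the two mechanical paths plus one possible normalizer. The normalization rules are illustrative assumptions, not HLE’s official grader, and a vendor harness would typically bolt an LLM-as-judge fallback onto the short-form path:

```python
# Minimal grading sketch. The normalizer below is one arbitrary choice among
# many -- which is exactly why the same transcripts can score a point or two
# apart under different harnesses.
import re

def normalize(ans: str) -> str:
    """Lowercase and collapse punctuation/whitespace (an illustrative choice)."""
    return re.sub(r"[^a-z0-9]+", " ", ans.lower()).strip()

def grade_multiple_choice(predicted: str, gold: str) -> bool:
    # Exact letter match, as on GPQA.
    return predicted.strip().upper() == gold.strip().upper()

def grade_short_form(predicted: str, accepted: list[str]) -> bool:
    # Compare the model's final answer against every accepted reference string.
    p = normalize(predicted)
    return any(p == normalize(a) for a in accepted)

# A correct answer phrased differently still fails exact match; that gap is
# what the LLM-as-judge fallback step is meant to close.
print(grade_short_form("Charles the Great", ["Charlemagne"]))  # False
print(grade_short_form("charlemagne.", ["Charlemagne"]))       # True after normalization
```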
Why HLE scores look so different across models
The biggest single driver of HLE score variation is tool use, and it’s the thing most likely to confuse a casual reader of a leaderboard.
- No tools. The model answers from internal knowledge alone. Lowest scores; cleanest signal of what the model has actually internalized.
- With code execution. The model can run Python to compute, simulate, or look things up. Inflates math and physics scores noticeably.
- With web search / browsing. The model can look up factual answers. Inflates scores on retrieval-heavy questions massively, and arguably misses the point of the benchmark, since “can the model find the answer on the web” is a different question than “does the model know the answer.”
Vendors usually disclose which configuration produced the headline number. The same model can score 20% no-tools and 35%+ with full agentic tool use on the same benchmark. Cross-model comparisons need matched configurations — comparing one vendor’s no-tools run against another’s tool-use run will mislead you every time.
An “HLE: 32%” headline can mean a tool-using agent loop with web access, or a single-shot reasoning model with nothing but its weights. The number alone doesn’t tell you which.
The other major driver is reasoning budget. Reasoning-tuned models (o-series, Claude with thinking, Gemini Thinking) pull ahead of standard models on HLE more dramatically than on AIME or GPQA, because the questions reward sustained multi-step deliberation more than fast pattern-matching.
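One way to keep both drivers straight, whether you’re running evals or just reading them, is to treat the configuration as an explicit, named object that travels with the score. This is a purely illustrative sketch, not any vendor’s actual harness:

```python
# Illustrative run configuration for an HLE eval. The point is that the label,
# not just the number, is what makes cross-model comparisons meaningful.
from dataclasses import dataclass

@dataclass(frozen=True)
class HLERunConfig:
    model: str
    code_execution: bool = False             # Python tool: lifts math/physics scores
    web_search: bool = False                 # retrieval: lifts knowledge lookups
    max_reasoning_tokens: int | None = None  # thinking budget, if the model exposes one

    def label(self) -> str:
        tools = [name for name, on in [("code", self.code_execution),
                                       ("search", self.web_search)] if on]
        return f"{self.model} ({'+'.join(tools) if tools else 'no tools'})"

# The same model, two very different headline numbers:
print(HLERunConfig("frontier-model-x").label())                      # frontier-model-x (no tools)
print(HLERunConfig("frontier-model-x", True, True, 32_000).label())  # frontier-model-x (code+search)
```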
What an HLE score does and doesn’t tell you
A strong HLE score genuinely signals:
- Broad expert-level knowledge. Most questions are unanswerable without specific subfield expertise. A model that scores well has internalized a lot of niche territory.
- Multimodal competence. Image questions force the model to actually read figures, not just guess from context.
- Reasoning across long-tail domains. GPQA covers three sciences. HLE covers everything from organic chemistry to medieval Latin paleography.
What it doesn’t tell you:
- Agentic capability. HLE is one-shot Q&A, not a multi-step task. Long-horizon agentic work is measured by SWE-Bench Verified and similar benchmarks.
- Mathematical reasoning specifically. AIME 2025 is a much sharper instrument for that capability.
- Whether the model has been contaminated. HLE has been on the public internet since January 2025. By now, some leakage into newer models’ training data is essentially certain. Scores from models trained before mid-2025 are the cleanest.
- Calibration. A model can score 30% on HLE and still be confidently wrong on the other 70% — overconfidence at the frontier of knowledge is a real failure mode.
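The HLE authors track this failure mode directly by asking models to state a confidence with each answer and reporting a calibration error alongside accuracy. Here is a minimal binned-calibration-error sketch, assuming you already have per-question confidences and correctness flags; the ten-bin scheme is an illustrative choice, not the official metric:

```python
# Binned (ECE-style) calibration error: weighted average gap between stated
# confidence and actual accuracy within each confidence bin.
def calibration_error(confidences: list[float], correct: list[bool], bins: int = 10) -> float:
    total, err = len(confidences), 0.0
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        idx = [i for i, c in enumerate(confidences)
               if lo <= c < hi or (b == bins - 1 and c == 1.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        err += (len(idx) / total) * abs(avg_conf - accuracy)
    return err

# A model that claims 90% confidence on everything but is right 30% of the time:
print(calibration_error([0.9] * 100, [True] * 30 + [False] * 70))  # 0.6
```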
How to read HLE on a leaderboard
Practical rules when comparing models on this metric:
- Confirm the tool-use configuration. “No tools” vs “with web search” is the single biggest source of confusion. Default to no-tools numbers for cross-model comparisons.
- Check for the multimodal subset. Some leaderboards report text-only HLE separately from full HLE. Mixing them is apples-to-oranges.
- Don’t anchor on small gaps. With 3,000 questions, HLE is statistically far more stable than AIME, but grading-methodology variance can still swamp single-digit point gaps (see the back-of-envelope sketch after this list).
- Pair it with the other reasoning benchmarks. A model that’s strong on HLE but weak on AIME is a knowledge-heavy model that doesn’t reason fluently. The opposite profile shows up too. Both are useful information.
- Watch the trajectory. From low single digits at launch in early 2025 to scores well above 30% on the strongest configurations a year later — the curve is steeper than the original “this will last for years” framing predicted.
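On the stability point, a back-of-envelope normal-approximation standard error shows why 3,000 questions buy much tighter error bars than a 30-question AIME set, while grader-to-grader drift of a point or two can still matter (a rough sketch, not a substitute for a proper significance test):

```python
# Normal-approximation standard error for an accuracy p measured on n questions.
import math

def accuracy_stderr(p: float, n: int) -> float:
    return math.sqrt(p * (1 - p) / n)

print(f"HLE,  p=0.30, n=3000: +/- {1.96 * accuracy_stderr(0.30, 3000):.1%}")  # ~1.6 points
print(f"AIME, p=0.80, n=30:   +/- {1.96 * accuracy_stderr(0.80, 30):.1%}")    # ~14 points
```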
What’s next for HLE
HLE was built to be the benchmark that doesn’t saturate for years. The frontier has chewed through the headroom faster than expected, especially with tool-using configurations, but the no-tools track still has substantial room. Expect HLE to remain the canonical broad-knowledge metric until either contamination accumulates past usable levels or scores cross some informal “human expert across all fields” threshold — neither of which is imminent.
The successors already starting to appear focus on what HLE doesn’t measure: long-horizon agentic tasks, open-ended reasoning that resists exact-match grading, and adversarial questions designed to defeat retrieval. For now, HLE remains the cleanest single number for “does this model know what humans collectively know.”
If that’s the capability you care about, start with the HLE column on the leaderboard — and read it with the configuration caveats above in mind. The benchmarks tag archive collects the rest of the methodology series.