
GPQA Explained: The Graduate-Level Reasoning Benchmark

GPQA is a graduate-level science benchmark designed to be unsolvable by Google search alone. Here's what the score actually means and how to use it.

If you’ve spent more than ten minutes on any LLM leaderboard in the last year, you’ve seen GPQA. It sits next to MMLU and AIME in the headline columns, usually reported as “GPQA Diamond.” It’s one of the few benchmarks frontier labs still cite without flinching — but the score is loaded with assumptions most readers never unpack. This post explains what GPQA tests, why “Diamond” matters, and how to read the number on the AI Models Benchmark leaderboard without overclaiming.

What GPQA actually is

GPQA — short for Graduate-Level Google-Proof Q&A — is a 448-question multiple-choice benchmark introduced by Rein et al. in 2023. Every question was written by a domain expert with a PhD or active PhD candidacy in biology, physics, or chemistry, then validated by other experts in the same subfield. Each question has four answer choices, so random guessing yields 25%.

What makes GPQA distinctive isn’t the difficulty — plenty of benchmarks are hard. It’s the “Google-proof” property. The authors paid non-expert validators (PhDs in other scientific fields) to attempt the questions with unrestricted internet access and over 30 minutes per question. Those non-experts scored roughly 34% — barely above chance. In-domain experts, by contrast, scored around 65%.

That gap is the entire point of the benchmark. A high GPQA score is supposed to mean the model is doing something closer to scientific reasoning than to retrieval.

Diamond, Main, and Extended — what subset are you looking at?

GPQA ships in three nested subsets, and conflating them is the most common reporting error:

  • GPQA Extended (546 questions) — the full pool, including questions that didn’t fully clear expert validation.
  • GPQA Main (448 questions) — the standard set used in the original paper.
  • GPQA Diamond (198 questions) — the strictest subset: the questions that both expert validators answered correctly and that a majority of non-expert validators got wrong.

Almost every modern leaderboard reports GPQA Diamond. It’s the hardest subset, the most contamination-resistant, and the one where the human-expert baseline is most reliable. When you see a single “GPQA” column on a comparison table — including the leaderboard here — assume Diamond unless explicitly stated otherwise.
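
If you want to look at the questions yourself, the dataset is distributed via the Hugging Face Hub (it's gated, so you'll need to accept the terms and log in). Here's a minimal loading sketch; the repo id, config name, and column name reflect my understanding of the published dataset rather than any official GPQA tooling, so verify them against the Hub before relying on this.

```python
from datasets import load_dataset  # pip install datasets

# Assumed layout: repo "Idavidrein/gpqa" with configs gpqa_extended / gpqa_main / gpqa_diamond,
# each exposed as a single "train" split. Check the Hub page before relying on these names.
diamond = load_dataset("Idavidrein/gpqa", "gpqa_diamond", split="train")

print(len(diamond))                   # expect 198 rows for the Diamond subset
print(diamond[0]["Question"][:200])   # assumed column name; peek at one prompt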

The size matters. With only 198 questions, a single problem is worth roughly half a percentage point. Run-to-run variance from temperature, prompt formatting, and answer-extraction logic can easily move the score by 2–3 points. Treat sub-3-point gaps as noise.
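
A back-of-the-envelope sketch makes the point (plain Python; the 80% score is just an illustrative input): question-sampling noise alone on 198 items spans several points, before temperature or answer extraction add anything on top.

```python
import math

def diamond_noise(accuracy: float, n_questions: int = 198) -> tuple[float, float]:
    """Points per question and an approximate 95% half-width for a GPQA Diamond score."""
    per_question = 100.0 / n_questions                       # each question is worth ~0.5 points
    se = math.sqrt(accuracy * (1 - accuracy) / n_questions)  # binomial standard error
    return per_question, 1.96 * se * 100                     # half-width in percentage points

pts, ci = diamond_noise(0.80)
print(f"one question ~ {pts:.2f} points; 95% interval ~ +/-{ci:.1f} points")
# one question ~ 0.51 points; 95% interval ~ +/-5.6 points
```

That interval is wider than the 2–3 point run-to-run swings mentioned above, which is exactly why those swings shouldn't be read as capability differences.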

How models are actually scored on GPQA

The grading itself is mechanical: the model picks A, B, C, or D, and the answer is either right or wrong. The interesting variation is in how the model is allowed to answer. Common configurations include:

  • Zero-shot, no chain-of-thought. The model answers directly. Lowest scores; rarely reported by frontier labs anymore.
  • Zero-shot with chain-of-thought. The model is asked to reason step-by-step before answering. The default for most modern reports.
  • Majority vote (maj@k). The model is sampled multiple times and the most common answer wins (sketched after this list). Inflates the score noticeably; usually disclosed as maj@8, maj@32, etc.
  • Test-time compute / extended thinking. Reasoning-tuned models (o-series, Claude with thinking, Gemini Thinking) get larger token budgets to deliberate. Often the highest-reported configuration.
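
To make the maj@k row concrete, here's a minimal sketch of what a majority-vote harness does per question. The `sample_model` callable and the regex-based letter extraction are stand-ins for whatever sampling client and answer parsing a real harness uses; neither is part of any official GPQA tooling.

```python
import re
from collections import Counter
from typing import Callable

def extract_choice(completion: str) -> str | None:
    """Pull the final A/B/C/D letter out of a free-form chain-of-thought answer."""
    letters = re.findall(r"\b([ABCD])\b", completion.upper())
    return letters[-1] if letters else None

def grade_maj_k(sample_model: Callable[[str], str], prompt: str,
                correct: str, k: int = 8) -> bool:
    """Sample the model k times, take the most common extracted letter, compare to the key."""
    votes = Counter()
    for _ in range(k):
        choice = extract_choice(sample_model(prompt))
        if choice is not None:
            votes[choice] += 1
    if not votes:
        return False
    predicted, _ = votes.most_common(1)[0]
    return predicted == correct
```

Single-pass chain-of-thought is the same loop with k=1, which is why a maj@32 number and a single-shot number from the same model aren't directly comparable.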

A “GPQA: 84%” headline can mean anything from a single-pass chain-of-thought run to a 32-sample majority vote with extended thinking. The number alone is meaningless without the configuration.

This is why cross-vendor comparisons need to be read carefully. When two labs report different numbers for the same model, the gap is often a methodology gap, not a capability gap.

What a GPQA score does and doesn’t tell you

Things a strong GPQA Diamond score genuinely signals:

  • Graduate-level scientific knowledge across multiple disciplines. The model has internalized concepts from molecular biology, organic chemistry, quantum mechanics, and more.
  • Multi-step reasoning under multiple-choice constraints. Most questions can’t be solved by pattern-matching the surface form of the prompt.
  • Resistance to plausible distractors. GPQA’s wrong answers are written by experts to be tempting — they’re the kinds of mistakes a smart non-expert would make.

What it doesn’t tell you:

  • Whether the model can do open-ended scientific reasoning. Multiple choice puts a ceiling on what the format can measure. Generating a hypothesis is harder than picking one.
  • Whether the model is contamination-free. GPQA was published in late 2023; questions and answers have circulated online since. Some leakage into training corpora is essentially guaranteed at this point.
  • Anything about non-science domains. GPQA covers biology, physics, and chemistry. Math, code, law, finance, and the humanities are all elsewhere — see SWE-Bench Verified for the equivalent number on coding agents.
  • Calibration or confidence. A model that gets 80% on GPQA can still be confidently wrong on the other 20%.

How to read GPQA on a leaderboard

Practical rules when comparing models on this metric:

  • Confirm the subset. Default to Diamond; flag any score that isn’t.
  • Check the configuration. A maj@32 number with extended thinking is not comparable to a single-shot chain-of-thought number from the same model.
  • Treat small gaps as noise. With 198 questions, ±2–3 points is statistical fuzz (a quick significance check follows this list).
  • Pair it with other reasoning benchmarks. GPQA correlates well with AIME 2025 and HLE on the frontier — divergence between them is a useful signal that something model-specific is going on.
  • Discount over time. As contamination grows, today’s GPQA score is a slightly weaker signal than it was a year ago. The benchmark hasn’t been refreshed.
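
To put a number on the "small gaps" rule, here's a rough two-proportion check. It treats the two runs as independent samples, which they aren't quite (a paired per-question test would be stricter), but per-question results are rarely published, so take it as an approximation.

```python
import math

def gap_z_score(acc_a: float, acc_b: float, n: int = 198) -> float:
    """Approximate z-score for the gap between two GPQA Diamond accuracies (unpaired)."""
    pooled = (acc_a + acc_b) / 2
    se = math.sqrt(2 * pooled * (1 - pooled) / n)  # standard error of the difference
    return (acc_a - acc_b) / se

print(round(gap_z_score(0.84, 0.81), 2))  # ~0.79, well short of the ~1.96 needed at 95%
```

On 198 questions, a gap needs to reach roughly seven or eight points before it's distinguishable from noise at the usual 95% level.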

What’s next for GPQA

GPQA worked exactly as intended for about eighteen months: models cleared the non-expert baseline (~34%) quickly, frontier reasoning models passed the in-domain expert level (~65%) in 2024, and several now sit comfortably above it. That's the saturation pattern every benchmark eventually hits.

The successors — Humanity’s Last Exam, FrontierMath, and a growing crop of agentic science benchmarks — are designed to push past the multiple-choice ceiling and to be harder to contaminate. Expect GPQA to stay on leaderboards as a familiar reference point even as the action moves elsewhere.

For now, GPQA Diamond remains the cleanest single number for “does this model reason like a graduate-level scientist.” If that’s the capability you’re shopping for, start with the GPQA column on the leaderboard — and read it with the configuration caveats above in mind. The benchmarks tag archive has the rest of the methodology series.