AIME 2025 Explained: The Math Benchmark for AI Reasoning
AIME 2025 is the high school math competition that frontier AI models now use as a contamination-resistant reasoning benchmark. Here's how to read the scores.
If you’ve watched any frontier model launch in the last year, you’ve seen the same chart: a bar labelled “AIME 2025” climbing past 90%. It’s the metric every reasoning-tuned model is graded on, and it’s the cleanest single number we have for “can this model do hard math.” But the score hides a lot — including why 2025 specifically matters and why a 3-point gap is almost never meaningful. This post explains what AIME is, why labs report it with a year suffix, and how to read it on the AI Models Benchmark leaderboard without overclaiming.
What AIME actually is
The American Invitational Mathematics Examination is a high school math competition administered by the Mathematical Association of America since 1983. It’s the second stage of the USA Math Olympiad selection process — only the top ~5% of students from the AMC 10/12 even qualify to take it.
The format is unusually friendly to automated grading:
- 15 questions, 3 hours.
- Each answer is an integer from 000 to 999. No multiple choice. No partial credit. The answer is right or it’s wrong.
- Two versions a year — AIME I in early February, AIME II about a week later. Most LLM evaluations use both, giving 30 problems total.
The questions span algebra, geometry, number theory, and combinatorics. They’re not designed to test exotic knowledge — they test whether you can reason through a multi-step problem and arrive at a single integer. That’s exactly the shape of evaluation an LLM grader can handle without subjective scoring, which is why it took over the reasoning column on every leaderboard.
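That grading shape is worth seeing concretely. Below is a minimal sketch of what an AIME-style autograder reduces to; the answer-extraction regex and the function names are illustrative, not taken from any particular evaluation harness:

```python
import re

def extract_aime_answer(completion: str) -> int | None:
    """Pull the last 1-3 digit integer out of a model completion.

    Real harnesses usually instruct the model to emit a fixed format
    (e.g. "Answer: 042"); taking the final standalone integer is a
    common fallback.
    """
    matches = re.findall(r"\b\d{1,3}\b", completion)
    return int(matches[-1]) if matches else None

def grade(completion: str, gold: int) -> bool:
    """Exact match against the integer key: right or wrong, no partial credit."""
    return extract_aime_answer(completion) == gold

def score(completions: list[str], golds: list[int]) -> float:
    """A 30-problem run is just the percentage of exact matches."""
    return 100 * sum(grade(c, g) for c, g in zip(completions, golds)) / len(golds)
```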
Why “2025” — the contamination story
The original AIME problem set has been online for decades, with full solutions, in dozens of languages. Any model trained on a recent web crawl has almost certainly memorized large parts of it. A score on “AIME” without a year is essentially a memorization test.
That’s why frontier labs report AIME 2025 specifically. The 2025 contests took place on February 6 and February 12, 2025, after the training cutoff of every model evaluated in early 2025. That makes the 30 problems a contamination-resistant test of reasoning rather than recall — at least until they’ve been online long enough to leak into the next training cycle.
This is also why you’ll see leaderboards refresh the year over time. AIME 2024 was the standard for most of 2024; AIME 2025 took over in early 2025; AIME 2026 will displace it as soon as the contest runs and labs validate clean scores. The year suffix is the benchmark’s contamination control, not a versioning quirk.
When you see “AIME 2025: 92%” on a model card, the score is making one specific claim: this model can solve hard, olympiad-track contest math problems it has almost certainly never seen.
How models are actually scored on AIME
The grading is the easy part — exact-match against an integer. The interesting variation is in how the model is allowed to answer:
- pass@1 — the model gets one attempt per problem. Cleanest, most honest, and what most production deployments will see.
- maj@k / cons@k — the model is sampled k times, and the most common answer wins. maj@32 and maj@64 are common in reasoning-model reports. Inflates scores noticeably and costs k× the tokens (both configurations are sketched after this list).
- Extended thinking budgets. Reasoning-tuned models (o-series, Claude with thinking, Gemini Thinking) get to spend tens of thousands of internal tokens deliberating per problem. Higher budgets = higher scores, with diminishing returns.
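The difference between these settings is easiest to see in code. Here is a minimal sketch, assuming you already have k sampled integer answers per problem; the function names are illustrative:

```python
from collections import Counter
from statistics import mean

def pass_at_1(samples_per_problem: list[list[int]], golds: list[int]) -> float:
    """Score using only the first sampled answer for each problem."""
    return 100 * mean(s[0] == g for s, g in zip(samples_per_problem, golds))

def maj_at_k(samples_per_problem: list[list[int]], golds: list[int]) -> float:
    """Score the most common answer across all k samples (majority vote)."""
    voted = [Counter(s).most_common(1)[0][0] for s in samples_per_problem]
    return 100 * mean(v == g for v, g in zip(voted, golds))

# Same model, same samples: maj@k typically lands a few points higher
# than pass@1, at k times the inference cost.
```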
A “92%” headline can mean very different things across these configurations. A pass@1 zero-shot number is not comparable to a maj@64 number with extended thinking on the same model. Vendors usually disclose the configuration in a footnote; cross-model comparisons that don’t match configuration are essentially apples-to-oranges.
There’s one statistical wrinkle worth flagging: with only 30 problems, every problem is worth ~3.3 points. Run-to-run variance from sampling temperature alone can move the score by 5+ points. Treat sub-5-point gaps as noise unless the configuration is identical and the run was repeated.
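The arithmetic behind that warning is plain binomial noise. A quick sketch, treating each of the 30 problems as an independent coin flip at a given true solve rate (the rates below are just examples):

```python
import math

def score_stddev(solve_rate: float, n_problems: int = 30) -> float:
    """Standard deviation of the reported score, in points out of 100,
    if each problem is an independent trial at the given solve rate."""
    return 100 * math.sqrt(solve_rate * (1 - solve_rate) / n_problems)

for p in (0.70, 0.80, 0.90):
    print(f"true solve rate {p:.0%}: +/- {score_stddev(p):.1f} points per run")
# true solve rate 70%: +/- 8.4 points per run
# true solve rate 80%: +/- 7.3 points per run
# true solve rate 90%: +/- 5.5 points per run
```

Even before configuration differences, a single 30-problem run carries a noise floor of several points, which is why repeated runs matter more than one headline number.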
What an AIME 2025 score does and doesn’t tell you
A strong AIME 2025 score genuinely signals:
- Multi-step mathematical reasoning. Most problems take 5–15 reasoning steps, often with a non-obvious key insight.
- Working with discrete structures. Combinatorics and number theory problems reward careful enumeration, which is exactly where weak reasoners fall apart.
- Reasoning under closed-form constraints. Knowing the answer must be an integer 000–999 disciplines the search — strong models exploit it; weak models don’t.
What it doesn’t tell you:
- Whether the model can do open-ended or proof-based math. AIME has integer answers; Putnam-style and proof-based olympiad benchmarks evaluate proof construction, which is a much harder regime, and FrontierMath pushes toward research-level problems well beyond contest difficulty.
- Anything about non-math reasoning. A model that aces AIME can still struggle with graduate-level science or real-world coding tasks — the skills overlap less than you’d think.
- Production behavior. Production users rarely give the model a 30,000-token thinking budget per query. Pass@1 numbers are closer to what you’ll actually deploy with.
- Long-horizon planning. AIME problems are bounded — the answer is always a few minutes of reasoning away. Multi-hour agent tasks are a different benchmark.
How to read AIME on a leaderboard
Practical rules when comparing models on this metric:
- Confirm the year. If the leaderboard says “AIME” with no suffix, ask. AIME 2024 numbers are not comparable to AIME 2025.
- Confirm the configuration. Pass@1 vs maj@k changes everything. Reasoning-budget settings change a lot.
- Treat 5-point gaps as noise. With 30 problems and sampling variance, gaps that size usually sit inside run-to-run noise (a quick significance check is sketched after this list).
- Look at the gap between pass@1 and maj@k. A model whose maj@64 is much higher than its pass@1 can find the right answer but can’t reliably commit to it in a single sample — useful information for production deployment.
- Pair it with GPQA and SWE-Bench. Strong reasoning models tend to climb all three together. Divergence is a useful signal that the model has a specific strength or weakness.
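If you want a quick sanity check on whether a gap between two reported scores clears that noise floor, here is a rough two-proportion sketch under the same independence assumption as above; the scores in the example are made up:

```python
import math

def gap_is_probably_noise(score_a: float, score_b: float, n_problems: int = 30) -> bool:
    """Crude check: is the gap within ~2 combined standard errors?"""
    pa, pb = score_a / 100, score_b / 100
    se = math.sqrt(pa * (1 - pa) / n_problems + pb * (1 - pb) / n_problems)
    return abs(pa - pb) < 2 * se

print(gap_is_probably_noise(92.0, 88.0))  # True: 4 points on 30 problems is noise
print(gap_is_probably_noise(92.0, 70.0))  # False: this gap survives the noise floor
```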
What’s next
AIME’s role on leaderboards is secure for as long as the annual refresh keeps it ahead of contamination. The bigger question is what comes next when the frontier saturates AIME entirely — pass@1 scores in the high 90s on a 30-problem test mean the benchmark has effectively run out of headroom. The successors already exist: HMMT 2025, FrontierMath, and Putnam-AXIOM push into harder territory, with FrontierMath reaching research-level problems far beyond contest difficulty, while fully proof-based evaluation remains the regime that exact-match graders can’t score automatically.
For now, AIME 2025 remains the cleanest single number for “does this model reason mathematically.” If that’s the capability you care about, start with the AIME column on the leaderboard — and read it with the configuration caveats above in mind. The benchmarks tag archive collects the rest of the methodology series.