benchmarks
HLE Explained: Humanity's Last Exam for AI Models
Humanity's Last Exam is a 3,000-question benchmark designed to outlast frontier AI models. Here's what HLE actually tests and how to read the score.
- #benchmarks
- #hle
- #reasoning
4 posts
Humanity's Last Exam is a 3,000-question benchmark designed to outlast frontier AI models. Here's what HLE actually tests and how to read the score.
SWE-Bench Verified is the benchmark that grades AI coding agents on real GitHub issues. Here's what it tests, what it misses, and how to read the scores.
AIME 2025 is a high school math competition whose problems are now widely used as a contamination-resistant reasoning benchmark for frontier AI models. Here's how to read the scores.
GPQA is a graduate-level science benchmark designed to be unsolvable by Google search alone. Here's what the score actually means and how to use it.