SWE-Bench Verified: How AI Coding Agents Are Measured
SWE-Bench Verified is the benchmark that grades AI coding agents on real GitHub issues. Here's what it tests, what it misses, and how to read the scores.
Pick any leaderboard for AI coding ability and one column keeps showing up: SWE-Bench Verified. It’s the closest thing the field has to a shared scoreboard for agents that touch real codebases — but the score is easy to misread. This post covers what the benchmark actually measures, where it’s strict, where it’s generous, and how to use the numbers on the AI Models Benchmark leaderboard without overclaiming.
What SWE-Bench Verified actually is
SWE-Bench Verified is a 500-problem subset of the original SWE-Bench dataset, hand-curated by human annotators at OpenAI in August 2024 to fix systematic flaws in the original 2,294-task benchmark. The original SWE-Bench paper (Jimenez et al., 2023) sourced real bug reports and feature requests from 12 popular Python repositories — Django, scikit-learn, sympy, requests, and others — turning each one into an evaluation task: given the repo at the commit just before the fix and the natural-language issue description, produce a patch that makes the hidden test cases pass.
The Verified subset exists because researchers and vendors found the full benchmark contained problems that were unsolvable for reasons unrelated to model capability: tests that depended on environment setup the agent couldn’t see, issue descriptions that left out critical information, or test files that asserted behavior orthogonal to what the issue actually asked for. OpenAI hired professional Python developers to manually review every task and keep only the ones a competent human engineer could plausibly solve from the information provided.
The result: 500 tasks where a passing score is a meaningful signal of coding ability rather than a measure of how well the model guesses underspecified intent.
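To make the task format concrete, here is a minimal sketch of pulling the Verified subset with the Hugging Face `datasets` library and inspecting one instance. The dataset identifier and field names follow the public release as I understand it; treat the exact schema as an assumption to check against the version you download.

```python
from datasets import load_dataset

# Public release of the Verified subset on the Hugging Face Hub
# (dataset id and field names assumed from the published release).
tasks = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

print(len(tasks))  # 500 curated instances

example = tasks[0]
print(example["repo"])               # source repository, e.g. a Django or astropy issue
print(example["base_commit"])        # repo state just before the human fix
print(example["problem_statement"])  # the original GitHub issue text
print(example["FAIL_TO_PASS"])       # hidden tests the patch must make pass
```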
How the evaluation loop works
Each task hands the agent:
- A snapshot of the repository at the commit immediately before the bug fix
- The text of the original GitHub issue
- A Docker environment where the code can be installed and tested
The agent’s job is to produce a unified diff that, when applied, makes the hidden test cases pass without breaking any existing ones. Grading is binary per task — resolved or not resolved — and the headline score is the percentage of the 500 problems an agent resolves.
Crucially, the agent does not see the test cases. It has to infer the correct behavior from the issue description and the surrounding code. That’s the whole point: production engineers don’t get to read the test before fixing the bug either.
```
issue description + repo @ pre-fix commit
                    │
                    ▼
          agent loop ──▶ unified diff
                    │
                    ▼
        apply patch + run hidden tests
                    │
                    ▼
               resolved? (0 / 1)
```
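In code, the grading step reduces to "apply the patch, run the designated tests, record a boolean." The sketch below is a simplified stand-in for the official Docker-based harness, not a reproduction of it: it assumes the repo is already checked out and installed, that the dataset's `FAIL_TO_PASS` / `PASS_TO_PASS` fields are JSON-encoded lists of test identifiers, and that plain `pytest` can run them (the real harness uses per-repo test commands inside containers).

```python
import json
import subprocess

def evaluate_instance(task: dict, model_patch: str, repo_dir: str) -> bool:
    """Apply a candidate patch and re-run the tests that define 'resolved'.

    Simplified stand-in for the official Dockerized harness: assumes repo_dir
    is already checked out at task["base_commit"] with dependencies installed.
    """
    # 1. Apply the agent's unified diff to the pre-fix checkout.
    apply = subprocess.run(
        ["git", "apply", "-"], input=model_patch, text=True, cwd=repo_dir
    )
    if apply.returncode != 0:
        return False  # a patch that doesn't apply counts as not resolved

    # 2. The hidden tests: ones that must flip to passing, plus ones that
    #    must keep passing (regressions also count as failure).
    #    Assumes both fields are JSON-encoded lists in the dataset release.
    tests = json.loads(task["FAIL_TO_PASS"]) + json.loads(task["PASS_TO_PASS"])

    # 3. Binary outcome: every designated test must pass.
    result = subprocess.run(["python", "-m", "pytest", *tests], cwd=repo_dir)
    return result.returncode == 0

# Headline score: fraction of the 500 instances resolved.
# score = sum(evaluate_instance(t, patch_for[t["instance_id"]], dir_for[t["instance_id"]])
#             for t in tasks) / len(tasks)
```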
What the SWE-Bench Verified score does and doesn’t tell you
A high SWE-Bench Verified score signals real things:
- The model can navigate a codebase it didn’t write. Most tasks require reading multiple files to find the actual bug.
- It can produce syntactically valid, minimally invasive patches. Fixes that resolve the issue but break other tests don’t count.
- It can reason about hidden constraints. Tests verify behavior the issue description doesn’t always spell out.
But the score also leaves a lot uncovered:
- Python only. Every task is from a Python repo. JavaScript, Go, Rust, and the long tail of polyglot enterprise stacks are absent.
- Issue-to-patch only. No greenfield design, no architecture decisions, no debugging from a stack trace alone.
- Scaffolding matters as much as the model. Headline scores are usually “model X with agent harness Y.” Swap the harness and the score can move 10–20 points.
- Test coverage is uneven. Some tasks have a single asserting test; others have dozens. A patch that’s “almost right” gets the same zero as a patch that’s nonsense.
A 70% on SWE-Bench Verified does not mean the model resolves 70% of the bugs in your codebase. It means it resolves 70% of a specific set of well-curated Python issues, with a specific harness, in a specific evaluation environment.
Why “with scaffolding” matters
When you read that a model scores X% on SWE-Bench Verified, the score is almost always a tuple: (model, agent harness, prompt, allowed turns, allowed tools). Anthropic, OpenAI, and the open-source community all publish numbers using their own harnesses. Cross-vendor comparisons are imperfect because the harness handles things like:
- File search and code navigation tools
- The retry policy when a patch fails
- How long the agent is allowed to plan before acting
- Whether the agent can run the existing tests during the loop
The fairest comparisons fix the harness and vary the model. The official SWE-Bench leaderboard publishes both “with harness” and “raw model” tracks, and the AI Models Benchmark table reports the headline number vendors publish — which is generally the best agentic configuration they’ve found.
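One way to keep that in mind when reading published numbers is to treat every score as attached to a configuration record rather than to a model name. The dataclass below is purely illustrative; the field names mirror the knobs listed above and are not any official reporting schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReportedScore:
    """What a 'SWE-Bench Verified: X%' claim actually pins down.

    Illustrative only: the fields mirror the harness knobs discussed above,
    not an official reporting format.
    """
    model: str            # the checkpoint being advertised
    harness: str          # agent scaffold (file search, editing tools, ...)
    max_turns: int        # how long the agent may iterate before it must stop
    can_run_tests: bool   # whether existing tests are runnable inside the loop
    resolved_rate: float  # the headline percentage

# Two claims about the same model are only comparable if the rest of the
# tuple matches too.
a = ReportedScore("model-x", "vendor-harness", max_turns=100, can_run_tests=True, resolved_rate=0.68)
b = ReportedScore("model-x", "open-source-harness", max_turns=30, can_run_tests=False, resolved_rate=0.55)
```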
How to read the numbers on a leaderboard
A few practical rules when comparing models on this metric:
- Treat single-digit gaps as noise. Run-to-run variance and harness differences eat small leads; the quick arithmetic after this list shows that sampling error alone covers several points.
- Look at the trend, not the absolute level. The frontier has moved from roughly 12% in early 2024 to over 70% in the strongest 2025 configurations. The trajectory is informative even when individual scores aren’t strictly comparable.
- Pair the score with cost. A model that hits 65% at $3/M tokens is often a better agent than one that hits 68% at $15/M; the back-of-the-envelope sketch after this list makes that concrete. The pricing columns on the leaderboard matter here.
- Don’t ignore newer variants. SWE-Bench Live and SWE-Bench Multilingual address the staleness and language-coverage gaps. They’ll matter more over the next year.
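The "noise" rule is not just intuition. With 500 binary-graded tasks, ordinary sampling error alone is a couple of points. The sketch below works through the arithmetic under the usual binomial and independence assumptions; the 70% resolve rate is just a representative figure.

```python
from math import sqrt

n = 500   # tasks in SWE-Bench Verified
p = 0.70  # a representative frontier resolve rate

# Standard error of a single reported score (binomial approximation).
se_single = sqrt(p * (1 - p) / n)   # ≈ 0.020, i.e. about 2 points

# Standard error of the gap between two independent scores near 70%.
se_gap = sqrt(2) * se_single        # ≈ 0.029, i.e. about 3 points

# Rough 95% intervals: gaps inside them are indistinguishable from
# sampling noise, before harness differences even enter the picture.
print(f"single score: ±{1.96 * se_single * 100:.1f} points")        # ≈ ±4.0
print(f"gap between two scores: ±{1.96 * se_gap * 100:.1f} points")  # ≈ ±5.7
```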
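The cost rule becomes concrete once you divide spend by outcomes rather than by tokens. The numbers below are hypothetical stand-ins (real token counts per agentic attempt vary enormously by harness); the point is the shape of the calculation, not the specific figures.

```python
def cost_per_resolved_task(price_per_mtok: float, tokens_per_attempt: float,
                           resolve_rate: float) -> float:
    """Rough expected dollars spent per successfully resolved issue."""
    cost_per_attempt = price_per_mtok * tokens_per_attempt / 1_000_000
    return cost_per_attempt / resolve_rate

# Hypothetical comparison: cheaper-but-weaker vs pricier-but-stronger.
# Assume ~500k tokens per agentic attempt for both (a made-up figure).
cheap = cost_per_resolved_task(price_per_mtok=3.0, tokens_per_attempt=500_000, resolve_rate=0.65)
strong = cost_per_resolved_task(price_per_mtok=15.0, tokens_per_attempt=500_000, resolve_rate=0.68)

print(f"${cheap:.2f} vs ${strong:.2f} per resolved task")  # ≈ $2.31 vs $11.03
```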
What’s next for the benchmark
SWE-Bench Verified solved the “is the task solvable?” problem but not the “is the task representative?” problem. The community is iterating: SWE-Bench Live continuously refreshes the task pool to defeat training-set contamination, and SWE-Bench Multilingual extends evaluation beyond Python. Expect the headline metric you watch a year from now to be a successor, not Verified itself.
Until then, Verified is the best general-purpose number we have for “can this model fix bugs in real code.” If you’re choosing a model for code-heavy agentic work, start with the SWE-Bench column on the leaderboard — and read it with the caveats above in mind. For more methodology notes and benchmark explainers, the benchmarks tag archive collects them in one place.