Anthropic Just Took the Lead Back — Claude Opus 4.7 Crosses 87% on SWE-bench, and the Numbers Tell a Cleaner Story Than the Hype
For once, the "lead retaken" headline survives the benchmark math. Opus 4.7 lifts SWE-bench Verified from 80.8 to 87.6 percent, ahead of Gemini 3.1 Pro at 80.6, and pulls clear on the metrics that matter to teams shipping code.
⚡ How This Impacts You
The strongest generally available model for coding, vision, and knowledge work, and the first time a frontier lab has openly admitted that what it shipped is not the best it has built.
FLASHFEED Desk · Updated: 17 Apr 2026, 23:20:56 · 4 min read
Claude Opus 4.7 narrowly retakes the throne for the most powerful generally available large language model, and unlike most "lead retaken" press cycles, the benchmark math actually supports the headline. SWE-bench Verified climbs from 80.8 to 87.6 percent, a nearly seven-point gain that puts it ahead of Gemini 3.1 Pro at 80.6. SWE-bench Pro, the harder multi-language coding test, jumps from 53.4 to 64.3. These are not marginal improvements. They are the difference between a model that handles common engineering tasks and a model that handles the messy ones the previous generation regularly stumbled on.
Compare the trajectory with its predecessor's. Opus 4.6 was already the best general-purpose model for agentic coding work in late 2025; the gap to GPT-5.4 was real but contestable. Opus 4.7 widens that gap on the benchmarks that track real shipping work: SWE-bench, MCP-Atlas at 77.3 percent for multi-tool orchestration, and a vision benchmark that climbs from 57.7 to 79.5 percent for visual navigation without tools. Each of those numbers, taken alone, is a normal generational improvement. Taken together, they describe a model that is meaningfully more useful than what came before.
The most underdiscussed metric is GDPVal-AA, the knowledge-work evaluation. Opus 4.7 leads at an Elo of 1753, with GPT-5.4 at 1674 and Gemini 3.1 Pro at 1314. That spread is not a benchmark artifact — it reflects what real users keep observing in side-by-side comparisons. Where coding benchmarks measure what models can do, GDPVal-AA measures what they actually do for the kind of professional work people pay for. The 79-point Elo gap to GPT-5.4 corresponds to roughly a 60-percent win rate in head-to-head matches. The 439-point gap to Gemini 3.1 Pro is, in this kind of evaluation, a generational distance.
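Those win-rate claims are easy to sanity-check. Under the standard Elo model, a rating gap converts to an expected head-to-head win probability via a logistic curve with a 400-point scale. A minimal sketch, assuming GDPVal-AA uses conventional Elo scaling (the release materials do not spell this out):

```python
def elo_win_rate(rating_gap: float) -> float:
    """Expected win rate for the higher-rated model under standard Elo
    (logistic curve with a 400-point scale)."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))

# Opus 4.7 (1753) vs GPT-5.4 (1674): a 79-point gap
print(f"{elo_win_rate(1753 - 1674):.1%}")  # 61.2%, i.e. "roughly 60 percent"

# Opus 4.7 (1753) vs Gemini 3.1 Pro (1314): a 439-point gap
print(f"{elo_win_rate(1753 - 1314):.1%}")  # 92.6%
```

Run the numbers and the framing holds: a 79-point gap is a solid but contestable edge, while a 439-point gap means the lower-rated model wins fewer than one match in ten.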
Anthropic also conceded something rare in this release: that Opus 4.7 still falls short of its unreleased Mythos preview, available only to a handpicked group of customers. That candor is the part of this launch most worth reading. It signals that the public model is no longer the bleeding edge of what a frontier lab can ship, and that the next public release will likely close the gap. For developers, builders, and the broader market that depends on the strongest available model, Opus 4.7 is the new floor. The ceiling is now closer than it has ever been.