SWE-bench Verified: The Real-World GitHub Issue Test
SWE-bench Verified is the benchmark that matters most for production coding use cases. Unlike HumanEval, which asks a model to complete a Python function stub in isolation, SWE-bench tests whether a model can resolve real GitHub issues against a real codebase, with full file trees, failing tests, and no function stub to fill in. The Verified subset filters out tasks with flawed or ambiguous ground truth, making it the most reliable signal of actual autonomous coding capability.
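To make "resolve" concrete, here is a minimal sketch of the pass/fail criterion, assuming the usual SWE-bench convention that each task lists tests expected to go from failing to passing and tests that must keep passing. The names `TaskResult`, `is_resolved`, and `resolution_rate` are hypothetical and are not part of the official harness.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Test outcomes for one task after the model's patch is applied (hypothetical type)."""
    fail_to_pass: dict[str, bool]  # tests that failed before the fix; True = passes now
    pass_to_pass: dict[str, bool]  # tests that passed before the fix; True = still passes


def is_resolved(result: TaskResult) -> bool:
    # Resolved only if every issue-related test now passes and nothing regressed.
    # Treating an empty fail_to_pass set as unresolved is an assumption of this sketch.
    return (
        bool(result.fail_to_pass)
        and all(result.fail_to_pass.values())
        and all(result.pass_to_pass.values())
    )


def resolution_rate(results: list[TaskResult]) -> float:
    # Fraction of tasks resolved; this is the headline percentage vendors report.
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)
```

The point of the sketch is that partial credit does not exist: a patch that fixes the issue but breaks one previously passing test counts as a failure.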
As of June 2026, GPT-5 and Claude Opus 4.1 are effectively tied on SWE-bench Verified, with both vendors reporting resolution rates in the low 70s (percent). OpenAI's GPT-5 system card documents a score of ~72% under OpenAI's own agent scaffolding. Anthropic's model card for Claude Opus 4, the predecessor to Opus 4.1, reports ~72.5% on the same benchmark. Each figure is measured under scaffolding chosen by the provider, so the comparison is not apples-to-apples: the agent harness, tool set, and number of attempts all differ between providers and are not fully disclosed.
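A rough way to see why a half-point gap should not be read as a ranking: the sketch below estimates sampling noise on a pass rate, assuming the Verified subset's 500 tasks and treating each score as an independent binomial sample. That independence assumption is a simplification, since both models are scored on the same tasks, and the function names are illustrative only.

```python
import math

N_TASKS = 500  # size of the SWE-bench Verified subset


def margin_of_error(rate: float, n: int = N_TASKS, z: float = 1.96) -> float:
    """Half-width of a normal-approximation 95% interval for a pass rate."""
    return z * math.sqrt(rate * (1.0 - rate) / n)


def gap_within_noise(rate_a: float, rate_b: float,
                     n: int = N_TASKS, z: float = 1.96) -> bool:
    """True when the gap is smaller than z standard errors of the difference."""
    se_diff = math.sqrt(
        rate_a * (1.0 - rate_a) / n + rate_b * (1.0 - rate_b) / n
    )
    return abs(rate_a - rate_b) < z * se_diff


if __name__ == "__main__":
    # Scores as quoted in the text above, written as fractions.
    print(f"margin at 72%: +/-{margin_of_error(0.72):.3f}")
    print("72% vs 72.5% within noise:", gap_within_noise(0.72, 0.725))
```

Under these assumptions the uncertainty on each score is several percentage points, far larger than the gap between the two reported figures, and that is before accounting for the scaffolding differences.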
The practical implication: on multi-file, multi-step coding tasks (the kind you'd actually ship an agentic coding assistant to handle), both models are operating at roughly the same capability frontier. The differentiators for your specific use case will be context window (Claude Opus 4.1's 200k tokens vs GPT-5's 128k, which matters for large repos), latency (GPT-5 is generally faster to first token), and cost (Claude Opus 4.1 is 50% more expensive than GPT-5 on output tokens). Gemini 2.5 Pro trails on SWE-bench Verified at approximately 63% but compensates with a 1M-token context window that neither of the other two matches, which is relevant for monorepo work where the whole codebase needs to fit in context.
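For the context-window differentiator, a back-of-the-envelope check can be sketched as follows. It uses the window sizes quoted above and assumes roughly four characters per token, a common heuristic that varies by tokenizer and language. The `reserve_tokens` headroom for instructions and model output is likewise an illustrative assumption, as are the function names.

```python
import math

# Context windows in tokens, as quoted in the text above.
CONTEXT_WINDOWS = {
    "Claude Opus 4.1": 200_000,
    "GPT-5": 128_000,
    "Gemini 2.5 Pro": 1_000_000,
}

CHARS_PER_TOKEN = 4  # rough heuristic (assumption); real tokenizers differ


def estimate_tokens(num_chars: int) -> int:
    """Approximate token count for a given number of source characters."""
    return math.ceil(num_chars / CHARS_PER_TOKEN)


def models_that_fit(repo_chars: int, reserve_tokens: int = 16_000) -> list[str]:
    """Models whose window holds the whole repo plus headroom for the prompt and reply."""
    needed = estimate_tokens(repo_chars) + reserve_tokens
    return [name for name, window in CONTEXT_WINDOWS.items() if window >= needed]


if __name__ == "__main__":
    # A repository with about 600k characters of source.
    print(models_that_fit(600_000))
```

For anything close to a boundary, count tokens with the provider's own tokenizer instead of relying on the heuristic.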