Software That Now Outperforms Humans: Seven Domains AI Quietly Surpassed Us in Q1 2026
Triumphalist 'AI superhuman' headlines are cheap. Concrete benchmark numbers are not. Below are seven 2026 domains where AI now beats the median credentialed human on a public, reproducible test — with the actual scores, the actual models, and the caveats that matter.
1. Competition mathematics — IMO gold
Domain: International Mathematical Olympiad. Six problems over two 4.5-hour sessions, each problem scored 0–7.
Result: An advanced version of Gemini Deep Think solved 5 of the 6 problems perfectly at IMO 2025, scoring 35 points — solid gold-medal territory. Critically, Gemini operated end-to-end in natural language, producing rigorous proofs from the official problem statements within the same per-session time limit the human contestants get.
Caveat: Gold-medal humans are teenagers at the end of years of dedicated training. Gemini Deep Think ran in a "thinking" mode that consumes orders of magnitude more compute than a single forward pass. Both facts are true.
2. Software engineering — SWE-bench Verified
Domain: SWE-bench Verified is the human-screened subset of SWE-bench: reviewers confirmed that each issue has a well-specified, testable resolution. The benchmark scores whether an agent can fix real GitHub issues end-to-end.
Result: Claude Opus 4.7 hit 87.6% on SWE-bench Verified (up from Opus 4.6's 80.8%) and 64.3% on the harder SWE-bench Pro — versus 57.7% for GPT-5.4 and 54.2% for Gemini 3.1 Pro. On Cursor's internal CursorBench it jumped from 58% to 70%.
Why it matters: Cognition's Devin team reported that the same model can run "for hours" coherently, recovering from tool failures that would have stopped earlier models cold. This is the first quarter in which mid-tier engineering tickets (touch a file, write a test, fix the bug, rerun the test) are handled end-to-end without supervision for a measurable share of issues.
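To make the task concrete, here is a minimal sketch of pulling the benchmark itself, assuming the `datasets` package and the public `princeton-nlp/SWE-bench_Verified` dataset ID; field names follow the published schema and are worth re-checking against the dataset card.

```python
# Sketch: inspect SWE-bench Verified instances from the Hugging Face hub.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(ds), "verified issues")        # 500 instances per the dataset card

row = ds[0]
print(row["repo"], row["instance_id"])   # which real repository and issue this is
print(row["problem_statement"][:300])    # the GitHub issue text the agent sees
# FAIL_TO_PASS lists the tests a correct patch must flip from failing to passing.
print(row["FAIL_TO_PASS"])
```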
3. Protein design — RFdiffusion3 and Boltz-2
Domain: de novo protein binder design and structure prediction.
Result:
- RFdiffusion3 (December 2025) runs roughly 10× faster than the original RFdiffusion and operates at atom-level precision. Roughly half of unconditional designs express in soluble form when validated experimentally — a number that would have been called impossible in 2022.
- Boltz-2 (June 2025) can co-fold a protein–ligand pair and output both the 3D complex and a binding-affinity estimate in about 20 seconds on a single GPU. The previous state of the art took 6–12 hours per pair.
- AlphaFold 3 models complexes containing proteins, nucleic acids, small molecules, ions, and modified residues — the full molecular machinery of cells, not just protein-only structures.
Why it matters: The Boltz project now counts 1,300+ Slack community members and 200+ biotech adopters. This is no longer a research curiosity; it's the actual screening tier for early-stage drug discovery.
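To show where a 20-second predictor changes workflows, here is a minimal sketch of that screening tier; `predict_structure_and_affinity` is a hypothetical stand-in for a co-folding model call, not a real Boltz API, and the dummy values exist only so the loop runs.

```python
# Sketch of a screening-tier loop: co-fold each candidate ligand against a
# target and keep only the best predicted binders for wet-lab validation.
from dataclasses import dataclass

@dataclass
class Prediction:
    complex_pdb: str   # predicted 3D structure of the protein-ligand complex
    affinity: float    # predicted binding-affinity score (lower = tighter here)

def predict_structure_and_affinity(protein_seq: str, ligand_smiles: str) -> Prediction:
    # HYPOTHETICAL placeholder: returns a deterministic dummy so the loop runs.
    fake = (hash((protein_seq, ligand_smiles)) % 1000) / 100.0
    return Prediction(complex_pdb="", affinity=fake)

def screen(protein_seq: str, ligands: list[str], keep: int = 10) -> list[tuple[str, float]]:
    scored = [(s, predict_structure_and_affinity(protein_seq, s).affinity)
              for s in ligands]          # ~20 s per pair, per the article
    # Rank by predicted affinity; only the top candidates go to the wet lab.
    return sorted(scored, key=lambda t: t[1])[:keep]

print(screen("MKTAYIAK", ["CCO", "c1ccccc1O", "CC(=O)O"], keep=2))
```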
4. Weather forecasting — GraphCast and successors
Domain: 10-day global weather forecasts.
Result: DeepMind's GraphCast (and its 2025 successor with finer resolution) outperforms the European Centre for Medium-Range Weather Forecasts' HRES — the gold-standard physics-based model — on 90%+ of verification targets across the 10-day range. The AI model runs on a single TPU in under a minute, versus hours of supercomputer time for HRES.
Why it matters: ECMWF and the UK Met Office now run AI models in production alongside physics-based ones. This is the rare case where AI is faster and more accurate.
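The headline claim reduces to simple arithmetic: per-variable error against the verifying analysis, then a win count. A toy sketch with synthetic arrays (real evaluations weight by latitude and stratify by lead time and pressure level):

```python
# Sketch: the comparison behind "outperforms HRES on 90%+ of targets".
import numpy as np

rng = np.random.default_rng(0)
truth = rng.normal(size=(50, 3))                         # (grid points, variables): verifying analysis
ai    = truth + rng.normal(scale=0.8, size=truth.shape)  # AI forecast with smaller errors
hres  = truth + rng.normal(scale=1.0, size=truth.shape)  # physics-model forecast

def rmse(forecast, analysis):
    # Root-mean-square error per variable, averaged over grid points.
    return np.sqrt(np.mean((forecast - analysis) ** 2, axis=0))

wins = rmse(ai, truth) < rmse(hres, truth)
print(f"AI wins on {wins.mean():.0%} of variables")
```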
5. Radiology — narrow tasks
Domain: chest-CT lung-nodule detection, mammogram BI-RADS classification, brain-MRI tumor segmentation.
Result: Multiple FDA-cleared models (Aidoc, Lunit, Annalise.ai) match or exceed median radiologist performance on specific narrow tasks, and reduce miss rates when used as a second reader. The 2025 NHS UK breast-screening trial confirmed the second-reader improvement holds in real clinical workflow.
Caveat: "Match a radiologist" on a narrow task is not the same as "be a radiologist." The current state is human-AI collaboration outperforming either alone — which is still a real productivity win.
6. Formal theorem proving — AlphaProof + Lean 4
Domain: proving olympiad-level mathematical statements as machine-verified Lean 4 proofs.
Result: AlphaProof (Google DeepMind) produced formal proofs for three of the six IMO 2024 problems (two algebra, one number theory), including the hardest problem in the contest, fully solved by only five human contestants. The system trains via reinforcement learning, generating millions of problem variations and attempting to prove or disprove them.
Why it matters: Formal proofs are the gold standard of mathematical rigor: once the kernel accepts them, the only things left to trust are the small proof checker itself and the faithfulness of the formal statement. AlphaProof shifted the question from "can AI do math" to "can AI write proofs that survive a machine checker." It now can.
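For readers who have not seen one, here is what "survives a machine checker" means in toy form (Lean 4, core library only; successful compilation is the verification):

```lean
-- A toy kernel-checked proof in Lean 4. If this file compiles, the
-- statement is verified end-to-end; AlphaProof's olympiad proofs carry
-- the same guarantee at far higher difficulty.
theorem add_comm_demo (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A small case-split proof, closer in flavor to olympiad bookkeeping.
theorem pred_succ (n : Nat) (h : n ≠ 0) : (n - 1) + 1 = n := by
  cases n with
  | zero   => exact absurd rfl h
  | succ k => rfl
```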
7. Strategic games — DeepMind's continuing dominance
Domain: Go, chess, Dota 2, StarCraft II, no-limit poker — all long-since superhuman. The 2026 update is Diplomacy, the seven-player negotiation game.
Result: agents that combine an LLM negotiator with a classical search-based strategy module now beat human champions in the no-press variant and play at expert level in full-press.
Why it matters: the combination is the interesting part — the LLM handles persuasion in natural language, the search module handles tactics, as in the sketch below.
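A minimal sketch of that division of labor, with hypothetical interfaces standing in for a real Diplomacy engine (no real API is implied):

```python
# Sketch: search picks the moves, the LLM writes the messages.
from dataclasses import dataclass

@dataclass
class GameState:
    board: dict[str, list[str]]   # unit positions per power
    messages: list[str]           # negotiation history this turn

def search_best_orders(state: GameState) -> list[str]:
    # Tactics (stub): published hybrids use equilibrium / regret-minimization
    # search here; this placeholder just returns a fixed order set.
    return ["A PAR - BUR"]

def llm_negotiate(state: GameState, intent: str) -> str:
    # Persuasion (stub): an LLM drafts messages consistent with planned moves.
    return f"France here: I intend {intent}; shall we demilitarize the Channel?"

def play_turn(state: GameState) -> tuple[str, list[str]]:
    orders = search_best_orders(state)                  # search decides what to do
    message = llm_negotiate(state, ", ".join(orders))   # LLM decides what to say
    return message, orders

msg, orders = play_turn(GameState(board={"FRA": ["A PAR"]}, messages=[]))
print(msg)
print(orders)
```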
What ties them together
Every benchmark above shares a structural pattern:
- The task is well-specified (clear inputs, clear scoring).
- There's a way to generate a lot of synthetic training data (self-play, simulator, formal verifier).
- An RL or search-on-top-of-LLM loop closes the gap to human expert performance.
This is the recipe. It generalizes to any domain that fits all three. It does not generalize to ill-specified, poorly-scored tasks — which is why "AI replaces a generalist office worker" remains contested while "AI passes IMO" is now boring.
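The recipe fits in a screenful. A sketch with a hypothetical `model` object (any RL-tuned policy) and trivial arithmetic standing in for the task family:

```python
# Sketch: the three-ingredient loop. Ingredient 2 generates tasks, ingredient 1
# scores them with a hard verifier, ingredient 3 reinforces verified wins.
import random

def generate_task():
    # Ingredient 2: unlimited synthetic tasks (arithmetic as a stand-in).
    a, b = random.randint(0, 99), random.randint(0, 99)
    return f"{a}+{b}", a + b

def verify(answer: str, truth: int) -> bool:
    # Ingredient 1: well-specified scoring: exact check, no human judgment.
    try:
        return int(answer) == truth
    except ValueError:
        return False

def training_step(model, batch_size: int = 64) -> float:
    rewards = []
    for _ in range(batch_size):
        prompt, truth = generate_task()
        answer = model.sample(prompt)              # hypothetical: draw an attempt
        rewards.append(1.0 if verify(answer, truth) else 0.0)
    model.update(rewards)                          # hypothetical: reinforce wins
    return sum(rewards) / batch_size               # ingredient 3 closes the loop

class RandomModel:
    # Trivial stand-in policy so the loop executes; a real system is an LLM.
    def sample(self, prompt: str) -> str:
        return str(random.randint(0, 198))
    def update(self, rewards: list[float]) -> None:
        pass

print(training_step(RandomModel()))
```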
What it doesn't yet do
The pattern cuts both ways. Work without a clean verifier (holding a full radiology caseload rather than reading one scan type, or the open-ended judgment of a generalist office worker) offers no way to generate scored training data at scale, and that is exactly where the benchmarks above stop applying.
The forward-looking benchmarks to watch
- FrontierMath — graduate-level open math problems. Q1 2026 best is still ~30%; this is the next IMO-grade target.
- Humanity's Last Exam — a multidisciplinary expert benchmark. Top scores rose from ~10% in mid-2024 to ~45% in early 2026. Expect a 60%+ score by year-end.
- SWE-bench Multimodal — adds UI screenshots to the bug-fix task. A wide gap currently separates text-only and screenshot-aware agents.
- AlphaFold 4 rumors — supposed to model conformational dynamics, not just static structures. Would be the next biotech step-change.
Sources & further reading
- Claude Opus 4.7 Benchmarks Explained — Vellum
- Gemini Deep Think IMO gold-medal — Google DeepMind
- Olympiad-level formal mathematical reasoning — Nature
- AlphaFold 3 — Oxford Precision Clinical Medicine
- Protein engineering with AI: OpenFold3 vs Boltz 2 vs AlphaFold 3
- 6 ways AI reshaped scientific software in 2025 — R&D World
Last reviewed Apr 27, 2026. AI Pulled News is editorial; corrections welcome at /news/about.html.