Software That Now Outperforms Humans: Seven Domains AI Quietly Surpassed Us in Q1 2026

AI Innovation Published Apr 27, 2026 · benchmarks · alphaproof · claude opus 4.7 · swe-bench · alphafold

Triumphalist 'AI superhuman' headlines are cheap. Concrete benchmark numbers are not. Below are seven 2026 domains where AI now beats the median credentialed human on a public, reproducible test — with the actual scores, the actual models, and the caveats that matter.

1. Competition mathematics — IMO gold

Domain: International Mathematical Olympiad. Six problems, 4.5 hours, scored 0–7 each.

Result: An advanced version of Gemini Deep Think solved 5 out of 6 problems perfectly at IMO 2025, scoring 35 points — solid gold-medal territory. Critically, Gemini operated end-to-end in natural language, producing rigorous proofs from the official problem statements within the contest's 4.5-hour time limit.

Caveat: Gold-medal humans are 17–18-year-olds with months of training. Gemini Deep Think used "thinking" mode that consumes orders of magnitude more compute than a single forward pass. Both facts are true.

2. Software engineering — SWE-bench Verified

Domain: SWE-bench Verified is the cleaned-up version of SWE-bench, where human Princeton/Stanford reviewers confirm each issue has a correct, testable resolution. The benchmark scores whether an agent can fix real GitHub issues end-to-end.

Result: Claude Opus 4.7 hit 87.6% on SWE-bench Verified (up from Opus 4.6's 80.8%) and 64.3% on the harder SWE-bench Pro — versus 57.7% for GPT-5.4 and 54.2% for Gemini 3.1 Pro. On Cursor's internal CursorBench it jumped from 58% to 70%.

Why it matters: Cognition's Devin team reported the same model can run "for hours" coherently, recovering from tool failures that would have stopped earlier models cold. This is the first quarter where mid-tier engineering tickets (file-touch, write-test, fix-bug-rerun-test) are reliably handled without supervision for a measurable share of issues.

3. Protein design — RFdiffusion3 and Boltz-2

Domain: de novo protein binder design and structure prediction.

Result:

RFdiffusion3 (December 2025) runs roughly 10× faster than the original RFdiffusion and operates at atom-level precision. Roughly half of unconditional designs express in soluble form when validated experimentally — a number that would have been called impossible in 2022.
Boltz-2 (June 2025) can co-fold a protein–ligand pair and output both the 3D complex and a binding-affinity estimate in about 20 seconds on a single GPU. The previous state of the art took 6–12 hours per pair.
AlphaFold 3 models complexes containing proteins, nucleic acids, small molecules, ions, and modified residues — the full molecular machinery of cells, not just protein-only structures.

Why it matters: The Boltz repository now has 1,300+ Slack community members and 200+ biotech adopters. This is no longer a research curiosity; it's the actual screening tier for early-stage drug discovery.

4. Weather forecasting — GraphCast and successors

Domain: 10-day global weather forecasts.

Result: DeepMind's GraphCast (and its 2025 successor with finer resolution) outperforms the European Centre for Medium-Range Weather Forecasts (ECMWF) HRES — the gold-standard physics-based model — on 90%+ of variables on the 10-day forecast. The model runs on a single TPU in under a minute, vs HRES's hours of supercomputer time.

Why it matters: ECMWF and the UK Met Office now run AI models in production alongside physics-based ones. This is the rare case where AI is faster and more accurate.

5. Radiology — narrow tasks

Domain: chest-CT lung-nodule detection, mammogram BI-RADS classification, brain-MRI tumor segmentation.

Result: Multiple FDA-cleared models (Aidoc, Lunit, Annalise.ai) match or exceed median radiologist performance on specific narrow tasks, and reduce miss rates when used as a second reader. The 2025 NHS UK breast-screening trial confirmed the second-reader improvement holds in real clinical workflow.

Caveat: "Match a radiologist" on a narrow task is not the same as "be a radiologist." The current state is human-AI collaboration outperforming either alone — which is still a real productivity win.

6. Formal theorem proving — AlphaProof + Lean 4

Domain: proving olympiad-level mathematical statements as machine-verified Lean 4 proofs.

Result: AlphaProof (Google DeepMind) solved 3 of the 4 IMO 2024 problems eligible for formal proof, including the hardest problem in the contest (solved by only 5 contestants). The system trains via reinforcement learning by generating and proving (or disproving) millions of self-generated variations.

Why it matters: Formal proofs are the gold standard of mathematical rigor — they cannot be wrong. AlphaProof shifted the question from "can AI do math" to "can AI write proofs that survive a machine checker." Yes, it now can.

7. Strategic games — DeepMind's continuing dominance

Domain: Go, chess, Dota 2, StarCraft II, no-limit poker — all long-since superhuman. The 2026 update is Diplomacy (the seven-player negotiation game): AI agents that combine an LLM negotiator with a classical search-based strategy module now beat human champions in the no-press variant and play at expert level in full-press. The combination is the interesting part — the LLM handles persuasion in natural language, the search module handles tactics.

What ties them together

Every benchmark above shares a structural pattern:

The task is well-specified (clear inputs, clear scoring).
There's a way to generate a lot of synthetic training data (self-play, simulator, formal verifier).
An RL or search-on-top-of-LLM loop closes the gap to human expert performance.

This is the recipe. It generalizes to any domain that fits all three. It does not generalize to ill-specified, poorly-scored tasks — which is why "AI replaces a generalist office worker" remains contested while "AI passes IMO" is now boring.

What it doesn't yet do

Equally important: long-horizon real-world tasks where the goal is fuzzy (write a great novel, run a successful company, conduct an original scientific research program from question to peer-reviewed result), where models still fall well short. The benchmarks above are local superhuman performance, not global.

The forward-looking benchmarks to watch

FrontierMath — graduate-level open math problems. Q1 2026 best is still ~30%; this is the next IMO-grade target.
Humanity's Last Exam — a multidisciplinary expert benchmark. Top scores rose from ~10% in mid-2024 to ~45% in early 2026. Expect a 60%+ score by year-end.
SWE-bench Multimodal — adds UI screenshots to the bug-fix task. Currently a wide open gap between text-only and screenshot-aware agents.
AlphaFold 4 rumors — supposed to model conformational dynamics, not just static structures. Would be the next biotech step-change.

Frequently asked

What does it mean that Claude Opus 4.7 hit 87.6% on SWE-bench Verified?

SWE-bench Verified is a cleaned subset of real GitHub issues from popular open-source Python repositories where Princeton and Stanford reviewers confirmed each issue has a clear, testable resolution. 87.6% means the agent successfully closed 87.6% of those issues to a passing test. This is up from 80.8% in Claude Opus 4.6 and represents the first time mid-tier engineering work is reliably automatable for a measurable share of tickets.

Did AI actually win an IMO gold medal?

Yes — an advanced version of Google DeepMind's Gemini Deep Think solved 5 of 6 problems at IMO 2025 perfectly, earning 35 points (gold-medal range). It produced rigorous natural-language proofs end-to-end within the 4.5-hour competition window. Earlier in 2024, AlphaProof and AlphaGeometry 2 hit silver-medal level (28 points).

How is AI changing weather forecasting?

DeepMind's GraphCast (and successors) outperform the European Centre's HRES — the previous gold-standard physics-based model — on more than 90% of variables in 10-day forecasts. The AI model runs on a single TPU in under a minute, versus hours of supercomputer time for HRES. ECMWF and the UK Met Office now run both in production.

What is Boltz-2 and why does it matter?

Boltz-2 (released June 2025) is a model that takes a protein and a candidate drug-like ligand and predicts both their 3D complex structure and the binding affinity in about 20 seconds on a single GPU. The previous state of the art required 6–12 hours per pair. This collapses the cost of early-stage virtual screening for drug discovery by roughly three orders of magnitude.

What can AI still not do better than humans in 2026?

Long-horizon, fuzzy, multi-stakeholder real-world work: writing a great novel, running a successful company end-to-end, conducting original peer-reviewed scientific research, or making a contested business decision under uncertainty. The benchmarks where AI is now superhuman all share three properties — well-specified task, generatable training data, and a clear scoring function. Tasks lacking those properties remain firmly in the human zone.

Sources & further reading

Last reviewed Apr 27, 2026. AI Pulled News is editorial; corrections welcome at /news/about.html.