Where Open-Weight Has Caught Closed: Llama 4, Qwen 3, DeepSeek, and Magistral Compared

AI Innovation Published May 01, 2026 · open-weight models · llama 4 · qwen 3 · deepseek · mistral magistral

By mid-2025, the benchmark gap between the best open-weight language models and OpenAI's GPT-4o had narrowed to the point where a careful analyst had to squint to see it on coding and structured-reasoning tasks. That milestone — projected for years, raced against by incumbents — now defines the competitive landscape that builders, enterprises, and AI teams are navigating in 2026.

This article puts the four most significant open-weight releases of 2025 through the same gauntlet of benchmarks and extracts what actually matters for production deployments: which model wins on coding, which wins on multilingual, where closed models retain a real advantage, and what the cost arithmetic looks like for teams serious about self-hosting.

The Four Contenders

Meta's Llama 4 family (April 5, 2025) marked a decisive pivot to mixture-of-experts (MoE) architecture. Llama 4 Maverick activates 17 billion parameters from a pool of approximately 400 billion, routing each token through 2 of 128 experts, with native multimodal input for text and images. Its sibling, Scout, uses the same 17B-active design with only 16 experts but advertises a 10-million-token context window — the longest announced for any open-weight model at that date. Both are released under the Meta Llama 4 Community License, which permits commercial use without fees for most applications but restricts redistribution of derivative models above a defined usage threshold.

Alibaba's Qwen 3 (April 28, 2025) arrived 23 days later with eight model sizes from 0.6B to 235B total parameters, the dense sizes under Apache 2.0. The flagship Qwen3-235B-A22B is a MoE model activating 22 billion parameters with documented support for 119 languages. Its defining engineering feature is a hybrid thinking mode: a user-toggled setting that causes the model to emit extended chain-of-thought traces before answering, making a single model serve as both a fast chat endpoint and a deliberate-reasoning system without maintaining two separate deployments.

DeepSeek's V3 and R1 (December 2024 and January 2025, both MIT-licensed) established the open-weight reasoning baseline that subsequent 2025 releases were measured against. V3 is a 671B-parameter MoE activating 37B parameters per forward pass; R1 layers a reinforcement-learning reasoning pipeline on the same backbone. The accompanying technical report — exhaustive by open-weight standards — set a new norm for disclosure that Alibaba would largely match and Meta would partially follow. The lab publicly announced development of a successor, widely referenced as R2.

Note on DeepSeek R2: No model explicitly named "DeepSeek-R2" had been released with an official technical report as of the most reliable data available for this article. Benchmark figures attributed to an R2 model circulating on social media in Q1–Q2 2025 lacked primary-source provenance. V3 and R1 are used as the DeepSeek reference points throughout.

Mistral AI's Magistral family (mid-2025) is the Paris lab's entry into dedicated reasoning-optimized models, positioned alongside its Codestral line for coding tasks. Mistral's technical disclosures for the Magistral generation were less detailed than those from Meta or Alibaba, and most benchmark data for Magistral comes from independent community evaluators rather than official reports.

Conjecture, marked clearly: Performance figures for Magistral in the sections below are drawn from community benchmark evaluations, not a primary Mistral technical report. Treat these figures as directional until Mistral publishes a full specification with reproducible methodology.

Coding: Where the Gap Has Closed

On HumanEval (pass@1, zero-shot), the top open-weight models crossed 92% in early 2025, matching Claude 3.5 Sonnet and GPT-4o. DeepSeek-R1 posted 92.3% in its January 2025 technical report; Qwen3-235B-A22B with thinking enabled reached a comparable level per Alibaba's April 2025 disclosures. Llama 4 Maverick sits in the 87–90% range in independent evaluations — a meaningful improvement over Llama 3.1-405B's roughly 84%, but behind the top open-weight reasoning tier.

LiveCodeBench, a contamination-resistant benchmark refreshed continuously from real competition problems published after training cutoffs, tells a more demanding story. On its hard-tier problems from late 2024 and early 2025, closed models — particularly o3 and Claude 3.7 Sonnet — maintain a meaningful lead. On medium-difficulty problems, however, Qwen3-32B and Llama 4 Maverick have substantially closed the gap with GPT-4o. Since medium-complexity tasks represent the majority of real production coding load, this is the practically relevant finding for most engineering teams.

SWE-bench Verified, which tests whether models can resolve real GitHub issues from open-source repositories, is complicated by agent scaffolding: the framework matters as much as the base model. Open-weight models fine-tuned for agentic coding with scaffolds like SWE-agent or Agentless have achieved results within 5–8 percentage points of the best Claude-based agent systems when the base model is Llama 4 Maverick or Qwen3-32B. That gap is no longer disqualifying for production software-engineering pipelines, though it remains real.

Reasoning and Math: The Reinforcement-Learning Effect

AIME 2024 has become the de facto open-weight stress test for mathematical reasoning. When DeepSeek-R1 posted 79.8% — per its January 2025 technical report, using a majority-vote pass@1 methodology — it outperformed OpenAI o1's reported 74.3%. That was the first time an open-weight model had beaten a flagship closed reasoning model on a held-out mathematics competition set under comparable evaluation conditions. Qwen3-235B-A22B with thinking enabled reported results in a similar tier in April 2025. Mistral Magistral Medium has been independently benchmarked in the 65–72% range on AIME 2024, competitive with earlier o1 variants but trailing the DeepSeek R1 and Qwen3 frontier.

On MATH-500, the ceiling effect is now pronounced: DeepSeek-R1 scored 97.3%, OpenAI o1 scored 96.4%, and Qwen3-235B with thinking enabled is reported above 96%. These margins fall within benchmark variance, and MATH-500 is approaching saturation as a practical differentiator among frontier models.

GPQA-Diamond — graduate-level questions in biology, chemistry, and physics designed by domain experts to resist web-search contamination — is more discriminating. DeepSeek-R1 scored 71.5%; OpenAI o1 scored 78.0%. That 6.5-point gap on a benchmark explicitly resistant to shortcut reasoning represents a genuine qualitative difference, not noise. Open-weight thinking modes narrow but do not close it. OpenAI's o3, at considerably higher inference compute, extends the lead further.

Multilingual: The Qwen Effect

English-first development has historically been a silent constraint on open-weight models. Qwen 3 breaks the pattern aggressively: 119-language native coverage — with documented strength in Chinese, Japanese, Korean, Arabic, and Southeast Asian languages — reflects Alibaba's core business footprint across Asia. In community multilingual evaluations, Qwen3-32B and above outperform Llama 4 Maverick on tasks involving non-Latin-script languages in formal registers and domain-specific vocabulary. For East Asian, South Asian, or Arabic deployments at scale, Qwen 3 is the practical open-weight default.

Llama 4 Scout and Maverick have materially better multilingual coverage than Llama 3, but training-corpus weighting still reflects Meta's primary user base. European multilingual use cases are competitive; non-European deployments are not.

Mistral Magistral's European provenance gives it a consistent advantage in French, Spanish, Italian, German, and Portuguese that community benchmarks confirm. For EU-based enterprise deployments with data-residency requirements that push toward on-premise hosting, Magistral's EU-origin data story is a practical consideration independent of raw benchmark scores.

Where Closed Models Still Lead

Long-context coherence above 500K tokens. Llama 4 Scout's 10M-token context window is compelling in specification; in practice, needle-in-a-haystack recall accuracy in independent evaluations degrades significantly past 500K tokens. Gemini 2.0 Pro maintains better practical coherence at extreme context lengths. The gap between marketed and practical context capability is largest in open-weight models.
Frontier reasoning (o3 tier). On ARC-AGI and the hardest AIME subsets, o3 maintains a lead that open-weight thinking modes narrow but do not eliminate, likely reflecting both architecture differences and inference-compute budget that closed providers can absorb in ways that open-weight self-hosters cannot.
Video and audio understanding. Llama 4 and Qwen 3 handle text and images; neither matches GPT-4o's native audio understanding or Gemini 2.0's video comprehension. For products built around meeting summaries, video search, or voice interfaces, the open-weight gap is structural, not incremental.
Post-training robustness. Closed models receive continuous RLHF updates post-deployment. Open-weight releases are point-in-time; jailbreak resistance and instruction-following on adversarial edge cases depend on downstream fine-tuners, not the model provider.

The Self-Hosting Cost Math

The following estimates use H100 SXM 80GB list pricing observed on Lambda Labs, RunPod, and CoreWeave in Q1–Q2 2025, and throughput figures derived from community vLLM benchmarks at production batch sizes. These are estimates, not guarantees — your actual numbers will vary with quantization level, batch composition, and spot-market availability.

Llama 4 Maverick — 8× H100 SXM

17B active / ~400B total, int8 quantization

At approximately $2.50/hr per H100 on Lambda Labs (Q2 2025): $20/hr fully loaded. At 1,500 output tokens/second throughput and 60% GPU utilization, effective cost is approximately $8–$14 per million output tokens. Together AI's Maverick API was listed at approximately $0.85/M output tokens in mid-2025. Self-hosting becomes economically rational above roughly 8–10 million output tokens per day of consistent sustained load — before that threshold, the API wins once engineering and idle-GPU overhead are factored in.

Qwen3-235B-A22B — 8× H100 SXM

22B active / 235B total, thinking mode disabled

Full expert weights must reside in GPU memory; recommended configuration is 8× H100 NVL or SXM: $20–$30/hr. Thinking mode generates 3–5× the output token count per query, multiplying effective inference cost proportionally. Reserve thinking mode for batch workloads, not latency-sensitive user-facing APIs. Apache 2.0 license imposes no royalty overhead on commercial use, which meaningfully changes the long-run cost comparison.

DeepSeek-V3 / R1 — 16× H100 SXM

37B active / 671B total, full precision

Full-precision deployment requires approximately 16× H100 SXM 80GB for comfortable memory headroom: $40–$50/hr. Community int4 quantizations fit on 8× H100 but introduce measurable quality degradation on MATH-500 and GPQA-Diamond tasks. DeepSeek's own API charged approximately $0.27/M input and $1.10/M output in H1 2025 — one of the cheapest frontier-model APIs available — making full-scale self-hosting rational only at very high sustained volumes (100M+ tokens/day).

The Builder's Takeaway

For the majority of production coding and structured-reasoning workloads, open-weight models at the 17B–32B active-parameter scale have achieved practical parity with GPT-4o. Teams with data-residency requirements, high token volumes, or fine-tuning needs have a credible open-weight path for tasks that previously required a closed-model API by default. The burden of proof has shifted: not evaluating an open-weight model now requires justification, not the reverse.

The remaining frontier-model advantages — o3-tier reasoning, long-video understanding, and continuously updated alignment — apply to a narrower slice of production use cases than vendor benchmark marketing suggests. The better question in 2026 is not "open or closed" but "which open-weight model for this specific workload" — and that question is now answerable with workload-specific evaluation rather than defaulting to any single provider's API.

Frequently asked

Is Llama 4 Maverick good enough to replace GPT-4o for production coding tasks?

For medium-complexity coding tasks — which represent the majority of real production load — Llama 4 Maverick and Qwen3-32B benchmark within a few percentage points of GPT-4o on contamination-resistant evaluations like LiveCodeBench. For the hardest tier, involving complex multi-file edits or deeply ambiguous specifications, closed models with continuous post-training updates still hold a real advantage. Evaluate on your own workload distribution before committing; aggregate benchmarks mask significant task-type variance.

Which open-weight model is best for a multilingual product?

Qwen 3 is the strongest open-weight option for non-European multilingual deployments, with documented coverage of 119 languages and particular strength in Chinese, Japanese, Korean, and Arabic. For European-language-only deployments, Mistral Magistral is competitive in French, Spanish, Italian, and German, with EU-origin data provenance that is useful for GDPR-sensitive applications. Llama 4 Maverick is competitive for English-dominant multilingual use cases but trails Qwen 3 meaningfully on non-Latin-script benchmarks.

Can Qwen3-235B actually run on a single server?

Not at full precision — the full model requires roughly 400–500 GB of GPU memory, meaning 6–8 H100 80GB GPUs at minimum. Community Q4 quantizations reduce this to approximately 120–140 GB (2–4 H100s), but quality degradation on math and reasoning tasks is measurable in independent evaluations. The Qwen3-32B dense model, by contrast, fits on a single H100 at int8 precision and delivers most of the flagship's capability at a fraction of the infrastructure cost — a better choice for most teams starting out.

At what token volume does self-hosting beat API pricing?

For Llama 4 Maverick at H100 rates of approximately $2.50/hr across 8 GPUs, break-even with Together AI's API pricing (approximately $0.85/M output tokens, mid-2025) requires roughly 8–10 million output tokens per day of sustained load, factoring in engineering and idle-GPU overhead. Qwen3-235B-A22B tips even later due to the larger memory footprint. DeepSeek-V3 and R1 self-hosting is hard to justify below very high volumes because DeepSeek's own API pricing is unusually low.

Where do open-weight models still clearly fall behind frontier closed models?

Three areas stand out: graduate-level science reasoning on benchmarks like GPQA-Diamond (open-weight trails by 6–15 points compared to o3 as of mid-2025); practical long-context coherence above 500K tokens (recall degrades significantly despite headline context-window sizes); and video and audio multimodal understanding (no open-weight model currently matches GPT-4o native audio or Gemini 2.0 video comprehension). Post-training robustness is also a structural gap: open-weight releases are point-in-time, while closed models receive continuous alignment updates.

Sources & further reading

Last reviewed May 01, 2026. AI Pulled News is editorial; corrections welcome at /news/about.html.