Where Open-Weight Has Caught Closed: Llama 4, Qwen 3, DeepSeek, and Magistral Compared
By mid-2025, the benchmark gap between the best open-weight language models and OpenAI's GPT-4o had narrowed to the point where a careful analyst had to squint to see it on coding and structured-reasoning tasks. That milestone — projected for years, raced against by incumbents — now defines the competitive landscape that builders, enterprises, and AI teams are navigating in 2026.
This article puts the four most significant open-weight releases of 2025 through the same gauntlet of benchmarks and extracts what actually matters for production deployments: which model wins on coding, which wins on multilingual, where closed models retain a real advantage, and what the cost arithmetic looks like for teams serious about self-hosting.
The Four Contenders
Meta's Llama 4 family (April 5, 2025) marked a decisive pivot to mixture-of-experts (MoE) architecture. Llama 4 Maverick activates 17 billion parameters from a pool of approximately 400 billion; per Meta's launch post, each token passes through a shared expert plus one of 128 routed experts, and the model accepts text and images natively. Its sibling, Scout, uses the same 17B-active design with only 16 experts but advertises a 10-million-token context window, the longest announced for any open-weight model at that date. Both are released under the Llama 4 Community License, which permits commercial use without fees for most applications but requires companies with more than 700 million monthly active users to obtain a separate license from Meta.
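To make the routing idea concrete, here is a minimal sketch of top-k expert selection for a single token in pure Python. The function name `route_token`, the toy gate scores, and the softmax-then-renormalize ordering are illustrative assumptions, not Meta's implementation; production routers run as batched tensor operations and may normalize differently.

```python
import math

def route_token(gate_logits, k=1):
    """Pick the k highest-scoring experts for one token.

    gate_logits: one raw router score per expert (toy values here).
    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    # Softmax over all experts, shifted by the max for numerical stability.
    peak = max(gate_logits)
    exps = [math.exp(x - peak) for x in gate_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep only the k most probable experts.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

    # Renormalize so the selected experts' weights sum to 1.
    kept = sum(probs[i] for i in top)
    return [(i, probs[i] / kept) for i in top]

# Toy example with 8 experts instead of 128 (hypothetical scores).
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]
print(route_token(logits, k=1))
print(route_token(logits, k=2))
```

Only the selected experts run for a given token, which is why the active parameter count is a small fraction of the total.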
Alibaba's Qwen 3 (April 28, 2025) arrived 23 days later with eight models, six dense and two MoE, spanning 0.6B to 235B total parameters, all released under Apache 2.0. The flagship Qwen3-235B-A22B is a MoE model activating 22 billion parameters with documented support for 119 languages. Its defining engineering feature is a hybrid thinking mode: a user-toggled setting that makes the model emit an extended chain-of-thought trace before answering, so a single deployment can serve as both a fast chat endpoint and a deliberate-reasoning system.
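A deployment that serves both modes from one endpoint usually has to separate the reasoning trace from the final answer before showing it to users. The sketch below assumes the trace arrives wrapped in `<think>...</think>` tags, which is how Qwen3's published chat format marks it; the function name `split_thinking` and the sample strings are illustrative.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(raw_output):
    """Return (trace, answer) from a raw completion.

    trace is the text inside the first <think> block ("" if there is none);
    answer is whatever remains once all think blocks are removed.
    """
    match = THINK_BLOCK.search(raw_output)
    trace = match.group(1).strip() if match else ""
    answer = THINK_BLOCK.sub("", raw_output).strip()
    return trace, answer

# Made-up completions: one with thinking enabled, one without.
thinking = "<think>\n17 * 3 = 51, plus 4 is 55.\n</think>\n\nThe result is 55."
fast = "The result is 55."

print(split_thinking(thinking))
print(split_thinking(fast))
```

Handling the no-trace case in the same function means the calling code does not need to know which mode produced the reply.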
DeepSeek's V3 and R1 (December 2024 and January 2025, both MIT-licensed) established the open-weight reasoning baseline that subsequent 2025 releases were measured against. V3 is a 671B-parameter MoE activating 37B parameters per forward pass; R1 layers a reinforcement-learning reasoning pipeline on the same backbone. The accompanying technical report — exhaustive by open-weight standards — set a new norm for disclosure that Alibaba would largely match and Meta would partially follow. The lab publicly announced development of a successor, widely referenced as R2.
Mistral AI's Magistral family (mid-2025) is the Paris lab's entry into dedicated reasoning-optimized models, positioned alongside its Codestral line for coding tasks. Mistral's technical disclosures for the Magistral generation were less detailed than those from Meta or Alibaba, and most benchmark data for Magistral comes from independent community evaluators rather than official reports.
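The total and active parameter counts quoted above translate into two different deployment numbers: compute per token scales with active parameters, while memory scales with total parameters, because every expert must stay loaded. The arithmetic below is a rough sketch using only the figures in this article; the helper names and the 16-bit and 4-bit precisions are illustrative choices, and the memory figure covers weights alone, not KV cache or activations.

```python
# (total, active) parameter counts in billions, as quoted in this article.
MODELS = {
    "Llama 4 Maverick": (400, 17),
    "Qwen3-235B-A22B": (235, 22),
    "DeepSeek-V3": (671, 37),
}

def active_fraction(total_b, active_b):
    """Share of parameters used for each token."""
    return active_b / total_b

def weight_memory_gb(total_b, bytes_per_param):
    """Approximate memory for the weights alone, in decimal GB.

    Billions of parameters times bytes per parameter gives gigabytes.
    """
    return total_b * bytes_per_param

for name, (total, active) in MODELS.items():
    print(f"{name}: {active_fraction(total, active):.1%} active, "
          f"~{weight_memory_gb(total, 2):.0f} GB at 16-bit, "
          f"~{weight_memory_gb(total, 0.5):.0f} GB at 4-bit")
```

MoE lowers the compute bill per token, not the memory needed to hold the model.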
Coding: Where the Gap Has Closed
On HumanEval (pass@1, zero-shot), the top open-weight models crossed 92% in early 2025, matching Claude 3.5 Sonnet and GPT-4o. DeepSeek-R1 posted 92.3% in its January 2025 technical report; Qwen3-235B-A22B with thinking enabled reached a comparable level per Alibaba's April 2025 disclosures. Llama 4 Maverick sits in the 87–90% range in independent evaluations — a meaningful improvement over Llama 3.1-405B's roughly 84%, but behind the top open-weight reasoning tier.
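For readers unfamiliar with the metric, pass@k is usually computed with the unbiased estimator introduced alongside HumanEval: generate n samples per problem, count the c that pass the unit tests, and estimate the chance that at least one of k draws passes. The sketch below uses made-up per-problem counts; whether a given vendor report used this estimator or a single greedy sample is not something the numbers above reveal.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed the tests, k: draw budget.
    Equals 1 - C(n - c, k) / C(n, k).
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Made-up results for four problems, 10 samples each.
toy = [(10, 10), (10, 5), (10, 0), (10, 1)]
print(benchmark_pass_at_k(toy, k=1))
print(benchmark_pass_at_k(toy, k=5))
```

For k = 1 the estimator reduces to c / n per problem, so pass@1 is simply the average fraction of correct samples.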
LiveCodeBench, a contamination-resistant benchmark refreshed continuously from real competition problems published after training cutoffs, tells a more demanding story. On its hard-tier problems from late 2024 and early 2025, closed models — particularly o3 and Claude 3.7 Sonnet — maintain a meaningful lead. On medium-difficulty problems, however, Qwen3-32B and Llama 4 Maverick have substantially closed the gap with GPT-4o. Since medium-complexity tasks represent the majority of real production coding load, this is the practically relevant finding for most engineering teams.
SWE-bench Verified, which tests whether models can resolve real GitHub issues from open-source repositories, is complicated by agent scaffolding: the framework matters as much as the base model. Open-weight models fine-tuned for agentic coding with scaffolds like SWE-agent or Agentless have achieved results within 5–8 percentage points of the best Claude-based agent systems when the base model is Llama 4 Maverick or Qwen3-32B. That gap is no longer disqualifying for production software-engineering pipelines, though it remains real.
Reasoning and Math: The Reinforcement-Learning Effect
AIME 2024 has become the de facto open-weight stress test for mathematical reasoning. DeepSeek-R1 posted 79.8% pass@1 in its January 2025 technical report, a single-answer accuracy averaged over repeated samples rather than a majority vote, against OpenAI o1's reported 74.3%. That was the first time an open-weight model had beaten a flagship closed reasoning model on a held-out mathematics competition set under comparable evaluation conditions. Qwen3-235B-A22B with thinking enabled reported results in a similar tier in April 2025. Mistral Magistral Medium has been independently benchmarked in the 65–72% range on AIME 2024, competitive with earlier o1 variants but trailing the DeepSeek-R1 and Qwen3 frontier.
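Single-sample accuracy (pass@1) and majority voting over many samples are different metrics, and scores are only comparable when both sides use the same one. The sketch below computes both from the same set of sampled answers; the answers are invented for illustration and the function names are not taken from any evaluation harness.

```python
from collections import Counter

def pass_at_1(samples, correct):
    """Fraction of sampled answers that match the reference answer."""
    return sum(a == correct for a in samples) / len(samples)

def majority_vote(samples, correct):
    """1.0 if the most common sampled answer is correct, else 0.0."""
    top_answer, _ = Counter(samples).most_common(1)[0]
    return 1.0 if top_answer == correct else 0.0

# Invented data: (four sampled answers, reference answer) per problem.
problems = [
    (["204", "204", "113", "204"], "204"),
    (["025", "073", "025", "540"], "025"),
    (["809", "809", "809", "809"], "809"),
    (["033", "045", "045", "601"], "033"),
]

avg_pass_at_1 = sum(pass_at_1(s, c) for s, c in problems) / len(problems)
avg_majority = sum(majority_vote(s, c) for s, c in problems) / len(problems)
print(f"pass@1 = {avg_pass_at_1:.3f}, majority vote = {avg_majority:.3f}")
```

On this toy data majority voting scores higher than pass@1, which is the usual direction and the reason the two should never be mixed in one comparison.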
On MATH-500, the ceiling effect is now pronounced: DeepSeek-R1 scored 97.3%, OpenAI o1 scored 96.4%, and Qwen3-235B with thinking enabled is reported above 96%. These margins fall within benchmark variance, and MATH-500 is approaching saturation as a practical differentiator among frontier models.
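A quick way to sanity-check the variance claim is to compare the score gap with its sampling error. The sketch below treats each score as an independent binomial proportion on 500 questions, which is a simplifying assumption: both models answer the same questions, so a paired test would be more exact. The function name is illustrative.

```python
from math import sqrt

def diff_standard_error(p1, p2, n):
    """Standard error of the difference between two accuracies,
    treating each as an independent binomial proportion on n items."""
    return sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)

n = 500                  # questions in MATH-500
r1, o1 = 0.973, 0.964    # scores quoted above
se = diff_standard_error(r1, o1, n)
gap = r1 - o1
print(f"gap = {gap * 100:.1f} pts, standard error = {se * 100:.1f} pts")
```

With these inputs the gap is smaller than one standard error, which is consistent with reading the two scores as a tie.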
GPQA-Diamond, a set of graduate-level questions in biology, chemistry, and physics written by domain experts so that the answers cannot simply be looked up with a web search, is more discriminating. DeepSeek-R1 scored 71.5%; OpenAI o1 scored 78.0%. That 6.5-point gap on a benchmark explicitly resistant to shortcut reasoning represents a genuine qualitative difference, not noise. Open-weight thinking modes narrow but do not close it. OpenAI's o3, at considerably higher inference compute, extends the lead further.
Multilingual: The Qwen Effect
English-first development has historically been a silent constraint on open-weight models. Qwen 3 breaks the pattern aggressively: 119-language native coverage — with documented strength in Chinese, Japanese, Korean, Arabic, and Southeast Asian languages — reflects Alibaba's core business footprint across Asia. In community multilingual evaluations, Qwen3-32B and above outperform Llama 4 Maverick on tasks involving non-Latin-script languages in formal registers and domain-specific vocabulary. For East Asian, South Asian, or Arabic deployments at scale, Qwen 3 is the practical open-weight default.
Llama 4 Scout and Maverick have materially better multilingual coverage than Llama 3, but training-corpus weighting still reflects Meta's primary user base. Both are competitive for European-language use cases and fall behind Qwen 3 for non-European ones.
Mistral Magistral's European provenance gives it a consistent advantage in French, Spanish, Italian, German, and Portuguese that community benchmarks confirm. For EU-based enterprise deployments with data-residency requirements that push toward on-premise hosting, Magistral's EU-origin data story is a practical consideration independent of raw benchmark scores.
Where Closed Models Still Lead
- Long-context coherence above 500K tokens. Llama 4 Scout's 10M-token context window is compelling in specification; in practice, needle-in-a-haystack recall accuracy in independent evaluations degrades significantly past 500K tokens. Gemini 2.0 Pro maintains better practical coherence at extreme context lengths. The gap between marketed and practical context capability is largest in open-weight models.
- Frontier reasoning (o3 tier). On ARC-AGI and the hardest AIME subsets, o3 maintains a lead that open-weight thinking modes narrow but do not eliminate, likely reflecting both architecture differences and inference-compute budget that closed providers can absorb in ways that open-weight self-hosters cannot.
- Video and audio understanding. Llama 4 and Qwen 3 handle text and images; neither matches GPT-4o's native audio understanding or Gemini 2.0's video comprehension. For products built around meeting summaries, video search, or voice interfaces, the open-weight gap is structural, not incremental.
- Post-training robustness. Closed models receive continuous RLHF updates post-deployment. Open-weight releases are point-in-time; jailbreak resistance and instruction-following on adversarial edge cases depend on downstream fine-tuners, not the model provider.
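The long-context point above is straightforward to test in-house. Below is a minimal needle-in-a-haystack sketch: it buries one fact at different depths in filler text and checks whether the reply contains it. The `ask_model` argument is a caller-supplied function standing in for whatever inference client is in use, and `truncating_model` is a fake model included only so the example runs; all names and strings are hypothetical.

```python
def build_haystack(needle, filler, n_sentences, depth):
    """Place needle at a relative depth (0.0 = start, 1.0 = end)
    inside n_sentences copies of a filler sentence."""
    position = round(depth * n_sentences)
    sentences = [filler] * n_sentences
    sentences.insert(position, needle)
    return " ".join(sentences)

def recall_by_depth(ask_model, needle, answer, question, filler,
                    n_sentences, depths):
    """Return {depth: bool} for whether each reply contains the answer.

    ask_model is any function mapping a prompt string to a reply string.
    """
    results = {}
    for depth in depths:
        prompt = build_haystack(needle, filler, n_sentences, depth)
        reply = ask_model(prompt + "\n\n" + question)
        results[depth] = answer in reply
    return results

# Fake model that only "reads" the first 2,000 characters of its prompt.
def truncating_model(prompt):
    window = prompt[:2000]
    return "7421" if "7421" in window else "I don't know."

print(recall_by_depth(
    truncating_model,
    needle="The vault code is 7421.",
    answer="7421",
    question="What is the vault code?",
    filler="The sky was a pale shade of grey that morning.",
    n_sentences=200,
    depths=[0.0, 0.5, 1.0],
))
```

Running the same sweep at increasing `n_sentences` against a real endpoint shows where recall starts to fall off, which is the number that matters more than the advertised window.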
The Self-Hosting Cost Math
The following estimates use H100 SXM 80GB list pricing observed on Lambda Labs, RunPod, and CoreWeave in Q1–Q2 2025, and throughput figures derived from community vLLM benchmarks at production batch sizes. These are estimates, not guarantees — your actual numbers will vary with quantization level, batch composition, and spot-market availability.
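The underlying arithmetic is simple enough to rerun with your own quotes. The sketch below shows the two formulas involved: cost per million tokens on a rented cluster, and the monthly volume at which an always-on cluster matches an API bill. Every input value is a placeholder chosen for illustration, not a measured price or throughput figure, and the function names are hypothetical.

```python
def self_host_cost_per_million(gpu_hourly_usd, n_gpus, tokens_per_second):
    """USD per million generated tokens, assuming full utilization."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd * n_gpus / tokens_per_hour * 1_000_000

def break_even_tokens_per_month(gpu_hourly_usd, n_gpus, api_usd_per_million,
                                hours_per_month=730):
    """Monthly token volume at which an always-on cluster costs the same
    as paying the API rate for those tokens."""
    monthly_cluster_cost = gpu_hourly_usd * n_gpus * hours_per_month
    return monthly_cluster_cost / api_usd_per_million * 1_000_000

# Placeholder inputs, NOT measured figures: substitute your own numbers.
GPU_HOURLY = 3.00    # USD per GPU-hour (hypothetical)
N_GPUS = 8           # one 8-GPU node (hypothetical)
THROUGHPUT = 2000    # aggregate tokens/second across the node (hypothetical)
API_PRICE = 10.00    # USD per million tokens (hypothetical)

cost = self_host_cost_per_million(GPU_HOURLY, N_GPUS, THROUGHPUT)
volume = break_even_tokens_per_month(GPU_HOURLY, N_GPUS, API_PRICE)
print(f"self-hosted: ${cost:.2f} per million tokens at full utilization")
print(f"break-even: {volume / 1e9:.2f} billion tokens per month")
```

The first figure assumes the GPUs are always busy; at lower utilization the effective cost per token rises in proportion, which is why the break-even volume is the more useful number for planning.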
The Builder's Takeaway
For the majority of production coding and structured-reasoning workloads, open-weight models at the 17B–32B active-parameter scale have achieved practical parity with GPT-4o. Teams with data-residency requirements, high token volumes, or fine-tuning needs have a credible open-weight path for tasks that previously required a closed-model API by default. The burden of proof has shifted: skipping an open-weight evaluation is now the decision that needs justifying.
The remaining frontier-model advantages — o3-tier reasoning, long-video understanding, and continuously updated alignment — apply to a narrower slice of production use cases than vendor benchmark marketing suggests. The better question in 2026 is not "open or closed" but "which open-weight model for this specific workload" — and that question is now answerable with workload-specific evaluation rather than defaulting to any single provider's API.
Sources & further reading
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (arXiv, January 2025)
- Meta AI: Introducing Llama 4 — Multimodal Intelligence at Scale (Official Blog, April 2025)
- Qwen Team: Qwen3 — Think Deeper, Act Faster (Official Blog, April 2025)
- LMSYS Chatbot Arena Leaderboard (lmarena.ai)
- LiveCodeBench: Holistic and Contamination-Free Evaluation of Large Language Models for Code (arXiv, 2024)
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark (arXiv, 2023)
Last reviewed May 01, 2026. AI Pulled News is editorial; corrections welcome at /news/about.html.