AI News · May 07, 2026
o3 set records on GPQA Diamond (87.7%), SWE-bench Verified (71.7%), and ARC-AGI-1 (75.7%). The practical cost: $40/M output tokens and spend-history gatekeeping.

OpenAI o3 and o4-mini: Record Benchmarks, $40/M Output, and the Rate-Limit Wall

AI News Published May 07, 2026 · openai · o3 · reasoning models · llm benchmarks · api pricing

In mid-April 2025, OpenAI released o3 and o4-mini into general API availability simultaneously. Both are reasoning models: they generate a hidden chain-of-thought before producing a final response, and both can invoke tools (web search, code execution, image analysis) during that reasoning phase rather than only before or after it. That detail is what separates them from OpenAI's earlier reasoning models.

The headline benchmark numbers were record-setting as of their release date, and researchers with API access can attempt to reproduce them. Whether they justify the pricing and the access hurdles developers actually encounter is the harder question.

What OpenAI Released

o3 is the full-capability successor to o1 (September 2024) and o3-mini (January 2025). OpenAI skipped the "o2" name publicly to avoid conflict with O2, the UK telecommunications company. o4-mini is the companion model: same 200K-token context window, roughly 9× lower output cost ($4.40/M vs $40/M), and full multimodal reasoning, meaning it can actively examine and reason about images within its thinking chain rather than merely accept them as passive input context.

Both models accept structured function-calling, file attachments, web search, and code execution as tools available inside the reasoning phase. Prior agentic pipelines required developers to orchestrate retrieve-then-reason loops manually. With o3 and o4-mini, the retrieval decision can be delegated to the model itself within a single API call, which substantially reduces boilerplate for common agentic use cases.

Benchmark Results: The Numbers That Matter

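OpenAI published evaluations across four well-known suites: GPQA Diamond, AIME 2024, SWE-bench Verified, and ARC-AGI-1. The headline scores cited in this article are 87.7% on GPQA Diamond, 71.7% on SWE-bench Verified, and 75.7% on ARC-AGI-1.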

Note on benchmark integrity: GPQA Diamond, AIME 2024, and ARC-AGI-1 are fixed, public test sets. Contamination (models trained on benchmark questions or structurally similar problems) is possible and difficult to audit externally. SWE-bench Verified is harder to game because a submission counts only if the generated patch passes the repository's tests, although its issues are also public. The published numbers are credible; independent third-party replication at scale remains the definitive check.

Pricing: The $40/M Output Reality

OpenAI's list prices at general availability in April 2025 put output tokens at $40 per million for o3 and $4.40 per million for o4-mini.

At $40/M output, a single o3 response averaging 2,000 visible output tokens costs approximately $0.08. Scale that to 10,000 queries per day and output costs alone reach $800/day, before accounting for input tokens, caching fees, or the reasoning tokens the model generates internally but you never see. o4-mini's $4.40/M output brings those same 10,000 queries to roughly $88/day: meaningful but defensible for high-value automated workflows where inference cost is a small fraction of task value.
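The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that uses only the figures quoted in this section; the function name `daily_output_cost` is an illustrative choice, and input and reasoning tokens are deliberately left out, as in the paragraph above.

```python
def daily_output_cost(price_per_million: float,
                      tokens_per_response: int,
                      responses_per_day: int) -> float:
    """Return daily spend in dollars on visible output tokens only."""
    total_tokens = tokens_per_response * responses_per_day
    return price_per_million * total_tokens / 1_000_000


# Figures from the article: 2,000 visible output tokens, 10,000 queries/day.
o3_daily = daily_output_cost(40.00, 2_000, 10_000)
o4_mini_daily = daily_output_cost(4.40, 2_000, 10_000)

print(f"o3: ${o3_daily:,.2f}/day")
print(f"o4-mini: ${o4_mini_daily:,.2f}/day")
```

Swapping in your own average response length and query volume gives a floor for spend, not a forecast, because hidden reasoning tokens are billed on top.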

ChatGPT Pro ($200/month flat rate) provides unlimited o3 access under fair-use guidelines for individuals. For API builders, o4-mini is the realistic production workhorse; o3 is best reserved for tasks where benchmark-level precision materially changes a business outcome, such as legal document analysis, complex scientific literature synthesis, and high-stakes code refactors.

The Rate-Limit Wall

Pricing is only half the access story. OpenAI gates API throughput behind a tiered spend-history system that new developers frequently underestimate.

The practical effect: teams building production systems on o3 must either accumulate API spend history over weeks or negotiate an enterprise agreement. The high per-token price simultaneously acts as a usage gate and a compute-funding mechanism — a deliberate choice to limit who can achieve high throughput on the most capable model while ensuring the revenue to serve it at scale.

Tool Use During Thinking: What It Actually Means in Practice

When o3 invokes a web search, that search result enters the hidden reasoning chain — the model reasons about retrieved content before synthesizing a final response. A single API call can trigger a real-time lookup, parse structured data, run a calculation, examine an image, and reason across all of it without explicit developer orchestration. For many agentic tasks, this replaces a multi-step pipeline with a single call.

The billing caveat is material: every token generated during the thinking phase is charged as output tokens at the same rate as the final visible response. A query that triggers two web searches and one code execution may accumulate three to ten times more output tokens than the final text suggests. Production deployments should log the usage.completion_tokens_details.reasoning_tokens field explicitly on every response to understand true per-call costs before committing to a pricing model.
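A minimal sketch of that logging step follows. It works on a plain dictionary shaped like the `usage` object named above and assumes that `completion_tokens` already includes the hidden reasoning tokens; the helper name `split_completion_tokens` and the sample numbers are hypothetical, chosen only for illustration.

```python
import json


def split_completion_tokens(usage: dict) -> dict:
    """Separate billed output tokens into hidden reasoning and visible text.

    Assumption: usage["completion_tokens"] counts reasoning tokens too.
    """
    billed = usage.get("completion_tokens", 0)
    details = usage.get("completion_tokens_details") or {}
    reasoning = details.get("reasoning_tokens", 0)
    return {
        "billed_output_tokens": billed,
        "reasoning_tokens": reasoning,
        "visible_tokens": billed - reasoning,
    }


# Hypothetical usage payload; the values are made up for illustration.
sample_usage = {
    "prompt_tokens": 120,
    "completion_tokens": 2600,
    "completion_tokens_details": {"reasoning_tokens": 2000},
}

record = split_completion_tokens(sample_usage)
print(json.dumps(record))  # one structured log line per response
```

Writing one such record per call makes it possible to compute real per-call cost later, instead of estimating it from the length of the visible answer.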

Conjecture, marked clearly: OpenAI's thinking-token billing creates a structural incentive worth flagging. Models that reason more extensively cost more, so revenue scales with reasoning depth. This does not mean OpenAI deliberately over-provisions thinking tokens; longer chains of thought produce measurably better outputs on hard tasks, and that relationship is genuine. But developers should empirically measure the thinking-to-output token ratio on their own tasks before setting budgets. For complex coding and scientific reasoning tasks, ratios observed in early production deployments suggest o3 reasoning tokens commonly run 4–10× the final visible output count, which raises the effective per-query cost well above what the sticker output price implies.
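To see what a 4–10× ratio does to a budget, the sketch below applies it to the 2,000-token response used earlier in this article. It assumes billed output equals visible tokens plus reasoning tokens, and it treats the ratio as a fixed multiplier, which is a simplification; the function name is illustrative.

```python
def effective_cost_per_query(visible_tokens: int,
                             reasoning_ratio: float,
                             price_per_million: float) -> float:
    """Cost in dollars when reasoning tokens = reasoning_ratio * visible tokens."""
    billed_tokens = visible_tokens * (1 + reasoning_ratio)
    return billed_tokens * price_per_million / 1_000_000


O3_OUTPUT_PRICE = 40.00  # dollars per million output tokens, from the article

sticker = effective_cost_per_query(2_000, 0, O3_OUTPUT_PRICE)
low = effective_cost_per_query(2_000, 4, O3_OUTPUT_PRICE)
high = effective_cost_per_query(2_000, 10, O3_OUTPUT_PRICE)

print(f"sticker: ${sticker:.2f}, at 4x: ${low:.2f}, at 10x: ${high:.2f}")
```

Under these assumptions the per-query cost is 5 to 11 times the sticker figure, which is why the measured ratio matters more than the list price when budgeting.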

Competitive Snapshot (April 2025)

At launch, o3 and o4-mini faced three immediately relevant alternatives, of which Gemini 2.5 Pro is the one compared directly in this section.

o3 held a measurable benchmark lead on GPQA Diamond (87.7% vs 84.0% for Gemini 2.5 Pro) and on ARC-AGI-1 at launch. Whether that translates to real product-level quality improvement depends entirely on the specific task. Benchmark leadership and deployment-level impact are correlated but not identical, and the pricing gap is wide enough that empirical testing on your actual use case is mandatory before committing to o3 in production.

Conjecture, marked clearly: The 4× pricing premium of o3 ($40/M output) over Gemini 2.5 Pro ($10/M output) is difficult to attribute to compute-cost differences alone. Google and OpenAI operate at broadly comparable hardware efficiency for large frontier models; a 4× gap is too large to explain on infrastructure economics without significantly different architectural choices or margin structures. OpenAI appears to be capturing brand premium and enterprise buyer willingness-to-pay. Developers running cost-constrained production workloads should run side-by-side evaluations on their specific tasks — not benchmark tables — before assuming the benchmark premium translates linearly to product quality gains.
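One way to structure such a side-by-side evaluation is sketched below. Everything in it is hypothetical: the two stub functions stand in for real API calls, the test cases and token counts are made up, and exact string matching is only one possible scoring rule. The point is the shape of the comparison, which is accuracy alongside cost per correct answer on your own cases.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EvalResult:
    accuracy: float
    total_cost: float
    cost_per_correct: float


def evaluate(model: Callable[[str], Tuple[str, int]],
             cases: List[Tuple[str, str]],
             price_per_million: float) -> EvalResult:
    """Run every case through `model`, which returns (answer, billed_output_tokens)."""
    if not cases:
        raise ValueError("at least one test case is required")
    correct = 0
    billed_tokens = 0
    for prompt, expected in cases:
        answer, output_tokens = model(prompt)
        billed_tokens += output_tokens
        if answer.strip() == expected:
            correct += 1
    total_cost = billed_tokens * price_per_million / 1_000_000
    cost_per_correct = total_cost / correct if correct else float("inf")
    return EvalResult(correct / len(cases), total_cost, cost_per_correct)


# Hypothetical stand-ins for real model calls; replace with your own clients.
def stub_cheaper_model(prompt: str) -> Tuple[str, int]:
    return ("4", 1_000) if prompt == "2+2" else ("unsure", 1_000)


def stub_pricier_model(prompt: str) -> Tuple[str, int]:
    return ({"2+2": "4", "3+3": "6"}.get(prompt, "unsure"), 5_000)


cases = [("2+2", "4"), ("3+3", "6")]
result_a = evaluate(stub_cheaper_model, cases, price_per_million=10.0)
result_b = evaluate(stub_pricier_model, cases, price_per_million=40.0)
print(result_a)
print(result_b)
```

Cost per correct answer is a deliberately blunt metric. It makes the trade-off visible when a more expensive model is more accurate, but it ignores how costly a wrong answer is for a given product.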

Frequently asked

What is the core difference between o3 and o4-mini?
o3 is OpenAI's highest-capability reasoning model as of April 2025, with 87.7% GPQA Diamond and 71.7% SWE-bench Verified. o4-mini is the companion at roughly 9× lower output cost ($4.40/M vs $40/M), with full multimodal reasoning — it can actively reason about images inside its thinking chain. For most production deployments, o4-mini delivers sufficient performance; o3 is reserved for tasks where benchmark-level accuracy materially changes a business outcome.
Why do thinking model output tokens cost so much in practice?
Thinking models generate reasoning tokens, a hidden chain-of-thought, before producing a visible answer, and those reasoning tokens are billed at the same output-token rate as the final text. On complex tasks, o3's internal reasoning can consume 4–10× as many tokens as the visible final response. The API offers only coarse controls: the reasoning_effort setting (low, medium, high) and the max_completion_tokens ceiling, which counts reasoning and visible tokens together. Neither sets an exact reasoning budget, so developers still confirm actual counts after the fact via the usage.completion_tokens_details.reasoning_tokens field.
How does o3's ARC-AGI-1 score compare to human performance?
Humans average roughly 85% on ARC-AGI-1; o3 scored 75.7% zero-shot at launch in April 2025, the highest any AI system had publicly achieved. ARC-AGI-1 was designed by François Chollet specifically to require fluid reasoning that cannot be solved through training-data memorization. The ARC Prize Foundation called the score notable while emphasizing the remaining human advantage — o3 is below average human performance, not above it.
Can developers access o3 without building API spend history first?
Yes, via ChatGPT Pro at $200/month flat, which provides unlimited o3 access for individual use under fair-use guidelines. For API access, OpenAI's tiered spend-history system heavily throttles new accounts on o3 regardless of payment method. Teams that need high API throughput immediately should pursue OpenAI's enterprise agreement process rather than trying to build spend history organically.
Why did OpenAI skip the 'o2' version number?
OpenAI publicly moved from o1 (September 2024) to o3-mini (January 2025) and o3 (April 2025), skipping o2 in the public naming sequence. The most cited reason is potential trademark conflict with O2, the major UK telecommunications brand operated by Virgin Media O2. The gap has no technical significance (there is no unreleased o2 model) and follows a pattern of tech companies navigating naming conflicts by skipping version numbers.

Sources & further reading

  1. OpenAI: Introducing o3 and o4-mini (April 2025)
  2. OpenAI API Pricing
  3. ARC Prize Foundation — ARC-AGI-1 Leaderboard
  4. SWE-bench Verified Leaderboard (Princeton NLP)
  5. GPQA: A Graduate-Level Google-Proof Q&A Benchmark — Rein et al. (2023)
  6. OpenAI o3 System Card

Last reviewed May 07, 2026. AI Pulled News is editorial; corrections welcome at /news/about.html.