OpenAI o3 and o4-mini: Record Benchmarks, $40/M Output, and the Rate-Limit Wall
In mid-April 2025, OpenAI released o3 and o4-mini into general API availability simultaneously. Both are reasoning models: they generate a hidden chain-of-thought before producing a final response, and both can invoke tools (web search, code execution, image analysis) during that reasoning phase rather than only before or after it. That in-reasoning tool use is the main thing that separates them from OpenAI's earlier reasoning models.
The headline benchmark numbers are real, independently reproducible, and record-setting as of their release date. Whether they justify the pricing and the access hurdles developers actually encounter is the harder question.
What OpenAI Released
o3 is the full-capability successor to o1 (September 2024) and o3-mini (January 2025). OpenAI skipped the "o2" name to avoid a conflict with O2, the UK telecommunications company. o4-mini is the companion model: same 200K-token context window, roughly 9× lower output cost, and full multimodal reasoning, meaning it can actively examine and reason about images within its thinking chain rather than merely accepting them as passive input context.
Both models accept structured function-calling, file attachments, web search, and code execution as tools available inside the reasoning phase. Prior agentic pipelines required developers to orchestrate retrieve-then-reason loops manually. With o3 and o4-mini, the retrieval decision can be delegated to the model itself within a single API call, which substantially reduces boilerplate for common agentic use cases.
Benchmark Results: The Numbers That Matter
OpenAI published evaluations across four well-known suites, all independently reproducible by researchers with API access:
- GPQA Diamond (198 graduate-level science questions, the hardest subset of the 448-question GPQA set, designed to stump experts outside their specialty): o3 scored 87.7% as of April 2025. Domain experts average roughly 65% on this set. Gemini 2.5 Pro held the prior frontier at 84.0% (March 2025); Claude 3.7 Sonnet was in the same range.
- AIME 2024 (American Invitational Mathematics Examination, 15-question competition math): o3 at 96.7% pass@1; o4-mini at approximately 93.4%. Both are near-ceiling scores on a fixed, publicly available test set drawn from a high-school invitational competition.
- SWE-bench Verified (500 real-world GitHub issues requiring working code fixes): o3 at 71.7%, surpassing Claude 3.7 Sonnet's 70.3% record set in February 2025. SWE-bench Verified requires resolving actual bugs in real repositories rather than generating plausible-looking patches against synthetic inputs, which makes it one of the higher-signal software engineering benchmarks currently available.
- ARC-AGI-1 (abstract visual reasoning, deliberately constructed to resist training-data memorization): o3 at 75.7% zero-shot. Human average is approximately 85%. This was the highest score any AI system had publicly achieved on this benchmark at time of release. ARC Prize organizers called it "notable" while emphasizing the remaining human-AI performance gap.
Pricing: The $40/M Output Reality
OpenAI set the following list prices at general availability in April 2025:
- o3: $10.00/M input tokens · $40.00/M output tokens · $2.50/M for cached input (75% discount)
- o4-mini: $1.10/M input tokens · $4.40/M output tokens · $0.275/M for cached input
At $40/M output, a single o3 response averaging 2,000 output tokens costs approximately $0.08. Scale that to 10,000 queries per day and output costs alone reach $800/day, before accounting for input tokens (cached or uncached) or the reasoning tokens the model generates internally but you never see. o4-mini's $4.40/M output brings those same 10,000 queries to roughly $88/day: meaningful but defensible for high-value automated workflows where inference cost is a small fraction of task value.
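The arithmetic in this paragraph can be reproduced with a short sketch. The function name daily_output_cost is illustrative, and the workload figures (2,000 output tokens, 10,000 queries per day) are the ones used in the example above. Only output tokens are counted, so input and hidden reasoning tokens would push a real bill higher.

```python
def daily_output_cost(price_per_million: float,
                      output_tokens_per_query: int,
                      queries_per_day: int) -> float:
    """Return the daily cost in dollars for output tokens only."""
    total_tokens = output_tokens_per_query * queries_per_day
    return total_tokens * price_per_million / 1_000_000


# List prices quoted in this article (April 2025), in dollars per million output tokens.
OUTPUT_PRICE = {"o3": 40.00, "o4-mini": 4.40}

for model, price in OUTPUT_PRICE.items():
    cost = daily_output_cost(price, output_tokens_per_query=2_000, queries_per_day=10_000)
    print(f"{model}: ${cost:,.2f}/day")
```

Swapping in your own average output length and query volume is usually the quickest way to see whether o3 is even in budget before any quality testing begins.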
ChatGPT Pro ($200/month flat rate) provides unlimited o3 access under fair-use guidelines for individuals. For API builders, o4-mini is the realistic production workhorse; o3 is best reserved for tasks where benchmark-level precision materially changes a business outcome, such as legal document analysis, complex scientific literature synthesis, or high-stakes code refactors.
The Rate-Limit Wall
Pricing is only half the access story. OpenAI gates API throughput behind a tiered spend-history system that new developers frequently underestimate:
- Tier 1 (new accounts, under $100 cumulative spend): o3 access is throttled heavily regardless of payment method. New API users cannot achieve meaningful production throughput on o3 even at published per-token prices.
- Tier 2–4 ($100–$1,000 cumulative spend): Progressively higher TPM and RPM ceilings; o3 unlocks later in the progression than GPT-4o-class models at equivalent spend tiers.
- Tier 5 (over $1,000 cumulative spend): Highest public limits; o4-mini reaches up to 10M tokens per minute and 10K requests per minute. o3 Tier 5 limits were not publicly confirmed across all configurations at launch.
The practical effect: teams building production systems on o3 must either accumulate API spend history over weeks or negotiate an enterprise agreement. The high per-token price acts as both a usage gate and a compute-funding mechanism: it limits who can achieve high throughput on the most capable model while generating the revenue to serve it at scale.
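To see how TPM and RPM ceilings interact, here is a minimal sketch that checks whether a planned workload fits under a given pair of limits. The RateLimit class and both helper functions are hypothetical names. The o4-mini Tier 5 figures come from the list above; the low-tier figures are invented purely for illustration and are not published OpenAI limits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateLimit:
    requests_per_minute: int
    tokens_per_minute: int


def max_requests_per_minute(limit: RateLimit, tokens_per_request: int) -> float:
    """Sustainable request rate: whichever ceiling (RPM or TPM) binds first."""
    return min(limit.requests_per_minute, limit.tokens_per_minute / tokens_per_request)


def fits_within(limit: RateLimit, peak_requests_per_minute: float,
                tokens_per_request: int) -> bool:
    """True if the peak load stays under both the RPM and the TPM ceiling."""
    return peak_requests_per_minute <= max_requests_per_minute(limit, tokens_per_request)


# From the article: o4-mini at Tier 5.
O4_MINI_TIER5 = RateLimit(requests_per_minute=10_000, tokens_per_minute=10_000_000)
# ASSUMPTION: made-up low-tier limits, for illustration only.
HYPOTHETICAL_LOW_TIER = RateLimit(requests_per_minute=500, tokens_per_minute=30_000)

for name, limit in [("o4-mini tier 5", O4_MINI_TIER5),
                    ("hypothetical low tier", HYPOTHETICAL_LOW_TIER)]:
    rate = max_requests_per_minute(limit, tokens_per_request=3_000)
    print(f"{name}: about {rate:,.0f} requests/min at 3,000 tokens each")
```

With reasoning models the token ceiling tends to bind well before the request ceiling, because hidden reasoning tokens inflate the per-request token count.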
Tool Use During Thinking: What It Actually Means in Practice
When o3 invokes a web search, that search result enters the hidden reasoning chain — the model reasons about retrieved content before synthesizing a final response. A single API call can trigger a real-time lookup, parse structured data, run a calculation, examine an image, and reason across all of it without explicit developer orchestration. For many agentic tasks, this replaces a multi-step pipeline with a single call.
The billing caveat is material: every token generated during the thinking phase is charged as output tokens at the same rate as the final visible response. A query that triggers two web searches and one code execution may accumulate three to ten times more output tokens than the final text suggests. Production deployments should log the usage.completion_tokens_details.reasoning_tokens field explicitly on every response to understand true per-call costs before committing to a pricing model.
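A minimal sketch of that logging step follows. It works on a plain dict shaped like the usage object named above, so no API call is made. The function name summarize_usage and the sample numbers are illustrative, and the sketch assumes that completion_tokens already includes the reasoning tokens, which is worth confirming against the current API reference.

```python
def summarize_usage(usage: dict, output_price_per_million: float) -> dict:
    """Split billed output tokens into visible and hidden reasoning tokens."""
    completion = usage["completion_tokens"]
    # The details block may be absent on models that do not report reasoning tokens.
    details = usage.get("completion_tokens_details") or {}
    reasoning = details.get("reasoning_tokens", 0)
    # ASSUMPTION: completion_tokens counts reasoning tokens plus the visible answer.
    visible = completion - reasoning
    return {
        "visible_tokens": visible,
        "reasoning_tokens": reasoning,
        "billed_output_tokens": completion,
        "output_cost_usd": completion * output_price_per_million / 1_000_000,
        "billed_to_visible_ratio": completion / visible if visible else None,
    }


# Illustrative numbers only, not taken from a real response.
sample_usage = {
    "prompt_tokens": 1_200,
    "completion_tokens": 6_000,
    "completion_tokens_details": {"reasoning_tokens": 5_000},
}

summary = summarize_usage(sample_usage, output_price_per_million=40.00)
print(summary)
```

Writing a summary like this to your logs on every call gives a per-request view of how much of the bill comes from tokens the user never sees.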
Competitive Snapshot (April 2025)
At launch, o3 and o4-mini faced three immediately relevant alternatives:
- Gemini 2.5 Pro (Google DeepMind, March 2025): 84.0% GPQA Diamond; $1.25/M input, $10/M output for prompts under 200K tokens. Significantly cheaper than o3 for comparable context length; topped LMArena human-preference rankings at its own launch date.
- Claude 3.7 Sonnet (Anthropic, February 2025): 70.3% SWE-bench Verified — the record o3 displaced at launch; $3/M input, $15/M output; 200K context; extended thinking with an explicit developer-controlled token budget, giving more cost predictability than o3's uncapped reasoning.
- Llama 4 Maverick (Meta, April 2025): Open weights under Meta's community license; strong multimodal performance; no per-token fee at the model layer for developers willing to self-host or use Meta's inference API endpoints, though self-hosting still carries hardware and serving costs.
o3 held a measurable benchmark lead on GPQA Diamond (87.7% vs 84.0% Gemini) and ARC-AGI-1 at launch. Whether that translates to real product-level quality improvement depends entirely on the specific task — benchmark leadership and deployment-level impact are correlated but not identical, and the pricing gap is wide enough that empirical testing on your actual use case is mandatory before committing to o3 in production.
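The size of that pricing gap can be sketched from the list prices quoted in this article. The workload below (1,000 input and 2,000 output tokens per query) is an assumption, and it simplifies by giving every model the same token counts, whereas real reasoning-token usage differs by model. Llama 4 Maverick is left out because no per-token price is quoted for it.

```python
# (input, output) list prices in dollars per million tokens, as quoted in this article.
PRICES = {
    "o3": (10.00, 40.00),
    "o4-mini": (1.10, 4.40),
    "Gemini 2.5 Pro": (1.25, 10.00),
    "Claude 3.7 Sonnet": (3.00, 15.00),
}


def cost_per_query(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one query at list price, ignoring caching discounts."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def relative_to(baseline: str, input_tokens: int, output_tokens: int) -> dict:
    """Each model's per-query cost as a fraction of the baseline model's cost."""
    base = cost_per_query(baseline, input_tokens, output_tokens)
    return {m: cost_per_query(m, input_tokens, output_tokens) / base for m in PRICES}


# ASSUMPTION: a hypothetical workload of 1,000 input and 2,000 output tokens.
ratios = relative_to("o3", input_tokens=1_000, output_tokens=2_000)
for model, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    cost = cost_per_query(model, 1_000, 2_000)
    print(f"{model}: ${cost:.4f} per query ({ratio:.0%} of o3)")
```

A comparison like this only frames the cost side; it says nothing about whether the cheaper model is good enough for the task, which is what testing on your own workload has to answer.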
Sources & further reading
- OpenAI: Introducing o3 and o4-mini (April 2025)
- OpenAI API Pricing
- ARC Prize Foundation — ARC-AGI-1 Leaderboard
- SWE-bench Verified Leaderboard (Princeton NLP)
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark — Rein et al. (2023)
- OpenAI o3 System Card
Last reviewed May 07, 2026. AI Pulled News is editorial; corrections welcome at /news/about.html.