
Every frontier model from 2020 to present
Chronological timeline of 74 frontier AI models from 11 providers.
Successor to GLM 5.1. AA Intelligence Index 50.7, GPQA 89.5, Terminal-Bench 50.8 — open-weight reasoning gains over 5.1.
First public Mythos-class model — a tier above Opus. Highest AA Intelligence Index ever (64.9), Arena Elo 1932, SWE-bench Pro 80.3. Stripe compressed 2+ months of work into 1 day on a 50M-line codebase.
Frontier-tier successor to M2.7. AA Intelligence Index 54.7, GPQA 92.9, Arena 1668 — continues the open-weight cost-disruption pattern.
New frontier leader. SWE-bench Pro 69.2% tops GPT-5.5 and Gemini 3.1 Pro; ~4x less likely to let code flaws pass unremarked. Same pricing as 4.7, fast mode 2.5x speed.
Flash-tier model with near-frontier reasoning: GPQA 92.2, ARC-AGI-1 96%. AA Intelligence Index 55.3 — top of the cost-efficient tier.
Frontier-tier Qwen flagship. AA Intelligence Index 56.6, GPQA 92.3, Terminal-Bench 50.8 — closes the gap to Western frontier at lower cost.
Flagship successor to V3 (Max Effort reasoning). Open-weight, competitive with closed frontier models at a fraction of the cost. Awaiting multimodal eval coverage.
Frontier reasoning model. #1 on AA Intelligence Index (60.24), tops Arena Elo at 1781, crosses 90% on ARC-AGI-2, and leads Terminal-Bench 2.0 at 82.7%.
Successor to K2.5 with stronger reasoning and Arena Elo; trades some agentic score for benchmark gains
Top-tier Qwen variant above Plus, highest Arena Elo in the family at 1511
SuperGrok Heavy beta with native video understanding. AA now tracks it with full per-benchmark coverage; #2 globally on the initial AA pass.
1M context, stronger coding and vision. SWE-bench Pro 64.3% beats GPT-5.4 and Gemini 3.1 Pro.
Meta's first proprietary model from Superintelligence Labs, non-open-weight
Apache 2.0 dense 31B from the Gemma 4 family. Native vision + audio, 140+ languages. Best open-weight from Google Q2 2026.
Agentic coding focus, 1M context, strong multimodal performance
Two-version jump from M2.1 (skipping intermediate M2.5). AA Intelligence Index 49.6 — top-tier Chinese open-weight reasoning at half the cost of Western frontier.
First mainline model incorporating GPT-5.3-Codex coding capabilities. Native computer use, 1M context, surpasses human baseline on OSWorld. Codex branch ends here.
Highest GPQA Diamond score ever at 94.3%, doubled ARC-AGI-2 to 77.1%
Opus-level performance at one-fifth the cost, default model on claude.ai
Four specialized agents debate before answering, 65% hallucination reduction
Open-weight 397B MoE with visual agentic capabilities, 201 languages
Final specialized Codex release before coding was folded into mainline GPT-5.4. Peak on LiveCodeBench and Terminal-Bench.
1M context with parallel agent teams, #1 on Arena Elo at 1504
Top-performing Chinese open model, strong coding + reasoning
1T MoE with 100-agent swarm coordination, 99% HumanEval
Reasoning-focused flagship, competitive with frontier Western models
Flash-tier model outperforming previous-gen Pro on most benchmarks
First model to score 100% on AIME 2025 and 80% on SWE-bench
Open-weight coding model, 72.2% SWE-bench Verified at 7x lower cost than Sonnet
Major leap for Mistral, closed the gap with US frontier labs
Matched GPT-5 class at a fraction of the cost, open-weight MoE
First model to break 80% on SWE-bench Verified
First model to break 1500 Arena Elo, 100% on AIME with code execution
#1 on Arena thinking mode, 1M context, strong agentic coding
Incremental upgrade with improved reliability and instruction following
Haiku-tier model matching Sonnet 4 on coding at one-third the cost
First Sonnet to score 100% on AIME, closed the Opus gap entirely
Unified reasoning and chat, 400K context, 94.6% on AIME 2025
Improved agentic reliability with better tool use and planning
Open-weight MoE that matched frontier closed models on coding
Deep reasoning model, strongest on abstract math at launch
First Flash model with thinking, near-Pro performance at one-eighth the price
Strongest agentic model at launch, sustained multi-hour autonomous coding
Matched Opus 4 on coding at one-fifth the price, 1M context window
Hybrid reasoning MoE trained on 36T tokens across 119 languages
Full reasoning model with tool use, first to break 80% on GPQA Diamond
First 10M-token context window, 17B active params from a 109B MoE
First open MoE from Meta, natively multimodal with 1M context
Built-in thinking mode, native audio and video understanding
First hybrid reasoning model, extended thinking mode for complex problems
Matched frontier labs on reasoning, trained on 200K H100 cluster
Reasoning at 75% lower cost than o1, made chain-of-thought economically viable
Open-weight reasoning model that triggered the DeepSeek market shock
Trained for $5.5M, proved frontier performance was possible at low cost
First reasoning model, uses chain-of-thought at inference time to solve hard problems
Near-frontier performance at flash pricing, native tool use and code execution
405B-class performance distilled into 70B parameters
Best coding model of 2024, dominated SWE-bench for months
Trained on Colossus, xAI's 100K GPU cluster, first competitive Grok
European frontier model with strong multilingual and code performance
Largest open-weight model at 405B parameters, GPT-4 class performance
Natively multimodal with voice, 2x faster and 50% cheaper than GPT-4 Turbo
Open-weight model competitive with GPT-4 class, massive fine-tuning ecosystem
First Claude to match GPT-4, introduced the Opus/Sonnet/Haiku tier system
First 1M-token context window, processed entire codebases in one pass
Google's first natively multimodal model, launched the Gemini brand
Open-weight model that kickstarted the open-source LLM ecosystem
First Claude with 100K context, established Anthropic as a frontier lab
Powered Bard and Google Workspace AI, strong multilingual performance
First multimodal GPT, passed the bar exam, defined the frontier for a year
Launched ChatGPT, fastest consumer product to 100M users in history
Largest dense model at launch, first to show chain-of-thought reasoning at scale
Proved most LLMs were undertrained, reshaped scaling strategy industry-wide