
Every frontier model from 2020 to present
Chronological timeline of 67 frontier AI models from 12 providers. Click any model to see full benchmark scores, API pricing, and capabilities.
These models were released within the last few days. Third-party benchmark coverage is still incomplete, so the Singularity Index is withheld until AA, vals.ai, and ARC Prize finish their eval passes.
Top-tier GPT-5.5 variant. 98% on ARC-AGI-1, 90.4% on ARC-AGI-2. Awaiting full third-party eval coverage.
Open-weight 1T MoE (42B active) matching frontier intelligence at half the cost. Strong agentic and reasoning performance; awaiting coding eval coverage.
Flagship successor to V3 (Max Effort reasoning). Open-weight and competitive with the closed frontier at a fraction of the cost. Awaiting multimodal eval coverage.
Frontier reasoning model. #1 on AA Intelligence Index (60.24), tops Arena Elo at 1781, crosses 90% on ARC-AGI-2, and leads Terminal-Bench 2.0 at 82.7%.
Successor to K2.5 with stronger reasoning and a higher Arena Elo; trades some agentic score for benchmark gains
Top-tier Qwen variant positioned above Plus, highest Arena Elo in the family at 1511
1M context, stronger coding and vision. Its 64.3% on SWE-bench Pro beats GPT-5.4 and Gemini 3.1 Pro.
Meta's first proprietary (non-open-weight) model from Superintelligence Labs
Agentic coding focus, 1M context, strong multimodal performance
First mainline model to incorporate GPT-5.3-Codex coding capabilities. Native computer use, 1M context, surpasses the human baseline on OSWorld. The Codex branch ends here.
Highest GPQA Diamond score ever at 94.3%; doubled the ARC-AGI-2 score to 77.1%
Opus-level performance at one-fifth the cost, default model on claude.ai
Four specialized agents debate before answering, cutting hallucinations by 65%
Open-weight 397B MoE with visual agentic capabilities, 201 languages
Final specialized Codex release before coding was folded into mainline GPT-5.4. Peak scores on LiveCodeBench and Terminal-Bench.
1M context with parallel agent teams, #1 on Arena Elo at 1504
Top-performing Chinese open model, strong coding + reasoning
1T MoE with 100-agent swarm coordination, 99% on HumanEval
Reasoning-focused flagship, competitive with frontier Western models
Flash-tier model outperforming previous-gen Pro on most benchmarks
First model to score 100% on AIME 2025 and 80% on SWE-bench
Open-weight coding model, 72.2% on SWE-bench Verified at one-seventh the cost of Sonnet
Major leap for Mistral, closed the gap with US frontier labs
Matched GPT-5 class at a fraction of the cost, open-weight MoE
First model to break 80% on SWE-bench Verified
First model to break 1500 Arena Elo, 100% on AIME with code execution
#1 on Arena thinking mode, 1M context, strong agentic coding
Incremental upgrade with improved reliability and instruction following
Haiku-tier model matching Sonnet 4 on coding at one-third the cost
First Sonnet to score 100% on AIME, closed the Opus gap entirely
Unified reasoning and chat, 400K context, 94.6% on AIME 2025
Improved agentic reliability with better tool use and planning
Open-weight MoE that matched frontier closed models on coding
Deep reasoning model, strongest on abstract math at launch
First Flash model with thinking, near-Pro performance at one-eighth the price
Strongest agentic model at launch, sustained multi-hour autonomous coding
Matched Opus 4 on coding at one-fifth the price, 1M context window
Hybrid reasoning MoE trained on 36T tokens across 119 languages
Full reasoning model with tool use, first to break 80% on GPQA Diamond
First 10M token context window, 17B active params from a 109B MoE
First open MoE from Meta, natively multimodal with 1M context
Built-in thinking mode, native audio and video understanding
First hybrid reasoning model, extended thinking mode for complex problems
Matched frontier labs on reasoning, trained on a 200K H100 cluster
Reasoning at 75% lower cost than o1, made chain-of-thought economically viable
Open-weight reasoning model that triggered the DeepSeek market shock
Trained for $5.5M, proved frontier performance was possible at low cost
First reasoning model, uses chain-of-thought at inference time to solve hard problems
Near-frontier performance at flash pricing, native tool use and code execution
405B-class performance distilled into 70B parameters
Best coding model of 2024, dominated SWE-bench for months
First competitive Grok, trained on Colossus, xAI's 100K GPU cluster
European frontier model with strong multilingual and code performance
Largest open-weight model at 405B parameters, GPT-4 class performance
Natively multimodal with voice, 2x faster and 50% cheaper than GPT-4 Turbo
Open-weight model competitive with GPT-4 class, massive fine-tuning ecosystem
First Claude to match GPT-4, introduced the Opus/Sonnet/Haiku tier system
First 1M token context window, processed entire codebases in one pass
Google's first natively multimodal model, launched the Gemini brand
Open-weight model that kickstarted the open-source LLM ecosystem
First Claude with 100K context, established Anthropic as a frontier lab
Powered Bard and Google Workspace AI, strong multilingual performance
First multimodal GPT, passed the bar exam, defined the frontier for a year
Launched ChatGPT, the fastest consumer product to reach 100M users in history
Largest dense model at launch, first to show chain-of-thought reasoning at scale
Proved most LLMs were undertrained, reshaped scaling strategy industry-wide