Every frontier model from 2020 to present
Chronological timeline of 51 frontier AI models from 9 providers. Click any model to see full benchmark scores, API pricing, and capabilities.
Native computer use, 1M context, first to surpass human baseline on OSWorld
Highest GPQA Diamond score ever at 94.3%, doubled the ARC-AGI-2 record to 77.1%
Opus-level performance at one-fifth the cost, default model on claude.ai
Four specialized agents debate before answering, 65% hallucination reduction
Open-weight 397B MoE with visual agentic capabilities, supports 201 languages
First AI model instrumental in building its own successor
1M context with parallel agent teams, #1 on Arena Elo at 1504
1T MoE with 100-agent swarm coordination, 99% HumanEval
First model to score 100% on AIME 2025 and 80% on SWE-bench
Flash-tier model outperforming previous-gen Pro on most benchmarks
Major leap for Mistral, closed the gap with US frontier labs
Matched GPT-5 class at a fraction of the cost, open-weight MoE
First model to break 80% on SWE-bench Verified
First model to break 1500 Arena Elo, 100% on AIME with code execution
#1 on Arena in thinking mode, 1M context, strong agentic coding
Incremental upgrade with improved reliability and instruction following
First Sonnet to score 100% on AIME, closed the Opus gap entirely
Unified reasoning and chat, 400K context, first model to score 90%+ on AIME
Improved agentic reliability with better tool use and planning
Open-weight MoE that matched frontier closed models on coding
Deep reasoning model, strongest on abstract math at launch
Strongest agentic model at launch, sustained multi-hour autonomous coding
Matched Opus 4 on coding at one-fifth the price, 1M context window
Full reasoning model with tool use, first to break 80% on GPQA Diamond
First open MoE from Meta, natively multimodal with 1M context
Built-in thinking mode, native audio and video understanding
First hybrid reasoning model, extended thinking mode for complex problems
Matched frontier labs on reasoning, trained on 200K H100 cluster
Reasoning at 75% lower cost than o1, made chain-of-thought economically viable
Open-weight reasoning model that triggered the DeepSeek market shock
Trained for $5.5M, proved frontier performance was possible at low cost
First reasoning model, uses chain-of-thought at inference time to solve hard problems
Near-frontier performance at flash pricing, native tool use and code execution
405B-class performance distilled into 70B parameters
Best coding model of 2024, dominated SWE-bench for months
Trained on Colossus, xAI's 100K GPU cluster, first competitive Grok
European frontier model with strong multilingual and code performance
Largest open-weight model at 405B parameters, GPT-4 class performance
Natively multimodal with voice, 2x faster and 50% cheaper than GPT-4 Turbo
Open-weight model competitive with GPT-4 class, massive fine-tuning ecosystem
First Claude to match GPT-4, introduced the Opus/Sonnet/Haiku tier system
First 1M token context window, processed entire codebases in one pass
Google's first natively multimodal model, launched the Gemini brand
Open-weight model that kickstarted the open-source LLM ecosystem
First Claude with 100K context, established Anthropic as a frontier lab
Powered Bard and Google Workspace AI, strong multilingual performance
First multimodal GPT, passed the bar exam, defined the frontier for a year
Launched ChatGPT, fastest consumer product to 100M users in history
Largest dense model at launch, first to show chain-of-thought reasoning at scale
Proved most LLMs were undertrained, reshaped scaling strategy industry-wide
First large language model to demonstrate emergent few-shot learning