
Measuring the race toward superintelligence
Standardized benchmarks are how we measure the march toward superintelligence. Each metric captures a different dimension of capability, from reasoning and coding to scientific knowledge and common sense. Watch the numbers climb.
SOURCES: model system cards, official benchmark reports, LMSYS Arena. UPDATED: 2026-02-06
Compare AI models across standardized benchmarks.
| MODEL | PROVIDER | DATE | MMLU | HumanEval | MATH | GPQA | ARC-C | HellaSwag | SWE-bench | Arena Elo | Terminal | OSWorld | ARC-AGI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 3 Pro (MM) | Google | Nov 25 | 93.2% | 96.2% | 100% | 91.9% | 98.8% | 97.9% | 76.2% | 1501 | — | — | 45.1% |
| Grok 4.1 (MM) | xAI | Nov 25 | 93.1% | 96.1% | 98.4% | 88% | 98.5% | 97.2% | 74.6% | 1483 | — | — | — |
| Kimi K2 (OS) | Moonshot | Jul 25 | 89.5% | 85.7% | 97.4% | 75.1% | — | — | 65.8% | 1473 | — | — | — |
| GPT-5.2 (MM) | OpenAI | Dec 25 | 93.8% | 96.9% | 100% | 93.2% | 98.9% | 98.2% | 80% | 1458 | — | — | 54.2% |
| Gemini 2.5 Pro (MM) | Google | Mar 25 | 90.8% | 93.2% | 89.5% | 84% | 97.8% | 96.2% | 59.6% | 1451 | — | — | — |
| Kimi K2.5 (MM, OS) | Moonshot | Jan 26 | — | — | — | 87.6% | — | — | 76.8% | 1450 | — | — | — |
| Claude Opus 4.5 (MM) | Anthropic | Nov 25 | 92.8% | 96.4% | 100% | 87% | 98.6% | 96.8% | 80.9% | 1445 | 59.8% | — | 37.6% |
| Gemini 3 Flash (MM) | Google | Dec 25 | 91.2% | 94.8% | 94.6% | 90.4% | 98.4% | 97.1% | 72% | 1428 | — | — | — |
| GPT-5.1 (MM) | OpenAI | Nov 25 | 92.4% | 95.6% | 97.8% | 88.1% | 98.4% | 97.8% | 76.3% | 1425 | — | — | — |
| DeepSeek V3.2 (OS) | DeepSeek | Dec 25 | 92.6% | 95.4% | 96.8% | 86.2% | 98.2% | 96.8% | 72.8% | 1418 | — | — | — |
| Claude Sonnet 4.5 (MM) | Anthropic | Sep 25 | 91.5% | 95.8% | 100% | 83.4% | 98.2% | 95.1% | 77.2% | 1412 | — | — | — |
| Grok 4 (MM) | xAI | Jul 25 | 92.4% | 95.2% | 94% | 88% | 98.1% | 96.4% | 68.4% | 1408 | — | — | — |
| Grok 3 (MM) | xAI | Feb 25 | 91.2% | 93.5% | 93.3% | 84.6% | 97.5% | — | 47.8% | 1402 | — | — | — |
| GPT-5 (MM) | OpenAI | Aug 25 | 91.2% | 94.8% | 94.6% | 88.4% | 98.1% | 97.2% | 74.9% | 1398 | — | — | — |
| Claude Opus 4.1 (MM) | Anthropic | Aug 25 | 89.5% | 94.5% | 88.4% | 76.2% | 97.6% | 92.8% | 74.5% | 1372 | — | — | — |
| Claude Sonnet 4 (MM) | Anthropic | May 25 | 89.2% | 94.1% | 86.2% | 74.8% | 97.5% | 92.4% | 72.7% | 1368 | — | — | — |
| o3-mini | OpenAI | Jan 25 | 87.5% | 93.1% | 97.3% | 77% | — | — | 49.3% | 1361 | — | — | — |
| DeepSeek R1 (OS) | DeepSeek | Jan 25 | 90.8% | 92.8% | 97.3% | 71.5% | 97.1% | — | 49.2% | 1358 | — | — | — |
| Mistral Large 3 | Mistral | Dec 25 | 89.8% | 93.6% | 85.4% | 72.1% | 97.2% | 94.8% | 58.6% | 1352 | — | — | — |
| o1 | OpenAI | Dec 24 | 91.8% | 92.4% | 96.4% | 78% | 97.8% | — | 48.9% | 1350 | — | — | — |
| Claude Opus 4 (MM) | Anthropic | May 25 | 88.8% | 93.2% | 84.6% | 72.4% | 97.2% | 91.2% | 72.5% | 1342 | — | — | — |
| Llama 4 Maverick (MM, OS) | Meta | Apr 25 | 89.4% | 91.8% | 82.6% | 68.2% | 97.1% | 93.4% | 52.4% | 1328 | — | — | — |
| DeepSeek V3 (OS) | DeepSeek | Dec 24 | 88.5% | 92.1% | 90.2% | 59.1% | 96.8% | 88.9% | 42% | 1318 | — | — | — |
| Claude 3.7 Sonnet (MM) | Anthropic | Feb 25 | 86.1% | 93.7% | 96.2% | 78.2% | 96.7% | 89% | 62.3% | 1310 | — | — | — |
| Gemini 2.0 Flash (MM) | Google | Dec 24 | 88.1% | 89.5% | 77.2% | 62.1% | 96.2% | 94.1% | — | 1290 | — | — | — |
| GPT-4o (MM) | OpenAI | May 24 | 88.7% | 90.2% | 76.6% | 53.6% | 96.7% | 95.3% | 38.4% | 1285 | — | — | — |
| Claude 3.5 Sonnet (MM) | Anthropic | Oct 24 | 88.7% | 93.7% | 78.3% | 65% | 96.7% | 89% | 49% | 1280 | — | — | — |
| Gemini 1.5 Pro (MM) | Google | Feb 24 | 85.9% | 84.1% | 67.7% | 46.2% | 94.4% | 92.5% | — | 1260 | — | — | — |
| Grok 2 (MM) | xAI | Aug 24 | 87.5% | 88.4% | 76.1% | 56% | 96.4% | — | — | 1256 | — | — | — |
| Claude 3 Opus (MM) | Anthropic | Mar 24 | 86.8% | 84.9% | 60.1% | 50.4% | 96.4% | 95.4% | — | 1248 | — | — | — |
| Llama 3.3 70B (OS) | Meta | Dec 24 | 86% | 88.4% | 77% | 49% | 94.8% | 86.2% | — | 1247 | — | — | — |
| Llama 3.1 405B (OS) | Meta | Jul 24 | 88.6% | 89% | 73.8% | 51.1% | 96.9% | 89.2% | — | 1221 | — | — | — |
| Llama 3 70B (OS) | Meta | Apr 24 | 82% | 81.7% | 50.4% | 41.2% | 93% | 88% | — | 1208 | — | — | — |
| Mistral Large 2 | Mistral | Jul 24 | 84% | 92.1% | 69.2% | 46.3% | 94.2% | 89.4% | — | 1178 | — | — | — |
| GPT-5.3 Codex | OpenAI | Feb 26 | — | — | — | — | — | — | — | — | 77.3% | 64.7% | — |
| Claude Opus 4.6 (MM) | Anthropic | Feb 26 | — | — | — | 91.3% | — | — | 80.8% | — | 65.4% | 72.7% | 68.8% |
| o3 | OpenAI | Apr 25 | — | — | 91.6% | 83.3% | — | — | 69.1% | — | — | — | — |
| Gemini 1.0 Pro (MM) | Google | Dec 23 | 83.7% | 74.3% | 58.6% | 42.1% | 93.8% | 92.1% | — | — | — | — | — |
| Llama 2 70B (OS) | Meta | Jul 23 | 68.9% | 48.8% | 25.4% | — | 85.3% | 85.9% | — | — | — | — | — |
| Claude 2 | Anthropic | Jul 23 | 78.5% | 71.2% | 42.6% | — | 93.2% | 89.1% | — | — | — | — | — |
| PaLM 2 | Google | May 23 | 86.1% | 64.5% | 48.8% | — | 95.1% | 93.4% | — | — | — | — | — |
| GPT-4 | OpenAI | Mar 23 | 86.4% | 67% | 52.9% | 35.7% | 96.3% | 95.3% | — | — | — | — | — |
| GPT-3.5 Turbo | OpenAI | Nov 22 | 70% | 48.1% | 35.2% | — | 85.2% | 85.5% | — | — | — | — | — |
| PaLM 540B | Google | Apr 22 | 69.3% | 26.2% | 34.8% | — | 84.6% | 83.4% | — | — | — | — | — |
| Chinchilla 70B | DeepMind | Mar 22 | 67.5% | 31.8% | — | — | 83.7% | 82.3% | — | — | — | — | — |
| GPT-3 175B | OpenAI | Jun 20 | 43.9% | 19.1% | — | — | 51.4% | 78.9% | — | — | — | — | — |
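
Many cells in the table are blank placeholders, so averaging a model's row or comparing two rows column by column can mislead unless missing scores are skipped. The following is a minimal sketch, not part of the source page, of one way to compare two rows only on the benchmarks both models report. The helper names (`parse_cell`, `parse_row`, `shared_gaps`) are hypothetical, and the two row strings are copied from the table with the model-name badges omitted.

```python
from typing import Dict, Optional, Tuple

# Score columns in the order used by the table header.
COLUMNS = ["MMLU", "HumanEval", "MATH", "GPQA", "ARC-C", "HellaSwag",
           "SWE-bench", "Arena Elo", "Terminal", "OSWorld", "ARC-AGI"]

Scores = Dict[str, Optional[float]]


def parse_cell(cell: str) -> Optional[float]:
    """Turn '93.8%' or '1458' into a float; a dash or empty cell means missing."""
    cell = cell.strip()
    if cell in ("", "—", "-"):
        return None
    return float(cell.rstrip("%"))


def parse_row(row: str) -> Tuple[str, Scores]:
    """Split one pipe-delimited row into (model name, {column: score})."""
    cells = [c.strip() for c in row.strip().strip("|").split("|")]
    name, score_cells = cells[0], cells[3:]  # skip provider and date
    return name, dict(zip(COLUMNS, map(parse_cell, score_cells)))


def shared_gaps(a: Scores, b: Scores) -> Dict[str, float]:
    """Return a - b for every column where both models have a score."""
    return {
        col: round(a[col] - b[col], 1)
        for col in COLUMNS
        if a[col] is not None and b[col] is not None
    }


gpt_row = ("GPT-5.2 | OpenAI | Dec 25 | 93.8% | 96.9% | 100% | 93.2% | 98.9% "
           "| 98.2% | 80% | 1458 | — | — | 54.2%")
opus_row = ("Claude Opus 4.5 | Anthropic | Nov 25 | 92.8% | 96.4% | 100% | 87% "
            "| 98.6% | 96.8% | 80.9% | 1445 | 59.8% | — | 37.6%")

_, gpt = parse_row(gpt_row)
_, opus = parse_row(opus_row)
gaps = shared_gaps(gpt, opus)

# Percentage columns differ in percentage points; Arena Elo differs in rating points.
for column, gap in gaps.items():
    print(f"{column}: {gap:+.1f}")
```

In this sketch the Terminal and OSWorld columns drop out of the comparison because at least one of the two models has no reported score there.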
MMLU: Tests knowledge across 57 subjects, from STEM to the humanities
HumanEval: Coding ability, measured by generating correct Python functions
MATH: Competition-level mathematics problems
GPQA: PhD-level science questions that even experts struggle with
ARC-C: Grade-school science questions requiring reasoning
HellaSwag: Commonsense reasoning about everyday situations
SWE-bench: Real-world GitHub issue resolution
OSWorld: Computer use in real desktop environments
ARC-AGI: Novel reasoning tasks requiring fluid intelligence
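
Unlike the percentage benchmarks described above, the Arena Elo column is a rating, so gaps between models are easier to read as expected head-to-head preference rates. The sketch below assumes the column follows the conventional Elo scale (logistic curve, base 10, 400-point divisor). The page does not state how the ratings are computed, so treat the formula as an illustrative assumption; the function name `elo_expected_score` is hypothetical.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the conventional Elo formula.

    Assumption: ratings are on the standard 400-point, base-10 logistic scale.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings taken from the table: Gemini 3 Pro (1501) vs. Grok 4.1 (1483).
p = elo_expected_score(1501, 1483)
print(f"Expected preference rate for the higher-rated model: {p:.3f}")  # about 0.53
```

Under that assumption, an 18-point gap corresponds to only a slight edge, so small differences near the top of the Elo column should not be over-read.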