Frontier reasoning model. Ranks #1 on the Artificial Analysis (AA) Intelligence Index (60.24), tops Arena Elo at 1781, crosses 90% on ARC-AGI-2, and leads Terminal-Bench 2.0 at 82.7%.

| Benchmark | Source | Description | Score | Rank |
|---|---|---|---|---|
| MMMU | vals.ai | College-level multimodal reasoning across 30+ disciplines | 88.3% | #1 / 35 |
| Terminal | | Agentic terminal coding tasks requiring multi-step execution | 82.7% | #1 / 41 |
| OSWorld | | Computer use in real desktop environments | 78.7% | #1 / 8 |
| Arena Elo | Artificial Analysis | Human preference ranking via blind comparisons | 1781 | #1 / 46 |
| ARC-AGI | ARC Prize | Novel reasoning tasks requiring fluid intelligence | 90.6% | #1 / 23 |
| GPQA | | PhD-level science questions even experts struggle with | 93.5% | #3 / 58 |
| MMLU-Pro | vals.ai | Harder 10-option successor to MMLU; more reasoning-focused | 88.1% | #6 / 32 |
| LiveCodeBench | vals.ai | Contamination-free competitive programming (filtered by cutoff date) | 85.3% | #9 / 33 |