#1 on Arena (thinking mode), 1M context, strong agentic coding
| Benchmark | Description | Score | Rank |
|---|---|---|---|
| MMLU | Tests knowledge across 57 subjects, from STEM to humanities | 93.1% | #3 / 53 |
| ARC-C | Grade-school science questions requiring reasoning | 98.5% | #4 / 40 |
| HumanEval | Coding ability: generating correct Python functions | 96.1% | #5 / 49 |
| MATH | Competition-level mathematics problems | 98.4% | #5 / 49 |
| HellaSwag | Common-sense reasoning about everyday situations | 97.2% | #5 / 36 |
| Arena Elo | Human preference ranking via blind comparisons | 1483 | #5 / 41 |
| GPQA | PhD-level science questions that even experts struggle with | 88% | #15 / 54 |
| SWE-bench | Real-world GitHub issue resolution | 74.6% | #16 / 38 |
| LiveCodeBench (vals.ai) | Contamination-free competitive programming (filtered by cutoff date) | 80.6% | #16 / 31 |
| MMLU-Pro (vals.ai) | Harder 10-option successor to MMLU; more reasoning-focused | 84.2% | #18 / 30 |
| MMMU (vals.ai) | College-level multimodal reasoning across 30+ disciplines | 72.7% | #26 / 33 |
| Terminal (Artificial Analysis) | Agentic terminal coding tasks requiring multi-step execution | 24.2% | #27 / 37 |
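
The Arena Elo row differs from the others: it is a rating fit from blind pairwise human votes, not a percentage. As a rough illustration of how such ratings move, here is a minimal sketch of the classic Elo update; the K-factor and the concrete ratings are assumptions for the example, and Chatbot Arena's published methodology has used Bradley-Terry model fitting rather than this simple online update.

```python
# Illustrative sketch only: how blind head-to-head votes can be turned
# into an Elo-style rating like the Arena Elo score above.
# K=4 and the starting ratings are assumptions, not Arena's actual config.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 4.0):
    """Update both ratings after one blind pairwise comparison."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Example: a 1483-rated model beats a 1400-rated one.
r_a, r_b = elo_update(1483, 1400, a_won=True)
print(round(r_a, 1), round(r_b, 1))  # ~1484.5 1398.5
```

Running the example shifts each rating by only about 1.5 points: a win that was already expected moves the numbers very little, which is why gaps of tens of Elo points on the leaderboard reflect a consistent preference across many votes.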