Reasoning-focused flagship, competitive with frontier Western models
| Benchmark | Source | Description | Score | Rank |
|---|---|---|---|---|
| MMLU-Pro | vals.ai | Harder 10-option successor to MMLU; more reasoning-focused | 87% | #19 / 38 |
| LiveCodeBench | vals.ai | Contamination-free competitive programming (filtered by cutoff date) | 81.8% | #22 / 40 |
| GPQA | Artificial Analysis | PhD-level science questions even experts struggle with | 83% | #36 / 64 |
| Terminal | Artificial Analysis | Agentic terminal coding tasks requiring multi-step execution | 28.8% | #36 / 48 |