Flagship successor to V3 (Max Effort reasoning). Open-weight and competitive with closed frontier models at a fraction of the cost. Multimodal eval coverage is pending.

| Benchmark | Source | Description | Score | Rank |
|---|---|---|---|---|
| LiveCodeBench | vals.ai | Contamination-free competitive programming (filtered by cutoff date) | 87.5% | #5 / 40 |
| Arena Elo | Artificial Analysis | Human preference ranking via blind comparisons | 1553 | #5 / 52 |
| Terminal | Artificial Analysis | Agentic terminal coding tasks requiring multi-step execution | 46.2% | #14 / 48 |
| GPQA | Artificial Analysis | PhD-level science questions even experts struggle with | 88.8% | #15 / 64 |
| MMLU-Pro | vals.ai | Harder 10-option successor to MMLU; more reasoning-focused | 87.2% | #17 / 38 |