Flagship successor to V3 (Max Effort reasoning). Open-weight, competitive with closed frontier models at a fraction of the cost. Agentic and multimodal eval coverage is still pending.
| Benchmark | Source | Description | Score | Rank |
|---|---|---|---|---|
| LiveCodeBench | vals.ai | Contamination-free competitive programming (filtered by cutoff date) | 87.5% | #3 / 33 |
| Arena Elo | Artificial Analysis | Human preference ranking via blind comparisons | 1553 | #3 / 46 |
| GPQA | Artificial Analysis | PhD-level science questions even experts struggle with | 88.8% | #11 / 58 |
| MMLU-Pro | vals.ai | Harder 10-option successor to MMLU; more reasoning-focused | 87.2% | #13 / 32 |