Successor to K2.5 with stronger reasoning and a higher Arena Elo; trades some agentic performance for benchmark gains.
| Benchmark | Source | Description | Score | Rank |
|---|---|---|---|---|
| LiveCodeBench | vals.ai | Contamination-free competitive programming (filtered by cutoff date) | 86.8% | #8 / 40 |
| GPQA | Artificial Analysis | PhD-level science questions even experts struggle with | 91.1% | #12 / 64 |
| MMLU-Pro | vals.ai | Harder 10-option successor to MMLU; more reasoning-focused | 87.6% | #12 / 38 |
| Arena Elo | Artificial Analysis | Human preference ranking via blind comparisons | 1483 | #15 / 52 |
| MMMU | Artificial Analysis | College-level multimodal reasoning across 30+ disciplines | 79.4% | #17 / 39 |
| Terminal | Artificial Analysis | Agentic terminal coding tasks requiring multi-step execution | 43.9% | #17 / 48 |