Successor to K2.5 with stronger reasoning and a higher Arena Elo; trades some agentic performance for gains on other benchmarks.
| Benchmark | Score | Rank |
|---|---|---|
| GPQA (Artificial Analysis): PhD-level science questions even experts struggle with | 91.1% | #8 / 56 |
| Arena Elo (Artificial Analysis): Human preference ranking via blind comparisons | 1483 | #8 / 44 |
| MMMU (Artificial Analysis): College-level multimodal reasoning across 30+ disciplines | 79.4% | #13 / 34 |
| Terminal (Artificial Analysis): Agentic terminal coding tasks requiring multi-step execution | 43.9% | #13 / 40 |