A 1T-parameter MoE model with 100-agent swarm coordination; 99% on HumanEval.
| Benchmark | Description | Score | Rank |
|---|---|---|---|
| HumanEval | Coding ability: generating correct Python functions | 99% | #1 / 49 |
| MATH | Competition-level mathematics problems | 98% | #6 / 49 |
| Terminal | Agentic terminal coding tasks requiring multi-step execution | 50.8% | #9 / 37 |
| SWE-bench | Real-world GitHub issue resolution | 76.8% | #10 / 38 |
| MMLU | Knowledge across 57 subjects, from STEM to the humanities | 92% | #11 / 53 |
| Arena Elo | Human preference ranking via blind comparisons | 1450 | #13 / 41 |
| ARC-AGI | Novel reasoning tasks requiring fluid intelligence | 12% | #16 / 21 |
| GPQA | PhD-level science questions that even experts struggle with | 87.6% | #17 / 54 |
| MMMU | College-level multimodal reasoning across 30+ disciplines | 75.4% | #19 / 33 |