1T MoE with 100-agent swarm coordination, 99% HumanEval
| Benchmark | Description | Score | Rank |
|---|---|---|---|
| HumanEval | Coding ability: generating correct Python functions | 99% | #1 / 50 |
| MATH | Competition-level mathematics problems | 98% | #6 / 50 |
| MMLU | Tests knowledge across 57 subjects from STEM to humanities | 92% | #11 / 54 |
| SWE-bench | Real-world GitHub issue resolution | 76.8% | #12 / 40 |
| Terminal | Agentic terminal coding tasks requiring multi-step execution | 50.8% | #12 / 48 |
| ARC-AGI | Novel reasoning tasks requiring fluid intelligence | 12% | #18 / 23 |
| Arena Elo | Human preference ranking via blind comparisons | 1450 | #23 / 52 |
| MMMU (Artificial Analysis) | College-level multimodal reasoning across 30+ disciplines | 75.4% | #24 / 39 |
| GPQA | PhD-level science questions even experts struggle with | 87.6% | #25 / 64 |