Flash-tier model with near-frontier reasoning: GPQA 92.2%, ARC-AGI-1 96%. Artificial Analysis (AA) Intelligence Index: 55.3, top of the cost-efficient tier.

| Benchmark | Source | Description | Score | Rank |
|---|---|---|---|---|
| MMMU | vals.ai | College-level multimodal reasoning across 30+ disciplines | 88.3% | #2 / 39 |
| Arena Elo | Artificial Analysis | Human preference ranking via blind comparisons | 1655 | #4 / 52 |
| LiveCodeBench | vals.ai | Contamination-free competitive programming (filtered by cutoff date) | 87.6% | #4 / 40 |
| ARC-AGI | ARC Prize | Novel reasoning tasks requiring fluid intelligence | 72.1% | #5 / 23 |
| MMLU-Pro | vals.ai | Harder 10-option successor to MMLU; more reasoning-focused | 89.5% | #5 / 38 |
| GPQA | Artificial Analysis | PhD-level science questions even experts struggle with | 92.2% | #8 / 64 |
| Terminal | Artificial Analysis | Agentic terminal coding tasks requiring multi-step execution | 40.9% | #23 / 48 |