Largest dense model at launch; first to demonstrate chain-of-thought reasoning at scale
| Benchmark | Description | Score | Rank |
|---|---|---|---|
| HellaSwag | Common-sense reasoning about everyday situations | 83.4% | #34 / 36 |
| ARC-C | Grade-school science questions requiring reasoning | 84.6% | #38 / 40 |
| HumanEval | Coding ability: generating correct Python functions | 26.2% | #48 / 49 |
| MATH | Competition-level mathematics problems | 34.8% | #48 / 49 |
| MMLU | Knowledge across 57 subjects, from STEM to the humanities | 69.3% | #50 / 53 |