First multimodal GPT, passed the bar exam, defined the frontier for a year
| Benchmark | Description | Score | Rank |
|---|---|---|---|
| HellaSwag | Common sense reasoning about everyday situations | 95.3% | #12 / 36 |
| ARC-C | Grade-school science questions requiring reasoning | 96.3% | #27 / 40 |
| MMLU | Tests knowledge across 57 subjects from STEM to humanities | 86.4% | #38 / 53 |
| MATH | Competition-level mathematics problems | 52.9% | #42 / 49 |
| HumanEval | Coding ability: generating correct Python functions | 67% | #43 / 49 |
| GPQA | PhD-level science questions even experts struggle with | 35.7% | #53 / 54 |