First multimodal GPT, passed the bar exam, defined the frontier for a year
| Benchmark | Description | Score | Rank |
|---|---|---|---|
| HellaSwag | Common-sense reasoning about everyday situations | 95.3% | #12 / 36 |
| ARC-C | Grade-school science questions requiring reasoning | 96.3% | #27 / 40 |
| MMLU | Tests knowledge across 57 subjects from STEM to humanities | 86.4% | #39 / 54 |
| MATH | Competition-level mathematics problems | 52.9% | #43 / 50 |
| HumanEval | Coding ability: generating correct Python functions | 67% | #44 / 50 |
| GPQA | PhD-level science questions even experts struggle with | 35.7% | #63 / 64 |