I wonder how well the latest Claude 3.5 Sonnet does on this benchmark and if it's near o1.

| Name                                 | Semi-private eval | Public eval |
|--------------------------------------|-------------------|-------------|
| Jeremy Berman                        | 53.6%             | 58.5%       |
| Akyürek et al.                       | 47.5%             | 62.8%       |
| Ryan Greenblatt                      | 43%               | 42%         |
| OpenAI o1-preview (pass@1)           | 18%               | 21%         |
| Anthropic Claude 3.5 Sonnet (pass@1) | 14%               | 21%         |
| OpenAI GPT-4o (pass@1)               | 5%                | 9%          |
| Google Gemini 1.5 (pass@1)           | 4.5%              | 8%          |
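
| Name             | Semi-private eval | Public eval |
|------------------|-------------------|-------------|
| o3 (coming soon) | 75.7%             | 82.8%       |
| o1-preview       | 18%               | 21%         |
| Claude 3.5 Sonnet | 14%              | 21%         |
| GPT-4o           | 5%                | 9%          |
| Gemini 1.5       | 4.5%              | 8%          |
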
[1]: https://arcprize.org/2024-results