I wonder how well the latest Claude 3.5 Sonnet does on this benchmark and if it's near o1.
| Name | Semi-private eval | Public eval |
|--------------------------------------|-------------------|-------------|
| Jeremy Berman | 53.6% | 58.5% |
| Akyürek et al. | 47.5% | 62.8% |
| Ryan Greenblatt | 43% | 42% |
| OpenAI o1-preview (pass@1) | 18% | 21% |
| Anthropic Claude 3.5 Sonnet (pass@1) | 14% | 21% |
| OpenAI GPT-4o (pass@1) | 5% | 9% |
| Google Gemini 1.5 (pass@1) | 4.5% | 8% |
https://arxiv.org/pdf/2412.04604

| Name               | Score (semi-private eval) | Score (public eval) |
|--------------------|---------------------------|---------------------|
| o3 (coming soon)   | 75.7%                     | 82.8%               |
| o1-preview         | 18%                       | 21%                 |
| Claude 3.5 Sonnet  | 14%                       | 21%                 |
| GPT-4o             | 5%                        | 9%                  |
| Gemini 1.5         | 4.5%                      | 8%                  |

That being said, the fact that this is not a "raw" base model, but one tuned on the ARC-AGI test distribution, takes away from the impressiveness of the result. How much? I'm not sure; we'd need the score of the un-tuned base o3 model to judge that.
In the meantime, comparing this tuned o3 model against other un-tuned base models is unfair (an apples-to-oranges comparison).