I tried it both ways: with individual prompts and with prompts in bulk. I ran both tests the same way. There's a tradeoff between writing a legible/interesting blog post and relating step by step how the evaluation was run! Appreciate you reading and the feedback :)
Based on that, it looks like the author asked all 25 test puzzles in one big prompt, which one supposes would favor larger models. To compare "puzzle solving", you'd think it would make more sense to ask one puzzle at a time?