This is why I’m a bit skeptical of the o3 results. If it’s spending a lot of time reasoning, aren’t the chances higher that it simply regurgitates a solution it saw somewhere in its training data? It still needs to be clever enough to identify that as the correct answer, but that’s less impressive than producing an original solution.
I would guess that reasoning models generalize better (i.e., show a smaller discrepancy between performance on problems in the training set and problems outside it), but it would be very interesting to check.