I mean, given that o1-preview sometimes takes a minute to answer, I'd imagine they could append the prompt "Write a program and run it as well" so it can double-check itself. It seems like they just don't trust themselves enough to run the code it generates, even sandboxed.
Yeah, because now that we've all been asking about it, that answer is in its training data. The trick with LLMs is always "is the answer in the training data?"
I think it'd just be too expensive to incorporate code-writing into CoT. Maybe once they can combine models of different sizes within a single answer, it'll work out.