I've tested old chats with the latest 4 and 4o models, and prompts that used to work zero-shot now sometimes fail outright (or at least only work if I carefully guide the model to the answer).
My old chats say they have been migrated to 4o, but I swear (can't confirm) that they perform better than a new 4o session. I haven't had time yet, but I want to compare the responses from those old chats side by side with the current 4o model.