Long thinking seems to be a marketing term without a clear definition, applicable only to the opaque chat frontends. If you give Anthropic models a hard problem and set the thinking budget high via the API, they do plenty of reasoning, and the CoT helps a lot with debugging. With Gemini and OpenAI you can't debug, as the summaries tell you effectively nothing about why the model is giving a wrong answer or going off the rails when it does som