I built my own benchmark of very basic questions, and Claude 4.6 actually scores worse than the free Stepfun 3.5 version: https://aibenchy.com
It is smart, but it sometimes fails at basic instruction following.
I remember this has been a Claude thing for quite a while: I kept trying to make it output just JSON (without structured output), and it kept adding quotes or newlines.
After looking into it more, Claude DOES give the correct answer, just not in the format it was asked for. It always adds more info at the end, even when asked to give only the answer...
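To make the "right answer, wrong format" distinction concrete, here is a rough sketch of how a harness could tell output that is exactly JSON apart from output where the JSON is merely recoverable. The function names and the three labels are made up for illustration, and this is not how aibenchy.com grades anything.

```python
import json
import re


def parse_lenient(raw: str):
    """Try to recover a JSON value from output that has extra wrapping.

    Handles a markdown code fence, stray quotes, and trailing chatter.
    Hypothetical helper, written only to illustrate the failure mode.
    """
    text = raw.strip()

    # If the model wrapped the JSON in a ```json fence, keep only the inside.
    fence = re.match(r"^```(?:json)?\s*\n(.*?)\n```", text, re.DOTALL)
    if fence:
        text = fence.group(1)

    # Start at the first object or array opener and ignore anything before it.
    starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
    if not starts:
        raise ValueError("no JSON object or array found")

    # raw_decode parses one JSON value and ignores whatever follows it.
    value, _end = json.JSONDecoder().raw_decode(text[min(starts):])
    return value


def classify(raw: str) -> str:
    """Label output as 'exact', 'recoverable', or 'invalid' (labels are assumptions)."""
    try:
        # json.loads tolerates surrounding whitespace but nothing else.
        json.loads(raw)
        return "exact"
    except ValueError:
        pass
    try:
        parse_lenient(raw)
        return "recoverable"
    except ValueError:
        return "invalid"
```

With this split, a reply like a fenced JSON block followed by "Hope that helps!" counts as "recoverable" rather than as a wrong answer, which matches what I saw: the content is fine, the format is not.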
What do you mean? You can force JSON with structured output.
It was just an example, though. In real-world scenarios, I sometimes have to tell the AI to respond in a specific strict format that is not JSON (e.g. asking it to end with "Good bye!"). Claude is the worst at following those types of instructions, and because of this it fails to return the correct answer in the correct format, even though the answer itself is good.
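For the non-JSON case, a minimal sketch of grading the answer and the format separately might look like this. The `grade` function, its field names, and the substring check for the answer are all assumptions for illustration; a real benchmark would likely need a stricter answer check.

```python
def grade(response: str, expected_answer: str, required_ending: str = "Good bye!") -> dict:
    """Score content and format independently (hypothetical grader)."""
    # Content check: naive substring match, an assumption made for brevity.
    answer_ok = expected_answer in response
    # Format check: strict, so any text or whitespace after the ending fails.
    format_ok = response.endswith(required_ending)
    return {
        "answer_ok": answer_ok,
        "format_ok": format_ok,
        "passed": answer_ok and format_ok,
    }


# The answer is right, but extra text after the required ending fails the format check.
result = grade("The capital is Paris. Good bye!\n\nLet me know if you need more.", "Paris")
print(result)
```

Reporting both fields makes it clear whether a model lost points for being wrong or only for not stopping where it was told to.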