You can't bullshit your way through this particular benchmark. Try it.
And yes, they're wrong. The latest/greatest models "make shit up" perhaps 5-10% as frequently as we were seeing just a couple of years ago. Only someone who has deliberately decided to stop paying attention could possibly argue otherwise.
I have noticed it's great in the hands of marketers and scammers, however. Real good at those "jobs", so I see why the cryptobros have now moved on to hailing LLMs as the second coming of Jesus.
I do find, however, that the newer the model, the fewer elementary mistakes it makes, and the better it is at figuring out what I really want. Getting to the right answer or the working function continues to become less frustrating over time, although not always monotonically so.
o1-pro is expensive and slow, for instance, but its performance on tasks that require step-by-step reasoning is just astonishing. As long as things keep moving in that direction I'm not going to complain (much).