Whenever I try and use a "state of the art" LLM to generate code it takes longer to get a worse result than if I just wrote the code myself from the start. That's the experience of every good dev I know. So that's my benchmark. AI benchmarks are BS marketing gimmicks designed to give the appearance of progress - there are tremendous perverse financial incentives.
This will never change because you can only use an LLM to generate code (or any other type of output) you already know how to produce and are expert at - because you can never trust the output.