These models still consistently fail the only benchmark that matters: if I give you a task, can you complete it successfully without making shit up?
Thus far they all fail. Code outputs don’t run, or variables aren’t captured correctly, or hallucinations are stated as factual rather than suspect or “I don’t know.”
It’s 2000’s PC gaming all over again (“gotta game the benchmark!”).