Like, I think there's definitely value in prompting a dozen LLMs with a detailed description of a CMS you want built with 12 specific features, a unit testing suite and mobile support, and then timing them to see how long they take and grading their results. But that's not how most developers use an LLM in practice.
Until LLMs become reliable one-shot machines, the thing I care most about is how well they augment my problem solving process as I work through a problem with them. I have no earthly idea of how to measure that, and I'm highly skeptical of anyone who claims they do. In the absence of empirical evidence we have to fall back on intuition.