This argument is backwards. Humans don't measure the next token prediction ability of the agents they speak to, human or AI. We rate speakers on whether they seem to understand what we say in context and respond by contributing useful information and analysis.
The attributes you're saying can be inferred from known superior next token prediction ability are the things we can actually detect and measure, at least qualitatively. Next token prediction quality is not measurable by humans in any human-meaningful way. Improving test cross entropy by 50% doesn't mean anything to us. It is irrelevant except as a mechanism to train LLMs.