It's not comparing your response to some hard truth, it's comparing your response to a typical response. Sort of like how LLMs dish stuff out based on what's probable, not based on hard truth.
So when you fail, it's not really saying you're wrong, it's saying you're not like most.