Really? This happens plenty with human testing. Humans aren't general?
The score is convoluted and messy. If the same score can say materially different things about capability, then that's a bad scoring methodology.
I can't believe I have to spell this out, but it seems critical thinking goes out the window when we start talking about machine capabilities.