The point of that benchmark is to check for hallucinations and overfitting. Does the model actually look at the picture and count the legs, or does it just see that it's a dog and answer four because it knows dogs usually have four legs?
It's a perfectly valid benchmark and very telling.
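
To make that concrete, here's a rough sketch of how you could score it. Everything in it is made up for illustration (the `Item` fields, the `score` function, and the toy answers are my assumptions, not taken from any real benchmark or model run). The idea is to look only at images where the true leg count differs from the usual one, then measure how often the model gets it right versus how often it falls back on the usual number.

```python
from dataclasses import dataclass


@dataclass
class Item:
    label: str        # what the image shows, e.g. "dog" (hypothetical field)
    true_count: int   # legs actually visible in the edited image
    prior_count: int  # the usual leg count for that animal
    answer: int       # what the model answered


def score(items):
    # Only counterfactual images (true count != usual count) can tell
    # "looked at the picture" apart from "guessed from the label".
    cf = [it for it in items if it.true_count != it.prior_count]
    if not cf:
        raise ValueError("need at least one counterfactual item")
    correct = sum(it.answer == it.true_count for it in cf)
    fell_back = sum(it.answer == it.prior_count for it in cf)
    return {"accuracy": correct / len(cf), "prior_rate": fell_back / len(cf)}


# Toy data, invented for the example; not real model output.
items = [
    Item("dog", true_count=5, prior_count=4, answer=4),
    Item("dog", true_count=3, prior_count=4, answer=3),
    Item("bird", true_count=3, prior_count=2, answer=2),
    Item("cat", true_count=4, prior_count=4, answer=4),  # normal image, skipped
]
result = score(items)
print(result)
```

A high `prior_rate` on the edited images is the telling part: it suggests the model is answering from what it knows about the animal rather than from what's in the picture.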