It is a fair question. I'd expect the numbers are all real. Competitors are going to rerun the benchmark with these models to see how the model is responding and succeeding on the tasks and use that information to figure out how to improve their own models. If the benchmark numbers aren't real their competitors will call out that it's not reproducible.
However it's possible that consumers without a sufficiently tiered plan aren't getting optimal performance, or that the benchmark is overfit and the results won't generalize well to the real tasks you're trying to do.