They could publish weekly benchmarks to disprove it. They almost certainly have internal benchmarking.
The shift is certainly real. It might not be model performance but contextual changes or token performance (tasks take longer even if the model stays the same).