Artificial Analysis isn't perfect, but it is an independent third party that actually runs the benchmarks itself, and it uses a wide range of them. It is a better automated litmus test than any other I've been able to find in years of watching the development of LLMs.
And the gap has been rapidly shrinking: https://www.youtube.com/watch?v=0NBILspM4c4&t=642s