The problem with even attempting to develop a tool for the frontier model space is that the cost to run a statistically significant benchmark is
almost certainly going to be over $100, and that's for a single model.
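To put rough numbers on that, here is a back-of-the-envelope sketch using the standard sample-size formula for comparing two pass rates (two-sided z-test). Everything plugged in is an assumption for illustration, not a measurement: the function names `tasks_per_arm` and `benchmark_cost`, the baseline and improved pass rates, the 5% significance level, the 80% power, and especially the cost per task, which varies a lot by model and task length.

```python
import math
from statistics import NormalDist


def tasks_per_arm(p_base, p_new, alpha=0.05, power=0.8):
    """Approximate tasks needed in EACH arm (with tool / without tool)
    to detect a pass-rate change from p_base to p_new.

    Uses the usual normal-approximation formula for two proportions.
    alpha and power are conventional defaults, assumed here.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    effect = p_new - p_base
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


def benchmark_cost(p_base, p_new, cost_per_task_usd):
    """Total cost for one model: two arms, one run per task in each arm."""
    return 2 * tasks_per_arm(p_base, p_new) * cost_per_task_usd


if __name__ == "__main__":
    # Hypothetical inputs: 50% baseline pass rate, tool claims 55%,
    # and an assumed average of $0.05 in API spend per task.
    n = tasks_per_arm(0.50, 0.55)
    print(f"tasks per arm: {n}")
    print(f"estimated cost: ${benchmark_cost(0.50, 0.55, 0.05):.2f}")
```

The thing to notice is that the required task count scales with one over the square of the effect size, so halving the improvement you want to detect roughly quadruples the bill, and agentic coding tasks can easily cost far more per run than the figure assumed here.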
Unless something is like 25%+ more cost effective on Gemini for a task, I would not assume those savings are going to transfer to GPT.
If you need to run a test this expensive and slow for every release, hobbyists aren't going to do it.
And if you wanted to demonstrate the kind of broad improvements to coding that they all claim, the costs would be in the thousands per release, even for a single model.
And the improvements almost certainly would not be eye-popping.
If the models could be SUBSTANTIALLY better, Google and Anthropic and OpenAI wouldn't be finding that out from a hobbyist making wildly unscientific claims.