1. Most models on HIGH/XHIGH provide only marginal improvements in accuracy, but at drastically increased latency and cost. One striking example is Gemini 3.1 Flash Lite, which on HIGH used 1.5M reasoning tokens, and its cost was 5x that of running 5.3-Codex: https://aibenchy.com/compare/google-gemini-3-1-flash-lite-pr...
2. On medium, most models seem to use a similar amount of reasoning tokens, so it should be a fairer comparison.
3. Most models in the wild are used on medium (chat apps, default coding apps, tools, etc.).
4. Running models on HIGH/XHIGH can lead to huge costs for me in maintaining the test suite. I might add more models on HIGH if I can do it in a sustainable way.
5. Running models on HIGH would also make test suites take much longer, so results would be published more slowly.
6. Some models even show degradation when used on HIGH, as they tend to overthink and doubt themselves more. This seems to be a trend especially for newer models, which were trained to say "wait, but" quite a lot...
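To make the cost argument in points 1 and 4 concrete, here is a back-of-the-envelope sketch. All token counts and per-million-token prices below are hypothetical placeholders, not real pricing for any provider; the point is only that reasoning tokens are billed like output tokens, so a nominally cheap model that burns millions of them on HIGH can cost more per task than a pricier model that reasons briefly:

```python
def run_cost(reasoning_tokens: int, output_tokens: int,
             price_per_m_tokens: float) -> float:
    """Dollar cost of one task, billing reasoning tokens as output tokens."""
    return (reasoning_tokens + output_tokens) / 1_000_000 * price_per_m_tokens

# Hypothetical numbers for illustration only:
cheap_on_high = run_cost(1_500_000, 5_000, 0.40)  # "cheap" model, HIGH effort
pricey_on_med = run_cost(60_000, 5_000, 2.00)     # pricier model, medium effort

print(f"{cheap_on_high:.3f}")  # 0.602
print(f"{pricey_on_med:.3f}")  # 0.130
```

Multiply that gap by hundreds of test cases per suite run and the sustainability problem is obvious.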
Overall, I am happy with how the current leaderboard/comparisons work. I might test some models on HIGH, but to me, a better indication of the true intelligence of a model/AGI is how well it does with "none"/no reasoning, rather than how well it does on HIGH.