I basically class LLM queries into two categories: there's stuff I expect most models to get, and there's stuff I expect only the smartest models to have a shot at getting right. There's some stuff in the middle ground that a quantized model running locally might not get, but that something dumb-but-acceptable like Sonnet 4.5 or Kimi K2 might be able to handle.
I generally just stick to the two extremes and route my queries accordingly. I've been burned by Sonnet 4.5/GPT-5 on the harder stuff too many times to trust them with it.
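A minimal sketch of that two-extremes routing, just to make it concrete. The model names and the `classify()` heuristic here are placeholders I made up; in practice I'm routing by hand, not with a keyword check.

```python
# Two-tier router sketch: every query goes to one of the two extremes,
# never the middle tier. All names below are hypothetical.

EASY_TIER = "local-quant-model"   # hypothetical local quantized model
HARD_TIER = "frontier-model"      # hypothetical smartest-available model

def classify(query: str) -> str:
    # Placeholder heuristic: treat queries mentioning these as "hard".
    hard_markers = ("prove", "debug", "refactor")
    return "hard" if any(m in query.lower() for m in hard_markers) else "easy"

def route(query: str) -> str:
    # Deliberately no middle option between the two tiers.
    return HARD_TIER if classify(query) == "hard" else EASY_TIER
```

The point is just that the decision is binary: anything that might trip up the cheap tier gets escalated all the way, rather than to a mid-tier model.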