Every time I've tried a local model, and I have tried lots over the past couple of years, it has seemed overtrained on benchmarks. They consistently perform dramatically worse than even older models from Anthropic/OAI/Google.
That might be true, but still: with Claude Opus I can describe a task in 2 lines and it will just do it; with a local Qwen I have to use plan mode for everything, even small tasks.