I'm seeing some people say Flash is amazing and can handle everything, and some say it's useless. It seems to depend on the task. I think it depends on the harness too (it works better in Claude Code in my experience; it's probably been trained on that).
It has limitations, but it is way better than I would expect from something named Flash that is open source.
My current workflow involves going from PRD -> execution plan -> build -> review, and this works nicely with open weight models like GLM 5.1, Kimi K2.6, and DeepSeek V4 Flash. With Opus I can generally skip the PRD entirely, and sometimes even skip the plan, and 80-90% of the time it does exactly what I want. But that can easily burn $5-15 for one feature, whereas it'll cost maybe $1-2 with the open weight models (at API pricing).
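To put rough numbers on that gap, here is a quick back-of-the-envelope sketch using only the per-feature ranges quoted above ($5-15 for Opus, $1-2 for the open weight models). The function name `savings_range` is made up for illustration, and the ranges are one person's estimates, not measured pricing.

```python
# Per-feature cost ranges quoted in the comment above (USD, API pricing).
# These are rough personal estimates, not benchmarked figures.
opus = (5.0, 15.0)
open_weight = (1.0, 2.0)


def savings_range(expensive, cheap):
    """Return the (smallest, largest) cost ratio between two (low, high) ranges."""
    smallest = expensive[0] / cheap[1]  # cheapest Opus run vs priciest open weight run
    largest = expensive[1] / cheap[0]   # priciest Opus run vs cheapest open weight run
    return smallest, largest


low, high = savings_range(opus, open_weight)
print(f"Opus costs roughly {low:.1f}x to {high:.1f}x more per feature")
```

So the spread is wide: depending on the feature, the difference is anywhere from a couple of times to more than ten times the cost, before counting the extra time spent writing the PRD and plan.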
That's the main thing I've noticed. Small models can follow instructions just fine, as long as the instructions are very specific. But then I often have to spend more time explaining a task than it would have taken me to do it myself.
The bigger models have a lot more common sense.
I wonder if that could be improved slightly through prompting, e.g. asking it to clarify anything that's confusing. Or maybe it just makes incorrect assumptions without realizing the ambiguity. One way to find out!
Though I tend to use it as a pair programmer, so I just stop it and provide guidance.
The real problem is that it is excessively verbose - it's impossible to keep up with its train of thought, and not practical to read it all. So I tend to just let it do its thing, then skim a bit and skip to the end for its summary.
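For anyone who wants to try the "ask it to clarify" idea, a minimal sketch of what that could look like is below. Everything here is hypothetical: `CLARIFY_RULE` and `build_prompt` are names invented for illustration, the wording of the rule is untested, and nothing in it is tied to a particular harness or model API. It just prepends a standing instruction to the task text.

```python
# Hypothetical standing instruction; the wording is a guess, not a tested prompt.
CLARIFY_RULE = (
    "Before writing any code, list anything in the task that is ambiguous "
    "or underspecified and ask about it. "
    "If nothing is unclear, say so and proceed."
)


def build_prompt(task: str) -> str:
    """Prepend the clarification rule to a task description."""
    return f"{CLARIFY_RULE}\n\nTask:\n{task.strip()}"


print(build_prompt("Add pagination to the /users endpoint."))
```

Whether a small model actually notices the ambiguity, rather than just saying nothing is unclear and carrying on, is exactly the open question.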
Try the opencode go subscription - you get the Chinese models at a 6x discount. I use like $1 a day...
30 day eval for each.
This is at least my experience with Claude Code as the harness. Also, GLM pricing is not that far off from Claude. It's cheaper, but not DeepSeek cheap.