>What's your evaluation setup like? It sounds like maybe the best thing to do is have a realistic evaluation that resembles your actual intended workload and workflow, and then just try everything.
That is quite literally what I have setup :)
I have a few codebases I've written over the years that I attempt a suite of specific tasks: code analysis/bug finding, bug fixing, adding features, that kind of thing. I keep track of the results, including wall clock time
>Do you think the choice of quantization matters that much for other models
It hugely matters. Lots more than r/LocalLlama would have you believe, sadly. Some model architectures can handle more aggressive quantisation than others, and it's hard to know ahead of time.
Step handles it surprisingly well (sparse MoE models seem to generally, when the particular layers are chosen to be quantised carefully). Qwen 3.6 27B handles it okay, but FP8 was better... except annoyingly Qwen's official FP8 has worse KLD/perplexity numbers/accuracy than it otherwise should. RedHat's one was better in my testing, though not by a huge amount.