Codex is notably higher quality but also has me waiting forever. Hopefully these small models get better and better, not just at benchmarks.
- Tool calling doesn't work properly with OpenCode
- It repeats itself very quickly. This is addressed in the Unsloth guide and can be "fixed" by setting --dry-multiplier to 1.1 or higher
- It makes a lot of spelling errors such as replacing class/file name characters with "1". Or when I asked it to check AGENTS.md it tried to open AGANTS.md
I tried both the Q4_K_XL and Q5_K_XL quantizations and they both suffer from these issues.
> Jan 21 update: llama.cpp fixed a bug that caused looping and poor outputs. We updated the GGUFs - please re-download the model for much better outputs.
This user has also done a bunch of good quants:
The flash model in this thread is more than 10x smaller (30B).
And while it usually leads to higher quality output, sometimes it doesn't, and I'm left with a bs AI slop that would have taken Opus just a couple of minutes to generate anyway.
Also notice that this is the "-Flash" version. They were previously at 4.5-Flash (they skipped 4.6-Flash). This is supposed to be equivalent to Haiku. Even on their coding plan docs, they mention this model is supposed to be used for `ANTHROPIC_DEFAULT_HAIKU_MODEL`.
However they are using more thinking internally and that makes them seem slow.
They also charge full price for the same cached tokens on every request/response, so I burned through $4 for 1 relatively simple coding task - would've cost <$0.50 using GPT-5.2-Codex or any other model besides Opus and maybe Sonnet that supports caching. And it would've been much faster.
1 million tokens per minute, 24 million tokens per day. BUT: cached tokens count full, so if you have 100,000 tokens of context you can burn a minute of tokens in a few requests.
It's even cheaper to just use it through z.ai themselves I think.
Yes, it has some restrictions as well but it still works for free. I have a private repository where I ended up creating a puppeteer instance where I can just input something in a cli and then get output in cli back as well.
With current agents. I don't see how I cannot just expand that with a cheap model like (think minimax2.1 is pretty good for agents) and get the agent to write the files and do the things and a loop.
I think the repository might have gotten deleted after I resetted my old system or similar but I can look out for it if this interests you.
Cerebras is such a good company. I talked to their CEO on discord once and have following it for >1-2 years now. I hope that they don't get enshittified with openAI deal recently & they improve their developer experience because people wish to pay them but now I had to do a shenanigan which was for free (but also its just that I was curious about how puppeteer works so I wanted to find if such idea was possible itself or not & I really didn't use it that much after building it)
Due to US foreign policy, I quit claude yesterday and picked up minimax m2.1 We wrote a whole design spec for a project I’ve previously written a spec for with claude (but some changes to architecture this time, adjacent, not same).
My gut feel ? I prefer minimax m2.1 with open code to claude. Easiest boycot ever.
(I even picked the 10usd plan, it was fine for now).
People talk about these models like they are "catching up", they don't see that they are just trailers hooked up to a truck, pulling them along.
They usually lagged for large sets of users: Linux was not as advanced as Solaris, PostgreSQL lacked important features contained in Oracle. The practical effect of this is that it puts the proprietary implementation on a treadmill of improvement where there are two likely outcomes: 1) the rate of improvement slows enough to let the OSS catch up or 2) improvement continues, but smaller subsets of people need the further improvements so the OSS becomes "good enough." (This is similar to how most people now do not pay attention to CPU speeds because they got "fast enough" for most people well over a decade ago.)
This is a terrible "test" of model quality. All these models fail when your UI is out of distribution; Codex gets close but still fails.
And yet, in terms of coding performance (at least as measured by SWE-Bench Verified), it seems to be roughly on par with o3/GPT-5 mini, which would be pretty impressive if it translated to real-world usage, for something you can realistically run at home.
If you were happy with Claude at its Sonnet 3.7 & 4 levels 6 months ago, you'll be fine with them as a substitute.
But they're nowhere near Opus 4.5
This seems pretty darn good for a 30B model. That's significantly better than the full Qwen3-Coder 480B model at 55.4.
You can get LLM as a service for cheaper.
E.g. This model costs less than a tenth of Haiku 4.5.
https://github.com/ggml-org/llama.cpp/releases
https://huggingface.co/ngxson/GLM-4.7-Flash-GGUF/blob/main/G...
https://github.com/ggml-org/llama.cpp?tab=readme-ov-file#sup...
llama-server -ngl 999 --ctx-size 32768 -m GLM-4.7-Flash-Q4_K_M.gguf
You can then chat with it at http://127.0.0.1:8080 or use the OpenAI-compatible API at http://127.0.0.1:8080/v1/chat/completionsSeems to work okay, but there usually are subtle bugs in the implementation or chat template when a new model is released, so it might be worthwhile to update both model and server in a few days.
ollama run hf.co/ngxson/GLM-4.7-Flash-GGUF:Q4_K_M
It's really fast! But, for now it outputs garbage because there is no (good) template. So I'll wait for a model/template on ollama.comHEAD of ollama with Q8_0 vs vLLM with BF16 and FP8 after.
BF16 predictably bad. Surprised FP8 performed so poorly, but I might not have things tuned that well. New at this.
┌─────────┬───────────┬──────────┬───────────┐
│ │ vLLM BF16 │ vLLM FP8 │ Ollama Q8 │
├─────────┼───────────┼──────────┼───────────┤
│ Tok/sec │ 13-17 │ 11-19 │ 32 │
├─────────┼───────────┼──────────┼───────────┤
│ Memory │ ~62GB │ ~28GB │ ~32GB │
└─────────┴───────────┴──────────┴───────────┘
Most importantly, it actually worked nice in opencode, which I couldn't get Nemotron to do. We’ve launched GLM-4.7-Flash, a lightweight and efficient model designed as the free-tier version of GLM-4.7, delivering strong performance across coding, reasoning, and generative tasks with low latency and high throughput.
The update brings competitive coding capabilities at its scale, offering best-in-class general abilities in writing, translation, long-form content, role play, and aesthetic outputs for high-frequency and real-time use cases.
https://docs.z.ai/release-notes/new-releasedGLM 4.7 is good enough to be a daily driver but it does frustrate me at times with poor instruction following.
https://x.com/natebrake/status/2013978241573204246
Thus far, the 6-bit quant MLX weights were too much and crashed LMS with OOM
In my ime small tier models are good for simple tasks like translation and trivia answering, but are useless for anything more complex. 70B class and above is where models really start to shine.
Not for code. The quality is so low, it's roughly on par with Sonnet 3.5
My recommendation would be to use other tools built to support pluggable model backends better. If you're looking for a Claude Code alternative, I've been liking OpenCode so far lately, and if you're looking for a Cursor alternative, I've heard great things about Roo/Cline/KiloCode although I personally still just use Continue out of habit.
https://huggingface.co/inference/models?model=zai-org%2FGLM-...
Slow inference is also present on z.ai, eyeballing it the 4.7 flash model was twice as slow as regular 4.7 right now.
I am interesting if I can run it on a 24GB RTX 4090.
Also, would vllm be a good option?
Should be able to run this in 22GB vram so your 4090 (and a 3090) would be safe. This model also uses MLA so you can run pretty large context windows without eating up a ton of extra vram.
edit: 19GB vram for a Q4_K_M - MLX4 is around 21GB so you should be clear to run a lower quant version on the 4090. Full BF16 is close to 60GB so probably not viable.
I'm thinking of giving it a go with aider, but using something like gemma3:27b as the architect. I don't think you can have different models for different skills in opencode, but with smaller local models I suspect it's unavoidable for now.
I suppose Flash is merely a distillation of that. Filed under mildly interesting for now.
My Mac Mini probably isn't up for the task, but in the future I might be interested in a Mac Studio just to churn at long-running data enrichment types of projects
Also, according to the gpt-oss model card 20b is 60.7 (GLM claims they got 34 for that model) and 120b is 62.7 on SWE-Bench Verified vs GLM reports 59.7
Tolerating this is very bad form from openrouter, as they default-select lowest price -meaning people who just jump into using openrouter and do not know about this fuckery get facepalm'd by perceived model quality.
ssh admin.hotaisle.app
Yes, this should be made easier to just get a VM with it pre-installed. Working on that.
It took me quite some time to figure the magic combination of versions and commits, and to build each dependency successfully to run on an MI325x.