Qwen and Gemma are great, but they need babysitting every 30 mins, which is quite a cognitive load.
I think they're absolutely needed. I can't afford 200 USD a month for personal use of coding AI, and I don't think such prices are reasonable for most of the world economy anyway. Not to mention US firms might be giving their employees a lot more than that.
It increasingly feels, to me, like there's a gap building up between the haves and the have-nots. But then we get news of these open weight models that are reasonably priced for inference, with reasonable capabilities. Yes, they take maybe 6-9 months to get there, but tbh that's not a bad trade-off at all.
I subscribed to their max plan to try it out. It counted 700M tokens against me and drained my weekly quota in under 2 days.
The quota reset less than 24h ago and I'm already at >60% of my weekly quota.
For reference, the kind of work I did would have used somewhere between 3% and 5% of Codex max or Claude max.
The model is good; the plan is a scam.
I do think the Chinese models are good enough for an 80/20 rule use case.
Will they still rent out their own model? Will they support the open model and become a resource provider? Will they be able to repay the billions of dollars?
This is probably the first question I would ask someone from Anthropic, if I ever meet one.
Been playing with GLM 5.2 in different contexts. It's less good if you don't max out thinking, but at xhigh it's been able to solve most problems I was throwing at Opus in about the same amount of time (via OpenRouter).
Wild time to be alive.
1. SWE-bench Pro
Model             Score (%)
GLM-5.2           62.1
GLM-5.1           58.4
Claude Opus 4.8   69.2
GPT-5.5           58.6
Gemini 3.1 Pro    54.2
2. Terminal-Bench 2.1
Model             Score (%)
GLM-5.2           81.0
GLM-5.1           63.5
Claude Opus 4.8   85.0
GPT-5.5           84.0
Gemini 3.1 Pro    74.0
3. NL2Repo
Model             Score (%)
GLM-5.2           48.9
GLM-5.1           42.7
Claude Opus 4.8   69.7
GPT-5.5           50.7
Gemini 3.1 Pro    33.4
4. DeepSWE
Model             Score (%)
GLM-5.2           46.2
GLM-5.1           18.0
Claude Opus 4.8   58.0
GPT-5.5           70.0
Gemini 3.1 Pro    10.0
5. ProgramBench
Model             Score (%)
GLM-5.2           63.7
GLM-5.1           50.9
Claude Opus 4.8   71.9
GPT-5.5           70.8
Gemini 3.1 Pro    39.5
6. MCP-Atlas
Model             Score (%)
GLM-5.2           77.0
GLM-5.1           71.8
Claude Opus 4.8   77.8
GPT-5.5           75.3
Gemini 3.1 Pro    69.2
7. Tool-Decathlon
Model             Score (%)
GLM-5.2           48.2
GLM-5.1           40.7
Claude Opus 4.8   59.9
GPT-5.5           55.6
Gemini 3.1 Pro    48.8
8. Humanity's Last Exam
Model             Base Score (%)   Score w/ Tools (%)
GLM-5.2           40.5             54.7
GLM-5.1           31.0             52.3
Claude Opus 4.8   49.8             57.9
GPT-5.5           41.4             52.2
Gemini 3.1 Pro    45.0             51.4
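For anyone who wants to eyeball the gaps rather than scan eight tables, here is a rough Python sketch. The scores are copied straight from the tables above; the names `MODELS`, `ROWS`, `gaps`, and `wins` are just my own helpers, and treating the two Humanity's Last Exam columns as separate rows is an assumption made for convenience.

```python
# Scores (%) as quoted in the tables above, one tuple per benchmark,
# in the same order as MODELS. "HLE" = Humanity's Last Exam.
MODELS = ("GLM-5.2", "GLM-5.1", "Claude Opus 4.8", "GPT-5.5", "Gemini 3.1 Pro")
ROWS = {
    "SWE-bench Pro":      (62.1, 58.4, 69.2, 58.6, 54.2),
    "Terminal-Bench 2.1": (81.0, 63.5, 85.0, 84.0, 74.0),
    "NL2Repo":            (48.9, 42.7, 69.7, 50.7, 33.4),
    "DeepSWE":            (46.2, 18.0, 58.0, 70.0, 10.0),
    "ProgramBench":       (63.7, 50.9, 71.9, 70.8, 39.5),
    "MCP-Atlas":          (77.0, 71.8, 77.8, 75.3, 69.2),
    "Tool-Decathlon":     (48.2, 40.7, 59.9, 55.6, 48.8),
    "HLE (base)":         (40.5, 31.0, 49.8, 41.4, 45.0),
    "HLE (w/ tools)":     (54.7, 52.3, 57.9, 52.2, 51.4),
}


def gaps(model, baseline):
    """Per-benchmark difference (model minus baseline), in percentage points."""
    i, j = MODELS.index(model), MODELS.index(baseline)
    return {bench: round(row[i] - row[j], 1) for bench, row in ROWS.items()}


def wins(model, baseline):
    """Benchmarks where `model` scores strictly higher than `baseline`."""
    return [bench for bench, gap in gaps(model, baseline).items() if gap > 0]


for bench, gap in gaps("GLM-5.2", "Claude Opus 4.8").items():
    print(f"{bench:20s} {gap:+.1f}")
```

On the numbers as quoted, GLM-5.2 is ahead of Gemini 3.1 Pro on 7 of the 9 rows (behind on Tool-Decathlon and HLE base) and behind Opus 4.8 on all of them, with the smallest gap on MCP-Atlas.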
Seems to be handily beating Gemini 3.1 Pro. What _is_ Google DeepMind doing (other than bleeding talent to A\)?
For coding I still use 5.5 w/ Codex and prefer that to other models + harness combinations.
But the reasoning traces became increasingly hilarious, with it getting confused and going in loops, doubting itself. I began to feel almost sad; it was like listening to the internal monologue of someone with an anxiety disorder.
It made pretty good progress but wound up going in a lot of goofy loops and doing things a bit "off" from the standards I'd hoped it would infer. Finally it started going a bit nuts ("This is very confusing.", "OH WAIT"), seemingly hallucinating a whole side-quest that didn't make sense and looking at making internal system changes to try to achieve its (now very confused) goal, at which point I pulled the plug.
Without seeing the reasoning traces from Claude/GPT it's hard to really know, but it definitely didn't feel like the same quality of reasoning, even if dogged persistence does wind up actually working eventually.
Is 2 better than x.ai?
Perhaps it is just my harness and workflow, but the older model still seems to work better. Also, the token cost is significantly lower. I rarely spend more than $20 a week with a $50 cap. That's not even half of Claude's ambiguous minimum $200 a month plan.
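Back-of-the-envelope check on that comparison, as a small Python sketch. The only inputs are the figures in the comment ($20 typical week, $50 weekly cap, $200 monthly plan); converting weeks to months as 52/12 is my assumption, and `weekly_to_monthly` is a made-up helper name.

```python
WEEKS_PER_MONTH = 52 / 12  # assumption: average month length, about 4.33 weeks


def weekly_to_monthly(weekly_usd):
    """Convert a weekly spend in USD to an average monthly spend."""
    return weekly_usd * WEEKS_PER_MONTH


typical = weekly_to_monthly(20)   # the "rarely more than $20 a week" case
capped = weekly_to_monthly(50)    # worst case if the $50 weekly cap is hit every week
plan = 200                        # the $200 a month plan mentioned above

print(f"typical: ${typical:.2f}/month, cap: ${capped:.2f}/month, plan: ${plan}/month")
print("typical spend is under half the plan:", typical < plan / 2)
```

So the "not even half" claim holds for the typical week, though hitting the $50 cap every single week would land slightly above $200 a month.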
At the end of the day, open weights should be seen as nothing more than information (they're just numbers, after all), and so organisations like the EFF should sue over any restriction on them on 1st Amendment grounds.