This is a huge blow to Anthropic/OpenAI/Google and a massive win for the rest of the world. The official API prices and speeds mean nothing for open source models.
(there's a table which shows comparison between vendors)
Also, it seems there's a general one as well (for all kimi models?): https://github.com/MoonshotAI/Kimi-Vendor-Verifier
Looking at openrouter [1], some of the cheaper offerings are for quantized models. Not sure how much intelligence is lost in quantization. And they are not 3 times cheaper. Where did you find 3x lower prices for APIs? I am considering skipping open router and using them directly for that price.
edit:
I see, croft [2] 8bit for $0.50/$0.08/$2.20
I do not have GLM 5.2 numbers because the whole default max setting is overkill. But the GLM 5.1 numbers had it at 12x cheaper than API rates, and about 2.5x more tokens vs. z.ai's own subscription service.
Yes, it's FP8, but let's be honest: do we know for sure that even z.ai runs at FP16? I learned a long time ago with Claude and Codex how much cheating happens at the model level, even from the big boys.
I've tried a number of these, and the learning curve is very steep compared to "install Claude Code and pay $100/mo". There is no way saving me $50/month matters compared to figuring that out.
https://docs.z.ai/devpack/tool/claude
Here's my setup. I add this to my .bashrc
export ZAI_API_KEY="your_key_here"
alias claudez='ANTHROPIC_AUTH_TOKEN="$ZAI_API_KEY" ANTHROPIC_BASE_URL="https://api.z.ai/api/anthropic" ANTHROPIC_DEFAULT_OPUS_MODEL="glm-5.2[1m]" ANTHROPIC_DEFAULT_SONNET_MODEL="glm-4.7" ANTHROPIC_DEFAULT_HAIKU_MODEL="glm-4.7" claude'
Then I just run claudez
pro tip the same thing works with deepseek https://api-docs.deepseek.com/guides/anthropic_api
Even more pro tip: Claude Code can set this up for you haha
There's ZCode (https://zcode.z.ai), which is like the Codex App.
That's as "easy" as it is for non-devs that you're complaining about.
Yes, there is. It's called Claude Code. Point it at the HuggingFace URL and say "Download these weights and build whatever is needed to run them, then test the model."
I'd pay for an out of the box solution. i.e. an Installer with updates
Wasn't this released like 2 days ago? Everyone is still evaluating and playing around with it; things like the submission are just starting to come out. Give it some days at least before jumping to conclusions, ideally weeks.
Now, maybe GLM 5.2 is close to Opus 4.7, but I don't wanna keep checking them and keep finding that they're still benchmaxing and aren't at GPT (my choice) or Opus level. The boy who cried wolf, I guess.
1. Keeping your data private and in the US
2. Not training on it
3. Not quantizing the model
4. Offering reasonable latency and rate limits
With that said, I'm excited to try GLM 5.2 because I still end up reaching for Opus and GPT 5.5 for many tasks because the open models tend to get stuck more often on complex problems.
https://github.com/QuantiusBenignus/Zshelf/discussions/2
Not accounting for hardware, of course :)
link?
> Why
imho everything but opus produces unusable code (fable was even better...), eg gpt5.5 seems to write the absolute worst code that still technically solves the problem; tbh I'd be totally willing to trade "raw intelligence" for "code taste"
more labs need to figure out whatever anthropic did to destroy everybody else on frontiercode bench
I know it's hard to improve on that, but now that their models are good enough at raw intelligence, I think this should become a higher priority task.
Currently on https://artificialanalysis.ai/#output-tokens GPT 5.5 xhigh spends 16k tokens total on average, GPT 5.5 high is 10k, Fable 5 33k, Opus 4.8 41k, GLM 5.2 is 42k. GPT 5.5 is extremely reasoning efficient.
Of course if you convert those values to actual request cost, GLM 5.2 will probably beat GPT 5.5/Opus 4.8, but speed matters for a lot of people, I think.
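A rough sketch of that conversion, in Python. The token counts are the averages quoted above. The per-million-token prices in the usage note are placeholders invented for illustration, not anyone's real list prices, so plug in your provider's actual rates.

```python
# Average total tokens per task, as quoted from Artificial Analysis above.
AVG_TOKENS = {
    "GPT 5.5 xhigh": 16_000,
    "GPT 5.5 high": 10_000,
    "Fable 5": 33_000,
    "Opus 4.8": 41_000,
    "GLM 5.2": 42_000,
}


def request_cost(tokens: int, usd_per_million: float) -> float:
    """Cost in USD of one request that uses `tokens` tokens."""
    return tokens / 1_000_000 * usd_per_million


def break_even_price(tokens_a: int, price_a: float, tokens_b: int) -> float:
    """Per-million price at which model B costs the same per request as model A."""
    return tokens_a * price_a / tokens_b


def token_ratio(model_a: str, model_b: str) -> float:
    """How many times more tokens model_a spends than model_b."""
    return AVG_TOKENS[model_a] / AVG_TOKENS[model_b]
```

With these averages GLM 5.2 spends 42k / 16k = 2.625x the tokens of GPT 5.5 xhigh, so it only wins on cost per request if its per-token price is below roughly 38% of GPT 5.5's. For example, with a made-up price of $30 per million for GPT 5.5 xhigh, `break_even_price(16_000, 30.0, 42_000)` gives the highest GLM 5.2 price that would still tie.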
If you want reasonable token usage, you need to run GLM 5.2 at High. There is little drop in quality from Max to High (for most tasks), and it cuts token usage by 2 to 2.5x. GLM 5.2 Max is really something you only need for complex tasks.
In essence, GLM 5.2 is Opus 4.8's little brother, at a way, WAY cheaper price.
There has been really no training on Opus models going on, really, none i tell you! /sarcasm
This is insane! I can't wait until technology progresses to the point we can run these things on consumer hardware!
IMHO it's already surpassed them. I vastly prefer my personal GLM and OpenCode setup to the Claude Code and Opus one that I have to use at work. The former makes way fewer StackOverflow brogrammer-tier mistakes and is considerably better at following instructions. The harness UX is also vastly superior as it doesn't ignore, randomly change, or incorrectly report settings.
Maybe it's the harness and I'd have even greater success with OpenCode and Anthropic, but I think it safe to say that Anthropic's moat is evaporating.
To the point where I stop it and simply tell it “start writing code, you can work it out as you go along”
Seems writer's block also affects LLMs
In this paper they nerf an LLM's ability to emit waffling thinking tokens like "wait", "but", "alternatively", and the models (they're old, small models in the paper) terminate reasoning faster and perform better. I bet Anthropic is tuning this on their backend.
Another thing I tell Claude to do is to not guess, but look at the documentation. It messes up a lot less. It might use some tokens reading docs, but at least it has a higher success rate code-wise.
Just output the code and we’ll work through it!
I feel similarly about having codex review claude’s plans. I don’t think I’ve ever seen it catch a major issue. It just points out things that would have inevitably been addressed during implementation anyway.
It's clear it was the vibe coding model: like no other model before, it fully turned you into its assistant instead of the other way around.
Per AA, while K2.7 Code is roughly on par w/ K2.6 in terms of intelligence, it uses half the output tokens to get there.
Don't have any evals indicating how it compares on upper-bound quality, but for a well-defined task it seems like GLM 5.2 on "High" is remarkably token efficient. Looking forward to seeing where it lands on the AA index.
GLM5.2 ends up being far more expensive than I thought it would be when I tried it on openrouter. I ground through $5 USD worth of tokens quite quickly.
And this was high, not max.
All it does is pull a JSON from their main table page and parse out the fields I care about (coding).
There used to be a mailing list associated with it but eh ... there wasn't much interest. I use the script every day though.
Current partial output
score age size name
47.1 58 large Kimi K2.6
47.5 54 large DeepSeek V4 Pro (Reasoning, Max Effort)
47.5 70 - Muse Spark
47.6 132 - Claude Opus 4.6 (Non-reasoning, High Effort)
47.8 205 - Claude Opus 4.5 (Reasoning)
48.1 132 - Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
48.6 55 - GPT-5.5 (Non-reasoning)
48.7 188 - GPT-5.2 (xhigh)
50.1 29 - Qwen3.7 Max
50.7 1 large GLM-5.2 (max)
50.9 120 - Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
51.5 92 - GPT-5.4 mini (xhigh)
52.1 55 - GPT-5.5 (low)
52.5 62 - Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
53.1 132 - GPT-5.3 Codex (xhigh)
53.1 62 - Claude Opus 4.7 (Non-reasoning, High Effort)
55.5 118 - Gemini 3.1 Pro Preview
56.2 55 - GPT-5.5 (medium)
56.7 20 - Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
57.2 104 - GPT-5.4 (xhigh)
58.5 55 - GPT-5.5 (high)
59.1 55 - GPT-5.5 (xhigh)
62 8 - Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
To see everything, run it like so $ curl day50.dev/art-analysis.sh | bash
The repo: https://github.com/day50-dev/aa-eval-email
some key takeaways:
* open models are on about a 4-7 month lag right now depending on how you want to measure it
* if this keeps up, you might see an open-weights model doing claude fable 5 level work before the new year.
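One way to put a number on that lag, sketched in Python from a subset of the rows in the table above. The method is my own assumption, not how the script computes anything: for an open-weights model, find the oldest closed model that already scored at least as high and take the difference in age. I'm also assuming the rows marked "large" in the size column are the open-weights ones.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Row:
    score: float
    age_days: int
    open_weights: bool  # assumption: rows with size "large" are open weights
    name: str


# Subset of the table above (score, age in days, open?, name).
ROWS = [
    Row(47.1, 58, True, "Kimi K2.6"),
    Row(47.5, 54, True, "DeepSeek V4 Pro (Reasoning, Max Effort)"),
    Row(47.8, 205, False, "Claude Opus 4.5 (Reasoning)"),
    Row(48.7, 188, False, "GPT-5.2 (xhigh)"),
    Row(50.7, 1, True, "GLM-5.2 (max)"),
    Row(50.9, 120, False, "Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)"),
    Row(53.1, 132, False, "GPT-5.3 Codex (xhigh)"),
    Row(55.5, 118, False, "Gemini 3.1 Pro Preview"),
    Row(57.2, 104, False, "GPT-5.4 (xhigh)"),
    Row(62.0, 8, False, "Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)"),
]


def lag_days(open_row: Row, rows: List[Row]) -> Optional[int]:
    """Days between the oldest closed model scoring >= open_row and open_row itself.

    Returns None if no closed model has reached that score yet.
    """
    ages = [r.age_days for r in rows
            if not r.open_weights and r.score >= open_row.score]
    if not ages:
        return None
    return max(ages) - open_row.age_days


def lag_months(open_row: Row, rows: List[Row]) -> Optional[float]:
    """Same lag expressed in average-length months (30.44 days)."""
    days = lag_days(open_row, rows)
    return None if days is None else days / 30.44
```

On these rows GLM-5.2 (max) comes out 131 days behind GPT-5.3 Codex (xhigh), and Kimi K2.6 and DeepSeek V4 Pro come out 147 and 151 days behind Claude Opus 4.5, so a bit over 4 to about 5 months by this particular measure.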
if people sign up for the free mailing list (that just does this) I'll go and put it back on ... emails when new model evals drop - it was pretty useful.
score age size name
62.0 8 - Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
59.1 55 - GPT-5.5 (xhigh)
58.5 55 - GPT-5.5 (high)
57.2 104 - GPT-5.4 (xhigh)
56.7 20 - Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
56.2 55 - GPT-5.5 (medium)
55.5 118 - Gemini 3.1 Pro Preview
53.1 132 - GPT-5.3 Codex (xhigh)
53.1 62 - Claude Opus 4.7 (Non-reasoning, High Effort)
52.5 62 - Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
52.1 55 - GPT-5.5 (low)
51.5 92 - GPT-5.4 mini (xhigh)
50.9 120 - Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
50.7 1 large GLM-5.2 (max)
50.1 29 - Qwen3.7 Max
48.7 188 - GPT-5.2 (xhigh)
48.6 55 - GPT-5.5 (Non-reasoning)
48.1 132 - Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
47.8 205 - Claude Opus 4.5 (Reasoning)
- GPT 5.5 consistently the best, an opinion that gets me constant downvotes here by the Anthropic Marketeer strike force...
- China is going to eat the US lunch on AI
- What have European universities and companies been doing? It's as if, in a parallel past/future, Nikola Tesla and Edison had created flying Cyberpunk machines while European researchers were getting together to request EU funds for research into how to breed faster horses.
- If Zuckerberg could be fired after spending a total of $235 billion on AI and having NOTHING to show for it... should he be fired?
Are the scores here normalized such that each point difference is equidistant?
rank score age size name
1 62.0 8 - Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
2 59.1 55 - GPT-5.5 (xhigh)
3 58.5 55 - GPT-5.5 (high)
4 57.2 104 - GPT-5.4 (xhigh)
5 56.7 20 - Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
6 55.5 118 - Gemini 3.1 Pro Preview
7 53.1 62 - Claude Opus 4.7 (Non-reasoning, High Effort)
8 53.1 132 - GPT-5.3 Codex (xhigh)
9 52.5 62 - Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
10 51.5 92 - GPT-5.4 mini (xhigh)
11 50.9 120 - Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
12 50.7 1 large GLM-5.2 (max)
13 50.1 29 - Qwen3.7 Max
14 48.7 188 - GPT-5.2 (xhigh)
15 48.1 132 - Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
16 47.8 205 - Claude Opus 4.5 (Reasoning)
17 47.6 132 - Claude Opus 4.6 (Non-reasoning, High Effort)
18 47.5 70 - Muse Spark
19 47.5 54 large DeepSeek V4 Pro (Reasoning, Max Effort)
20 47.1 58 large Kimi K2.6
21 47.1 29 - Gemini 3.5 Flash (minimal)
22 46.7 449 - Gemini 2.5 Pro Preview (Mar' 25)
23 46.5 211 - Gemini 3 Pro Preview (high)
24 46.5 16 - Qwen3.7 Plus
25 46.4 120 - Claude Sonnet 4.6 (Non-reasoning, High Effort)
26 45.6 5 large Kimi K2.7 Code
27 45.6 104 - GPT-5.4 (low)
28 45.5 56 large MiMo-V2.5-Pro
29 45.1 43 - GPT-5.5 Instant (May 2026)
30 45.0 29 - Gemini 3.5 Flash (high)
31 44.9 58 - Qwen3.6 Max Preview
32 44.7 216 - GPT-5.1 (high)
33 44.2 188 - GPT-5.2 (medium)
34 44.2 126 large GLM-5 (Reasoning)
35 43.9 92 - GPT-5.4 nano (xhigh)
36 43.4 71 large GLM-5.1 (Reasoning)
37 43.4 16 large MiniMax-M3
38 43.2 54 large DeepSeek V4 Pro (Reasoning, High Effort)
39 43.0 188 - GPT-5.2 Codex (xhigh)
40 42.9 76 - Qwen3.6 Plus
41 42.9 205 - Claude Opus 4.5 (Non-reasoning)
42 42.6 182 - Gemini 3 Flash Preview (Reasoning)
43 42.2 99 - Grok 4.20 0309 (Reasoning)
44 42.1 56 large MiMo-V2.5
45 41.9 91 large MiniMax-M2.7
46 41.4 91 - MiMo-V2-Pro
47 41.3 121 large Qwen3.5 397B A17B (Reasoning)
48 41.0 48 - Grok 4.3 (high)
49 40.5 71 - Grok 4.20 0309 v2 (Reasoning)
50 40.5 342 - Grok 4
51 39.8 54 large DeepSeek V4 Flash (Reasoning, High Effort)
A longer curated list based on kristopolous’ list, with more models included. For each model, I kept only the two highest-scoring entries. I used DeepSeek V4 Flash as the cutoff, since I consider it the lowest acceptable model that is still locally deployable.
But if that's your thing, here you go: https://github.com/day50-dev/aa-eval-email/commit/1853be6461...
add an argument (any argument) and it will be sorted as you specified. It just works as a toggle flipping the order ... so literally any string will do.
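For anyone who wants the same toggle in their own tooling, here is a minimal Python sketch of the idea. It is not the actual script (that one is shell), the three rows are just samples from the table, and the default direction (ascending, best model at the bottom) is my assumption based on the output pasted earlier.

```python
import sys
from typing import List, Sequence, Tuple

# (score, age in days, size, name), a few sample rows from the table.
Row = Tuple[float, int, str, str]

ROWS: List[Row] = [
    (50.7, 1, "large", "GLM-5.2 (max)"),
    (62.0, 8, "-", "Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)"),
    (47.1, 58, "large", "Kimi K2.6"),
]


def sort_rows(rows: Sequence[Row], argv: Sequence[str]) -> List[Row]:
    """Sort by score; the mere presence of any extra argument flips the order."""
    descending = len(argv) > 1  # argv[0] is the program name
    return sorted(rows, key=lambda r: r[0], reverse=descending)


def format_table(rows: Sequence[Row]) -> str:
    """Render rows as a simple aligned plain-text table."""
    lines = [f"{'score':>5} {'age':>4} {'size':<5} name"]
    for score, age, size, name in rows:
        lines.append(f"{score:>5.1f} {age:>4} {size:<5} {name}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_table(sort_rows(ROWS, sys.argv)))
```

Because the argument's value is never inspected, `python table.py x` and `python table.py anything-else` behave identically, which matches the "literally any string will do" behaviour.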
The original link has been updated accordingly with the new code.
Set up a fresh new large monitor. Open CLI. Run command. Watch output at the bottom of your screen. Keep watching the bottom of your screen for the rest of the day.
Sure you can tile windows and it helps but come on. Just have the command/input section in the bottom and the “output” on top. Keep the command bit on the bottom.
It seems like it's up for the task of complex code, but those little paper-cuts are scary to me. I wouldn't trust this model for anything remotely serious.
https://artificialanalysis.ai/agents/coding-agents?coding-ag...
I thought I was "holding it wrong" until DeepSWE came along -- personally it seems to match my own experiences pretty well. Really makes me wonder how legitimate some of the internet noise is about open models. There's surely some use cases for them, not everything needs the absolute frontier (GPT5.5 on low is awesome), but if you want to be near the frontier everyone needs to be honest about the fact that we're only talking about Opus, Fable, GPT5.5.
I'm not accusing anyone specifically, but I've noticed Chinese bots swamping certain YouTube channels that, for example, cover US defense industry news. They'll downplay any and all technical advances, play up China's dominance, US cowardice, etc. All very transparent. I suspect some of the online conversation about open Chinese models is driven by that. How often do you see people talking about Mistral or Trinity? Never. Because they don't play that game.
It's easily 4x the cost of DeepSeek V4 but I didn't actually feel the results were that much better. I had GPT 5.5 in Codex review it after it was done and there was plenty of slop to go around.
Having better luck with MiniMax M3, from a cost/benefit ratio.
With a good harness, that's my favorite model for any personal project. I use Opus 4.8 at work because I don't have to pay for it and of course I love it, but DeepSeek is like 80% there for one tenth of the price.
GPT can find fault in everything and anything including its own work.
openai, google and anthropic subscriptions are not available with privacy.
looking at the link there it's interesting that going from cursor cli to codex cli takes gpt 5.5 from 7th to 3rd. but they didn't run the open models in codex.
so, hard to say it's for sure a model benchmark. maybe open models are just shit at swe agent harness...it's not the most parsimonious explanation though.
Unless you're running it locally, aren't you just trusting some other entity?
Fable 5 is cool and all, but we have not yet seen GPT-5.6.
It’s expensive, and not as capable as the frontier models, but would have some pretty big benefits around privacy and agency.
Not everyone is willing (or even legally able) to send their trade secrets to OpenAI or Anthropic
There they can deploy these models while using the existing legal frameworks.
Your usage will peak during certain timezones' work hours (even if you are a huge multinational company, most of your engineers/users tend to be in only a few locations), so you have a bunch of GPUs doing nothing the rest of the day. Especially with latency-sensitive stuff, this is a decades-old tradeoff problem; it's not unique to LLMs.
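To make the tradeoff concrete, here is a small Python sketch. The demand profile and the hourly GPU price in the usage note are invented for illustration; only the arithmetic is the point.

```python
from typing import Sequence


def utilization(hourly_demand: Sequence[float], capacity: float) -> float:
    """Fraction of provisioned GPU-hours that actually serve traffic.

    `hourly_demand` is the number of GPUs needed in each hour; demand above
    `capacity` is simply capped, since you cannot use GPUs you do not have.
    """
    if capacity <= 0 or not hourly_demand:
        raise ValueError("capacity must be positive and demand non-empty")
    used = sum(min(d, capacity) for d in hourly_demand)
    return used / (capacity * len(hourly_demand))


def effective_hourly_cost(gpu_hour_cost: float, util: float) -> float:
    """What each *useful* GPU-hour really costs once idle time is paid for."""
    if not 0 < util <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return gpu_hour_cost / util


# Hypothetical day: 8 GPUs needed during 8 working hours, 1 GPU otherwise.
WORKDAY = [8.0] * 8 + [1.0] * 16
```

Sizing for the peak (`capacity=8`) on that made-up day uses 80 of 192 provisioned GPU-hours, about 42%, so a hypothetical $2.00 GPU-hour effectively costs $4.80 per useful hour. Sizing below the peak raises utilization but pushes the latency-sensitive peak traffic into a queue, which is the tradeoff described above.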
Would need to be a pretty determined medium biz
Years.
Even Microsoft said they don't have enough for Github and need to call Amazon.
Getting a few even at decent prices is hard. Unless the shortage eases...
That's actually pretty uncommon these days. All of the OpenAI/Anthropic/Gemini models accept images, and so do the other leading open weight families - Gemma 4, Qwen 3.6, Kimi 2.x.
In GLM's case image input would be useful because it's a model that scores very highly for tasks like web design, but without image input it can't take a screenshot and output HTML+CSS.
Don't get me wrong, GLM is a phenomenal model, but the image thing is a bit of a gap.
Even the local models I run on my Mac are getting surprisingly good at that now.
It also means that if they actually trained with vision, they'd be on par with Anthropic models as vision seems to improve model performance across the board even for non-vision tasks.
With open weights LLMs, it is affordable to use many different models, each for whatever it is better.
Moreover, for analyzing "UIs, photos, screenshots, etc." there are small models that can be run locally on smartphones or laptops, e.g. IBM granite-vision-4.1-4B, certain Google Gemma 4 variants and certain Qwen variants, whose output you can use as input for a big LLM, in order to accomplish some more complex task.
Honestly it's good enough that I feel comfortable recommending a Z.AI sub + a $20/mo OpenAI sub for all but the most AI pilled multi-orchestrators, or the die hard Claude fans. GLM writing + GPT reviewing/debugging feels pretty unlimited and minimally worse than just doing everything in GPT with the $200/mo plan.
Discovered today that they set reasoning effort to max by default. So that’s probably why
This is honestly what I care about the most now, which is how well they can write. I think we have reached a point now where, if you know how to program, you can provide enough information for the models to pretty much do what you need.
What they still struggle immensely with is the writing which has too many nuances but they are truly getting better.
am i missing something?
QWEN 3.6 27b is already pretty good, but it should be possible to get a better option now that runs on the same hardware, right?
I have been messing with an early NVFP4 quant of GLM 5.2 and so far, that model in its Max setting outperforms GPT 5.5 on its default setting. But GPT 5.5 still pulls ahead once I crank up its own reasoning effort. I imagine the same is true of Opus 4.x but haven't pitted them against each other yet.
GLM-5.2 is already close to Opus-4.7 level:
https://aibenchy.com/compare/anthropic-claude-opus-4-7-mediu...
Here are the results compared to Gemini 3.5 Flash:
Model + config             CodeErr/gen   Cost/gen   Median time   Quality
gemini-3.5-flash, low      0.71          $0.18      68s           baseline
GLM 5.2, reasoning high    0.61          $0.18      289s          -6.0%
GLM 5.2, reasoning off     1.52          $0.10      126s          -13.6%
Although it is cheaper, it is significantly slower, and results are worse overall. Surprisingly, high reasoning produces fewer code errors than gemini 3.5 flash, but when I actually look at the models they are worse.
Edit: I recently ran evals with Kimi 2.7 and MiniMax-M3 and this is clearly the open source SOTA model, by far.
I think there's probably some good juice to squeeze in terms of spatial awareness by doing a benchmark something like
- give 3d modelling task
- render and snapshot from a variety of angles
- feed to third-party vision model for a "what is this" type query
- grade on end-to-end accuracy
Bonus points for asking the vision model something like "how beautiful is this 1-10".
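The loop above is small enough to sketch in Python. Everything model-specific is passed in as a callable, and all three callables (`generate_model`, `render_views`, `describe`) are hypothetical placeholders you would have to wire up to a real CAD agent, renderer, and vision model. Grading by substring match is also just my simplifying assumption.

```python
from typing import Any, Callable, Sequence, Tuple

Task = Tuple[str, str]  # (3D modelling prompt, expected object label)


def run_benchmark(
    tasks: Sequence[Task],
    generate_model: Callable[[str], Any],                        # prompt -> 3D model
    render_views: Callable[[Any, Sequence[int]], Sequence[Any]],  # model, angles -> images
    describe: Callable[[Sequence[Any]], str],                    # images -> "what is this"
    angles: Sequence[int] = (0, 90, 180, 270),
) -> float:
    """Return end-to-end accuracy: the share of tasks the vision model recognises."""
    if not tasks:
        raise ValueError("need at least one task")
    hits = 0
    for prompt, expected in tasks:
        model = generate_model(prompt)          # 1. give 3D modelling task
        images = render_views(model, angles)    # 2. render from several angles
        answer = describe(images)               # 3. third-party "what is this" query
        if expected.lower() in answer.lower():  # 4. grade (naive substring match)
            hits += 1
    return hits / len(tasks)
```

The "how beautiful is this 1-10" bonus would be a second callable over the same images, averaged separately from the accuracy score.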
I was benchmarking using a soon to be released new version of my AI CAD modeling software[0]. It's basically an agent that has access to tools that can execute build123d scripts, get sculpted models, blender to combine sculpts + parametric models, tools to inspect the model (visually and with code), search datasheets, ...
I tried what you recommend a while ago (asking an AI to evaluate using different angles) and the AI evaluations were extremely bad - barely any correlation to what I scored. Things have gotten better, but I don't trust it enough yet.
Here is how I score adherence (and how AI did as well, but I tried methods where it would just give back a boolean "pass" or not):
<0.2 → Poor – Misses core intent; largely irrelevant or incorrect.
<0.4 → Weak – Partially relevant; significant omissions or errors.
<0.6 → Fair – Covers main points but lacks completeness or precision.
<0.8 → Good – Mostly accurate; minor gaps or deviations.
<=1.0 → Excellent – Fully aligned; precise, comprehensive, and faithful to intent.
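In code, the rubric above is just a threshold lookup. A minimal Python sketch follows. Rejecting scores outside 0 to 1 is my assumption, since the rubric doesn't say what happens there.

```python
# Exclusive upper bounds and labels, copied from the rubric above.
BANDS = [(0.2, "Poor"), (0.4, "Weak"), (0.6, "Fair"), (0.8, "Good")]


def adherence_label(score: float) -> str:
    """Map an adherence score in [0, 1] to its rubric label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    for upper, label in BANDS:
        if score < upper:
            return label
    return "Excellent"  # 0.8 up to and including 1.0
```

Note that the boundaries belong to the higher band: a score of exactly 0.8 is "Excellent", not "Good".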
Here is the scenario list (prompts are much more detailed):
dragon-bottle-stopper
editing-param-mid-conv
editing-parametric-enclosure
editing-swap-material-param
editing-text-edit-cube
multi-turn-bird-house
multi-turn-dice-tower
multi-turn-modular-planter
multi-turn-phone-stand
multi-turn-shelf
one-shot-bookend
one-shot-cable-clip
one-shot-chess-queen
one-shot-coaster
one-shot-coffee-cup
one-shot-dog-tag
one-shot-dragon-figurine
one-shot-hex-bracket
one-shot-keychain-fob
one-shot-low-poly-tree
one-shot-pegboard-hook
one-shot-pi4-case
one-shot-threaded-jar
[0]: https://grandpacad.com
Edit: Surprisingly very good results with 3.0 flash with high thinking.
Cost: $0.06
Duration: 3.22 min
Code Errors: 1.3 per attempt (meaning on average it had to retry 1.3 times)
Adherence was on par with 3.5 flash Low thinking
I work on mid-sized projects currently (200k to 1M lines of code).
Somebody wrote [1]; "I am never touching Minimax or GLM again. Their APIs had constant outages and I had to restart my runs multiple times — after burning money on the runs that failed midway." and I 100% agree.
The model might be good, but if the API is so bad, it's effectively useless.
[1]: https://kasra.blog/blog/i-spent-1500-seeing-if-llms-could-ha...
[0]: https://aibenchy.com/compare/deepseek-deepseek-v4-flash-high...
This means that models are losing more and more general and domain-specific knowledge.
Look at those graphs on Artificial Analysis, GLM-5.1 still performs similarly or better:
AA-Omniscience Accuracy: https://i.snipboard.io/5DYmpx.jpg
IFBench: https://i.snipboard.io/74kg0R.jpg
I still feel like models are not getting any smarter for a few months already, they just changed their training to be focused more on some areas than others, so shifting the intelligence from one place to another, not necessarily increasing the overall intelligence or "AGI" score.
OpenAI has big incentives to improve general intelligence, as a large percentage of users use ChatGPT for support, finances, questions, etc. Not just coding.
I tested this myself a few months ago, and confirmed that it was really happening.
LLMs don't know who they are unless the system prompt tells them, and as all of them are trained on model responses that exist on the web that end up being scraped, the weights may predict a certain incorrect response. LLMs have no ability to introspect, and do not know anything about themselves, so they will hallucinate in response to that question unless they are carefully trained on that exact, pointless question.
Data at https://gertlabs.com/rankings
We find a lot of interesting anomalies with our benchmark that hold up under large sample sizes.
Excited to see if this turns out to be an open-weight Opus 4.5 or better.
I've had models that benched poorly but performed great. And I constantly see models at near the top of AA, which are terrible.
There doesn't necessarily seem to be a lot of overlap between benchmarks and real world usage. (Let alone common sense!)
As far as they go, though, these harder benchmarks match my experience more closely:
and https://cognition.ai/blog/frontier-code
Where we see "top" models drop way down in score when given longer tasks.
That being said, I've had a reasonably pleasant time with GLM-5.2 so far. (And have had an OK time with DeepSeek as well.)
By the time I'm done testing all the Chinese models, they'll be obsolete :)
DeepSeek V4 has been quite amazing in our workloads and it operates at a fraction of the cost. I have not tried GLM 5.2 but it seems that it hits a sweet spot.
I'm not interested in using AI to write code that would have taken me 5-10 minutes to write myself. I use AI to debug complex bugs and develop large features that span multiple domains - stuff that normally takes hours, if not days/weeks. A model that is "enough for 95%" does not cut it for that, because the failures compound during long-horizon tasks and the thing becomes a mess.
I feel like I threw 15 dollars in the sea. I'm getting rate limited after 3-4 prompts. You get way less value than just paying 25 dollars for Claude or OpenAI models.
My workflow is usually:
- read file. I want to achieve X, how do? Do not implement anything.
- I would do a, b and c
- sketch a brief implementation of your suggestion
- <code> (not writing files yet)
- instead of your approach x, wouldn't it make sense to instead do z? What would that look like?
- <code>
- nice, implement this
- starts writing files, run tests, etc.
I had the Lite plan, I NEVER maxed out the quota because I considered these things. If I, for example, switched over to GLM-5-Turbo, then I could've easily burned through quota.
The idea is that benchmark score comparisons are useful for a large cross-product comparison across models + their settings, but less useful if you're looking for the best model for <your-specific-task>. So I thought having a place to review and comment could be beneficial to people.
I'm not sure how best to get the corpus bootstrapped (i.e. people will likely only visit/post on the site if there's already activity), so posting it here for anyone who'd like to contribute.
To all people on Hacker News, I am curious what agent harness you are using it with.
Previously I was using opencode, and then I switched to Opencode + obra/superpowers and creating custom skill.md files for it myself. Things take more time and I have to intervene more, but the results have been better.
Now I have also started using oh-my-pi, and I found it to be faster than Opencode.
I am unsure how much of a difference there really is and how much is placebo, but what is your opinion on the best agent harness for GLM 5.2?
I signed up to their max plan yesterday, did some light coding work, and I'm at 180M tokens used and 40% weekly quota gone.
Even when tokenmaxxing on the Claude Max or GPT $200 plan, I couldn't get more than 20% quota gone per day.
That's the one benchmark that allows LLMs to answer "I don't know" and punishes them for trying to bullshit their way through the questions
https://swelljoe.com/post/will-it-mythos/
(This small benchmark doesn't prove anything. It's a limited data set and each model only gets one shot at each file in the corpus. But, I find it useful for quickly sussing out if a model can reason about pretty complicated problems in code.)
I haven't extensively used 5.2 yet, but it seems a lot better.
- codex 5.5 medium - best results less hand holding medium speed
- opus 4.8 max - mediocre with hand holding medium speed
- glm 5.2 max - mediocre with hand holding and super slow
- composer 2.5 - mediocre with hand holding and super fast
I use all of them, since I run multiple coding agents in parallel. Disclosure: I use rexide, which we created so all these agents can run in parallel with good visibility and feedback.
The requirements to run this model locally: https://www.reddit.com/r/LocalLLaMA/comments/1u8ai2a/glm52_i...
I remember when there was hype around GLM 5 reaching great heights on benchmarks but eventually failing on practical coding and reasoning tasks. I guess this time the hype is real.
This is silly but I dig how 753 is very close to 745, which is the watts in a HP. 1bHP parameter model. Silly, but I enjoy it.
This is a great step up in open models however the pricing to support z.ai is not far cheaper than Claude / OpenAI subscription
Their servers are melting though - getting more timeouts etc
That is unfortunate...