GLM-5.2 is the new leading open weights model on Artificial Analysis

(artificialanalysis.ai)

890 points · himata4113 · 5d ago · 442 comments

221 comments · 59 top-level

unrvl225d ago· 29 in thread

Why aren't more people talking about this? It's literally Opus 4.7 quality at stupid prices. I know providers who are offering this at unlimited tokens for $50 a month. Some are even offering API rates 3x lower than the official ZAI API rates, which are already like 10x cheaper than Opus. (Crof and Umans btw)

This is a huge blow to Anthropic/OpenAI/Google and a massive win for the rest of the world. The official API prices and speeds mean nothing for open source models.

CuriouslyC5d ago

Be careful about unofficial providers, a lot of them misconfigure models or stealth quantize them. For a while the difference between Kimi on the official API and most third party providers was 20-40%.

thehamkercat5d ago

Kimi K2 had a vendor verifier: https://github.com/MoonshotAI/K2-Vendor-Verifier

(there's a table which shows comparison between vendors)

Also, it seems there's a general one as well (for all kimi models?): https://github.com/MoonshotAI/Kimi-Vendor-Verifier

cedws5d ago

OpenRouter should be penalising or banning for this.

unrvl225d ago

the 2 I mentioned both have a fairly large following, who run benchmarks and absolutely will spot issues.

stanac5d ago

> Some are even offering API rates at 3x lower than the official ZAI api rates

Looking at OpenRouter [1], some of the cheaper offerings are for quantized models. Not sure how much intelligence is lost in quantization. And they are not 3 times cheaper. Where did you find 3x lower prices for APIs? I am considering skipping OpenRouter and using them directly for that price.

edit:

I see, croft [2] 8bit for $0.50/$0.08/$2.20

[1]: https://openrouter.ai/z-ai/glm-5.2

[2]: https://ai.nahcrof.com/pricing

benjiro295d ago

Neuralwatt ... When you reverse calculate the actual energy usage / price on a token basis, the gap is large.

I do not have GLM 5.2 numbers because the whole default max setting is overkill. But GLM 5.1 numbers had it at 12x cheaper than API rates, and about 2.5x more tokens vs. Z.ai's own subscription service.

Yes, it's FP8, but let's be honest: do we know for sure that even Z.ai runs at FP16? I learned a long time ago with Claude and Codex how much cheating happens at the model level, even from the big boys.

scrlk5d ago

IME, unquantised -> FP8 is pretty much lossless. What matters more is having an unquantized KV cache - using an FP8 KV cache can result in a significant drop in quality.

Schiendelman5d ago

To answer the question in your first sentence - because it's VERY computationally (ha) expensive as a human being to keep up with all the options. It's also very hard to figure out how to run a model like this. There's no installer. If you really really care, which 99% of people do not, you have to google a guide, and then find out it's out of date...

I've tried a number of these, and the learning curve is very steep compared to "install Claude Code and pay $100/mo". There is no way saving me $50/month matters compared to figuring that out.

andai5d ago

But it just works with Claude Code? They have a guide on their website.

https://docs.z.ai/devpack/tool/claude

Here's my setup. I add this to my .bashrc

export ZAI_API_KEY="your_key_here"

alias claudez='ANTHROPIC_AUTH_TOKEN="$ZAI_API_KEY" ANTHROPIC_BASE_URL="https://api.z.ai/api/anthropic" ANTHROPIC_DEFAULT_OPUS_MODEL="glm-5.2[1m]" ANTHROPIC_DEFAULT_SONNET_MODEL="glm-4.7" ANTHROPIC_DEFAULT_HAIKU_MODEL="glm-4.7" claude'

Then I just run claudez
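For anyone who would rather not keep a long alias in .bashrc, the same idea can be written as a small launcher. This is only a sketch under assumptions: the environment variable names and values are copied from the alias above, `claude` is assumed to be on PATH, and `build_env` and `launch` are hypothetical helper names, not part of any tool.

```python
import os
import subprocess

# Values copied from the alias above; not verified against Z.ai's docs.
ZAI_OVERRIDES = {
    "ANTHROPIC_BASE_URL": "https://api.z.ai/api/anthropic",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "glm-5.2[1m]",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "glm-4.7",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "glm-4.7",
}


def build_env(base_env):
    """Return a copy of base_env with the Z.ai overrides applied."""
    if "ZAI_API_KEY" not in base_env:
        raise KeyError("ZAI_API_KEY is not set")
    env = dict(base_env)
    env.update(ZAI_OVERRIDES)
    # The alias passes the Z.ai key as the Anthropic auth token.
    env["ANTHROPIC_AUTH_TOKEN"] = base_env["ZAI_API_KEY"]
    return env


def launch(args=()):
    """Run `claude` with the overridden environment (assumes it is on PATH)."""
    return subprocess.call(["claude", *args], env=build_env(os.environ))
```

Calling `launch()` would behave like running `claudez`; the overrides only apply to that child process, so a plain `claude` session elsewhere is unaffected.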

pro tip the same thing works with deepseek https://api-docs.deepseek.com/guides/anthropic_api

Even more pro tip: Claude Code can set this up for you haha

re-thc5d ago

> There's no installer.

There's ZCode (https://zcode.z.ai). Which is like the Codex App.

That's as "easy" as it is for non-devs that you're complaining about.

CamperBob25d ago

> It's also very hard to figure out how to run a model like this. There's no installer.

Yes, there is. It's called Claude Code. Point it at the HuggingFace URL and say "Download these weights and build whatever is needed to run them, then test the model."

chillfox5d ago

install opencode, then either pay $10 for their plan, or add an openrouter api key.

gerryf25d ago

I agree with this.

I'd pay for an out of the box solution. i.e. an Installer with updates

cedws5d ago

In my org everyone is extremely Claude-pilled to the point you’d think it’s the only LLM that exists, purely because it caters to non-engineers within enterprises.

unrvl225d ago

I cancelled my claude sub after realizing I can burn 300m tokens a day of this quality, for $50 a month.

spelk5d ago

Which coding plan are you using? How are you finding it?

embedding-shape5d ago

> Why aren't more people talking about this?

Wasn't this released like 2 days ago? Everyone is still evaluating and playing around with it; things like the submission are just starting to come out. Give it some days at least before jumping to conclusions, ideally weeks.

sinatra5d ago

I've tried Chinese open models a few times before. They were fine, but they didn't come close to the benchmarks they were claiming.

Now, maybe GLM 5.2 is close to Opus 4.7, but I don't wanna keep checking them and keep finding that they're still benchmaxing and aren't at GPT (my choice) or Opus level. The boy who cried wolf, I guess.

enraged_camel4d ago

Yes, my experience has been the same as yours. I find that the performance of open models is quite acceptable, even good, at one-off questions or small tasks. But they are quite unreliable at long horizon goals.

shostack5d ago

Which of those providers are:

1. Keeping your data private and in the US

2. Not training on it

3. Not quantizing the model

4. Offering reasonable latency and rate limits

SyneRyder4d ago

OpenRouter has a list of providers, looks like NovitaAI would meet those criteria. Though not for $50/mth for 80/M tokens, which I assume is the Z.ai subscription pricing.

https://openrouter.ai/z-ai/glm-5.2

https://novita.ai/models/model-detail/zai-org-glm-5.2

knollimar5d ago

Isn't it closer to sonnet?

RussianCow5d ago

The Chinese open weight models have been ahead of Sonnet (at least for coding) for a couple months now. I tend to take benchmarks with a huge grain of salt, but in my own experience, the latest versions of Kimi, MiMo, and GLM (pre-5.2) had already surpassed Sonnet in terms of output quality for a fraction of the price.

With that said, I'm excited to try GLM 5.2 because I still end up reaching for Opus and GPT 5.5 for many tasks because the open models tend to get stuck more often on complex problems.

redox995d ago

Definitely opus level for coding.

Hamuko5d ago

I’m not that interested in models that I can’t run on my desktop for ~0€, which is my AI budget.

andai5d ago

Electricity cost seems to be about $30/month for a 32B model on a GPU. It's probably better on Apple hardware.

https://github.com/QuantiusBenignus/Zshelf/discussions/2

Not accounting for hardware, of course :)
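The ~$30/month figure is easy to sanity-check with back-of-the-envelope arithmetic. The sketch below assumes a 350 W average draw, 24/7 operation, and $0.12 per kWh; all three numbers are illustrative assumptions, not taken from the linked discussion.

```python
def monthly_electricity_cost(avg_watts, hours_per_day, usd_per_kwh, days=30):
    """Energy used in a month (kWh) times the price per kWh."""
    kwh = avg_watts / 1000 * hours_per_day * days
    return kwh * usd_per_kwh


# Assumed inputs: 350 W average draw, running around the clock, $0.12/kWh.
cost = monthly_electricity_cost(avg_watts=350, hours_per_day=24, usd_per_kwh=0.12)
print(f"${cost:.2f} per month")  # 252 kWh at $0.12 is about $30
```

A machine that only runs inference a few hours a day, or sits on cheaper power, lands well below that; pricier electricity pushes it above.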

igravious5d ago

Cool beans. You're not the target audience then.

anuramat5d ago

> unlimited tokens for $50 a month

link?

> Why

imho everything but opus produces unusable code (fable was even better...), eg gpt5.5 seems to write the absolute worst code that still technically solves the problem; tbh I'd be totally willing to trade "raw intelligence" for "code taste"

more labs need to figure out whatever anthropic did to destroy everybody else on frontiercode bench

CuriouslyC5d ago

Opus has the nickname "Slopus" in a lot of circles for a reason. It can write nice code in isolation, but the way it organizes that code and its rigor in addressing edge cases/making sure things are robust leave a lot to be desired. Opus is particularly famous for having a real problem reinventing stuff that already existed in the codebase because it wanted to get to work before exploring sufficiently.

Tiberium5d ago· 28 in thread

It seems to really be a nice step-up and is getting quite close to the frontier. I wish they'd start focusing on the reasoning efficiency now, though. I have a simple (relatively) test task to evaluate LLMs: writing a simple math evaluator library in Nim (it's about 400-600 lines total max), and GLM 5.2 (xhigh which maps to max effort) spent over 15 minutes (!) reasoning, spending about 45k tokens, before it finally wrote the first file.

I know it's hard to improve on that, but now that their models are good enough at raw intelligence, I think this should become a higher priority task.

Currently on https://artificialanalysis.ai/#output-tokens GPT 5.5 xhigh spends 16k tokens total on average, GPT 5.5 high is 10k, Fable 5 33k, Opus 4.8 41k, GLM 5.2 is 42k. GPT 5.5 is extremely reasoning efficient.

Of course if you convert those values to actual request cost, GLM 5.2 will probably beat GPT 5.5/Opus 4.8, but speed matters for a lot of people, I think.
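To make the tokens-to-cost conversion concrete without guessing at anyone's price list, it helps to compute the break-even point instead. A rough sketch using only the average output-token counts quoted above; it ignores input tokens, caching, and speed, and the function name is made up for illustration.

```python
# Average output tokens per task, as quoted above from Artificial Analysis.
AVG_OUTPUT_TOKENS = {
    "GPT 5.5 xhigh": 16_000,
    "GPT 5.5 high": 10_000,
    "Fable 5": 33_000,
    "Opus 4.8": 41_000,
    "GLM 5.2": 42_000,
}


def break_even_price_ratio(tokens_a, tokens_b):
    """Fraction of A's per-token price below which B is cheaper per request.

    Cost per request is tokens * price, so B wins when
    price_b / price_a < tokens_a / tokens_b.
    """
    return tokens_a / tokens_b


ratio = break_even_price_ratio(
    AVG_OUTPUT_TOKENS["GPT 5.5 xhigh"], AVG_OUTPUT_TOKENS["GLM 5.2"]
)
print(f"GLM 5.2 is cheaper per request if its output price is below {ratio:.0%} of GPT 5.5's")
```

In other words, with these token counts GLM 5.2 needs an output price under roughly 38% of GPT 5.5 xhigh's to come out ahead per request.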

benjiro295d ago

GLM 5.2 Max = Opus 4.8 Max in thinking behavior. The thinking chain is so similar, and so is the amount of token usage on the output.

If you want reasonable token usage, you need to run GLM 5.2 at High. There is little drop in quality from Max to High (for most tasks), and it cuts token usage by 2 to 2.5x. GLM 5.2 Max is really something you only need for complex tasks.

In essence, GLM 5.2 is Opus 4.8's little brother, at a way, WAY cheaper price.

There has been really no training on Opus models going on, really, none i tell you! /sarcasm

matheusmoreira5d ago

> GLM 5.2 Max = Opus 4.8 Max in thinking behavior

This is insane! I can't wait until technology progresses to the point we can run these things on consumer hardware!

FooBarWidget5d ago

With such ridiculously long thinking traces I'm surprised max outperforms high. After all, performance falls off a hill after a certain amount of context, and long thinking traces can fill that up really quickly.

maxdo5d ago

looking at the score this is rather a gemini 3.5 flash competitor, yes, for cheaper, but distance to opus and fable is as big as their price diff.

vitalyan1235d ago

distillation of thinking models is not particularly effective - both "Open"AI and Misanthropic don't show you the real chain of thought, only its severely downscaled version. both do everything in their power to combat such outrageous copyright infringement, so the bulk of unethically scraped data the Chinese have is from several generations ago.

alexjplant5d ago

> It seems to really be a nice step-up and is getting quite close to the frontier.

IMHO it's already surpassed them. I vastly prefer my personal GLM and OpenCode setup to the Claude Code and Opus one that I have to use at work. The former makes way fewer StackOverflow brogrammer-tier mistakes and is considerably better at following instructions. The harness UX is also vastly superior as it doesn't ignore, randomly change, or incorrectly report settings.

Maybe it's the harness and I'd have even greater success with OpenCode and Anthropic, but I think it safe to say that Anthropic's moat is evaporating.

carter20994d ago

You would be surprised at how much of an impact the harness has. I switched to Pi and chinese open source models, and models that _I know_ are less capable than sonnet outperform my sonnet + claude code stack at work.

vorticalbox5d ago

A problem I find with Opus is that it will spend so long thinking, then go “but wait, what if”

To the point where I stop it and simply tell it to “start writing code, you can work it out as you go along”

Seems writer's block also affects LLMs

robertkarl5d ago

https://arxiv.org/abs/2606.00206

In this paper they nerf an LLM's ability to emit waffling thinking tokens like "wait", "but", "alternatively", and the models (they're old, small models in the paper) terminate reasoning faster and perform better. I bet Anthropic is tuning this on their backend.

giancarlostoro5d ago

I usually have Claude build a plan first, then I put it into an XML file it updates with phases. Usually we talk about some of those tasks, and then once it's good and I like it, I have Claude implement the plan.

Another thing I tell Claude to do is to not guess, but look at documentation, it messes up a lot less, might use some tokens reading docs, but at least it has a higher success rate code wise.

mikeocool5d ago

Seriously. Whenever I read the thinking output I get mad and turn down effort to medium or low.

Just output the code and we’ll work through it!

I feel similarly about having codex review claude’s plans. I don’t think I’ve ever seen it catch a major issue. It just points out things that would have inevitably been addressed during implementation anyway.

epolanski5d ago

Fable was 20 times worse on that.

It's clear it was the vibe coding model, as like no other model before, fully turned you into his assistant instead of the other way around.

drob5185d ago

Qwen is notorious for this, too. It’ll sometimes spin in a long loop of “But wait…” paragraphs.

thinkingtoilet5d ago

I've been having success with Opus but you REALLY have to tame it. Long prompts that list what files to look at, relationships between entities, etc... I went from regularly hitting my daily limit to almost never hitting it. Oh, and also I was being lazy with small changes and stopping that helped a lot too. As you said, it gets in these loops where it's just churning and if you don't stop it it can go on for way too long.

h14h5d ago

Hopefully the recent work Moonshot did with Kimi K2.7 Code trickles in to the other open-model labs.

Per AA, while K2.7 Code is roughly on par w/ K2.6 in terms of intelligence, it uses half the output tokens to get there.

h14h3d ago

I've been doing some testing with GLM 5.2 on Fireworks and it looks like the "High" reasoning level uses fewer tokens than even K2.7 Code by a considerable margin (roughly half).

Don't have any evals indicating how it compares on upper-bound quality, but for a well-defined task it seems like GLM 5.2 on "High" is remarkably token efficient. Looking forward to seeing where it lands on the AA index.

bertili5d ago

This is GLM 5.2 Max. GLM 5.2 High uses less than half[1] the tokens.

[1] https://z.ai/blog/glm-5.2

Tiberium5d ago

Yes, but the Artificial Analysis result is also from GLM 5.2 (max), not high.

cmrdporcupine5d ago

> Of course if you convert those values to actual request cost, GLM 5.2 will probably beat GPT 5.5/Opus 4.8, but speed matters for a lot of people, I think.

GLM5.2 ends up being far more expensive than I thought it would be when I tried it on openrouter. I ground through $5 USD worth of tokens quite quickly.

And this was high, not max.

guelo5d ago

Using these open models really makes you realize how subsidized Anthropic's and OpenAI's subscription plans are.

esafak5d ago

I agree. I've noticed that it is quite smart but it has a tendency to doubt itself and overthink. I monitor its internal dialogue and prod it when it does this. They need to optimize the chain of thought early stopping.

abgruszecki4d ago

Agreed that models should get better at working with rare programming languages like Nim! Using them tends to confuse agents a lot in general. We're working on a paper right now where we compare how token-efficient models are when trying to implement the exact same program in different programming languages, and that's one of the trends we're seeing.

robmccoll5d ago

That's interesting. I gave nearly the same task to Gemma4 31b as a test yesterday. Write a symbolic math engine in Typescript that can perform evaluation and simple expression reductions over +-/*(). It performed the task correctly with minimal reasoning - much fewer reasoning tokens than output tokens.

gbingles5d ago

Tbh, so what? I googled "symbolic math engine in Typescript that can perform evaluation and simple expression reductions over +-/*()" and got what looks to be viable answers without using any AI model at all. Reciting well established things from memory isn't terribly interesting. Show it a novel codebase and have it implement something within it.

rdsubhas5d ago

As per stats in other comments, it is frontier, not close to frontier.

xyzsparetimexyz5d ago

Reminiscent of https://en.wikipedia.org/wiki/Portia_(spider)

HWR_144d ago

I thought you could not compare tokens across models because their cost and speed was so different between models.

nurumaik5d ago

You asked for maximum effort, you got maximum effort

kristopolous5d ago· 16 in thread

I have a script that ranks these based on codingindex from Artificial Analysis.

All it does is pull a JSON from their main table page and parse it for the fields I care about (coding).

There used to be a mailing list associated with it but eh ... there wasn't much interest. I use the script every day though.

Current partial output

  score  age  size   name
  47.1   58   large  Kimi K2.6
  47.5   54   large  DeepSeek V4 Pro (Reasoning, Max Effort)
  47.5   70   -      Muse Spark
  47.6   132  -      Claude Opus 4.6 (Non-reasoning, High Effort)
  47.8   205  -      Claude Opus 4.5 (Reasoning)
  48.1   132  -      Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
  48.6   55   -      GPT-5.5 (Non-reasoning)
  48.7   188  -      GPT-5.2 (xhigh)
  50.1   29   -      Qwen3.7 Max
  50.7   1    large  GLM-5.2 (max)
  50.9   120  -      Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
  51.5   92   -      GPT-5.4 mini (xhigh)
  52.1   55   -      GPT-5.5 (low)
  52.5   62   -      Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
  53.1   132  -      GPT-5.3 Codex (xhigh)
  53.1   62   -      Claude Opus 4.7 (Non-reasoning, High Effort)
  55.5   118  -      Gemini 3.1 Pro Preview
  56.2   55   -      GPT-5.5 (medium)
  56.7   20   -      Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
  57.2   104  -      GPT-5.4 (xhigh)
  58.5   55   -      GPT-5.5 (high)
  59.1   55   -      GPT-5.5 (xhigh)
  62     8    -      Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)

To see everything, run it like so

  $ curl day50.dev/art-analysis.sh | bash

The repo: https://github.com/day50-dev/aa-eval-email
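For readers who don't want to pipe a remote script into bash, the core of the idea (parse JSON, drop unscored entries, sort by coding score, print a table) fits in a few lines. This is a sketch, not the script from the repo: the field names `coding_index`, `age_days`, and `size` are hypothetical, the real Artificial Analysis payload may look different, and the "Unscored model" row is invented to show the filtering.

```python
import json

# Stand-in for the downloaded payload; field names are assumptions.
SAMPLE_JSON = """[
  {"name": "GLM-5.2 (max)", "coding_index": 50.7, "age_days": 1, "size": "large"},
  {"name": "Qwen3.7 Max", "coding_index": 50.1, "age_days": 29, "size": null},
  {"name": "Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)",
   "coding_index": 62, "age_days": 8, "size": null},
  {"name": "Unscored model", "coding_index": null, "age_days": 3, "size": null}
]"""


def rank(records, descending=False):
    """Drop entries without a coding score, then sort by that score."""
    scored = [r for r in records if r.get("coding_index") is not None]
    return sorted(scored, key=lambda r: r["coding_index"], reverse=descending)


def format_rows(records):
    """Render records as fixed-width text lines, header first."""
    lines = [f"{'score':>5}  {'age':>4}  {'size':<5}  name"]
    for r in records:
        lines.append(
            f"{r['coding_index']:>5.1f}  {r['age_days']:>4}  "
            f"{r['size'] or '-':<5}  {r['name']}"
        )
    return lines


ranked = rank(json.loads(SAMPLE_JSON))
print("\n".join(format_rows(ranked)))
```

The default is ascending, so the best model ends up on the last line, right above the prompt.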

some key takeaways:

* open models are on about a 4-7 month lag right now depending on how you want to measure it

* if this keeps up, you might see an open-weights model doing claude fable 5 level work before the new year.
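One way to put a number on that lag, using rows from the table above: take the best open-weights model, find the oldest closed model that already scored at least as high, and subtract the ages. This is only one possible definition of "lag", and treating the rows marked "large" as the open-weights ones is an assumption about what the size column means.

```python
# (name, coding score, age in days, open weights?) copied from the table above.
ROWS = [
    ("GLM-5.2 (max)", 50.7, 1, True),
    ("Qwen3.7 Max", 50.1, 29, False),
    ("Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)", 50.9, 120, False),
    ("GPT-5.4 mini (xhigh)", 51.5, 92, False),
    ("Claude Opus 4.7 (Adaptive Reasoning, Max Effort)", 52.5, 62, False),
    ("GPT-5.3 Codex (xhigh)", 53.1, 132, False),
    ("Gemini 3.1 Pro Preview", 55.5, 118, False),
    ("GPT-5.4 (xhigh)", 57.2, 104, False),
]


def lag_days(rows):
    """Days between the best open model and the oldest closed model at or above its score."""
    best_open = max((r for r in rows if r[3]), key=lambda r: r[1])
    matching_closed = [r for r in rows if not r[3] and r[1] >= best_open[1]]
    oldest = max(matching_closed, key=lambda r: r[2])
    return oldest[2] - best_open[2], oldest[0]


days, reference = lag_days(ROWS)
print(f"{days} days behind {reference}")
```

With these rows the answer is 131 days, a bit over four months, which is at the low end of the 4-7 month range.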

if people sign up for the free mailing list (that just does this) I'll go and put it back on ... emails when new model evals drop - it was pretty useful.

papersail5d ago

  score  age  size   name
  62.0   8    -      Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
  59.1   55   -      GPT-5.5 (xhigh)
  58.5   55   -      GPT-5.5 (high)
  57.2   104  -      GPT-5.4 (xhigh)
  56.7   20   -      Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
  56.2   55   -      GPT-5.5 (medium)
  55.5   118  -      Gemini 3.1 Pro Preview
  53.1   132  -      GPT-5.3 Codex (xhigh)
  53.1   62   -      Claude Opus 4.7 (Non-reasoning, High Effort)
  52.5   62   -      Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
  52.1   55   -      GPT-5.5 (low)
  51.5   92   -      GPT-5.4 mini (xhigh)
  50.9   120  -      Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
  50.7   1    large  GLM-5.2 (max)
  50.1   29   -      Qwen3.7 Max
  48.7   188  -      GPT-5.2 (xhigh)
  48.6   55   -      GPT-5.5 (Non-reasoning)
  48.1   132  -      Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
  47.8   205  -      Claude Opus 4.5 (Reasoning)

tcp_handshaker5d ago

Short comments...

- GPT 5.5 is consistently the best, an opinion that gets me constant downvotes here from the Anthropic Marketeer strike force...

- China is going to eat the US lunch on AI

- What have European universities and companies been doing? It's as if, in a parallel past/future, Nikola Tesla and Edison had created flying cyberpunk machines while European researchers were getting together to request EU funds to investigate how to breed faster horses.

- If Zuckerberg could be fired, after spending a total of $235 billion on AI and having NOTHING to show for it... should he be fired?

christoff125d ago

Lol thank you for sorting.

Are the scores here normalized such that each point difference is equidistant?

papersail5d ago

  rank  score  age  size   name
  1     62.0   8    -      Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
  2     59.1   55   -      GPT-5.5 (xhigh)
  3     58.5   55   -      GPT-5.5 (high)
  4     57.2   104  -      GPT-5.4 (xhigh)
  5     56.7   20   -      Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
  6     55.5   118  -      Gemini 3.1 Pro Preview
  7     53.1   62   -      Claude Opus 4.7 (Non-reasoning, High Effort)
  8     53.1   132  -      GPT-5.3 Codex (xhigh)
  9     52.5   62   -      Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
  10    51.5   92   -      GPT-5.4 mini (xhigh)
  11    50.9   120  -      Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
  12    50.7   1    large  GLM-5.2 (max)
  13    50.1   29   -      Qwen3.7 Max
  14    48.7   188  -      GPT-5.2 (xhigh)
  15    48.1   132  -      Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
  16    47.8   205  -      Claude Opus 4.5 (Reasoning)
  17    47.6   132  -      Claude Opus 4.6 (Non-reasoning, High Effort)
  18    47.5   70   -      Muse Spark
  19    47.5   54   large  DeepSeek V4 Pro (Reasoning, Max Effort)
  20    47.1   58   large  Kimi K2.6
  21    47.1   29   -      Gemini 3.5 Flash (minimal)
  22    46.7   449  -      Gemini 2.5 Pro Preview (Mar' 25)
  23    46.5   211  -      Gemini 3 Pro Preview (high)
  24    46.5   16   -      Qwen3.7 Plus
  25    46.4   120  -      Claude Sonnet 4.6 (Non-reasoning, High Effort)
  26    45.6   5    large  Kimi K2.7 Code
  27    45.6   104  -      GPT-5.4 (low)
  28    45.5   56   large  MiMo-V2.5-Pro
  29    45.1   43   -      GPT-5.5 Instant (May 2026)
  30    45.0   29   -      Gemini 3.5 Flash (high)
  31    44.9   58   -      Qwen3.6 Max Preview
  32    44.7   216  -      GPT-5.1 (high)
  33    44.2   188  -      GPT-5.2 (medium)
  34    44.2   126  large  GLM-5 (Reasoning)
  35    43.9   92   -      GPT-5.4 nano (xhigh)
  36    43.4   71   large  GLM-5.1 (Reasoning)
  37    43.4   16   large  MiniMax-M3
  38    43.2   54   large  DeepSeek V4 Pro (Reasoning, High Effort)
  39    43.0   188  -      GPT-5.2 Codex (xhigh)
  40    42.9   76   -      Qwen3.6 Plus
  41    42.9   205  -      Claude Opus 4.5 (Non-reasoning)
  42    42.6   182  -      Gemini 3 Flash Preview (Reasoning)
  43    42.2   99   -      Grok 4.20 0309 (Reasoning)
  44    42.1   56   large  MiMo-V2.5
  45    41.9   91   large  MiniMax-M2.7
  46    41.4   91   -      MiMo-V2-Pro
  47    41.3   121  large  Qwen3.5 397B A17B (Reasoning)
  48    41.0   48   -      Grok 4.3 (high)
  49    40.5   71   -      Grok 4.20 0309 v2 (Reasoning)
  50    40.5   342  -      Grok 4
  51    39.8   54   large  DeepSeek V4 Flash (Reasoning, High Effort)

A longer curated list based on kristopolous’ list, with more models included. For each model, I kept only the two highest-scoring entries. I used DeepSeek V4 Flash as the cutoff, since I consider it the lowest acceptable model that is still locally deployable.
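The curation rule described above (at most two entries per model, nothing below the DeepSeek V4 Flash score) can be sketched as follows. Grouping by the part of the name before the first parenthesis is an assumption about how entries were grouped, and `family` and `curate` are hypothetical names.

```python
from collections import defaultdict


def family(name):
    """Model name without the parenthesised effort/variant suffix."""
    return name.split(" (", 1)[0]


def curate(rows, per_family=2, cutoff=39.8):
    """Keep the top `per_family` entries of each model at or above `cutoff`."""
    kept = defaultdict(list)
    for name, score in sorted(rows, key=lambda r: r[1], reverse=True):
        if score >= cutoff and len(kept[family(name)]) < per_family:
            kept[family(name)].append((name, score))
    flat = [row for group in kept.values() for row in group]
    return sorted(flat, key=lambda r: r[1], reverse=True)


# A few rows from the tables in this thread.
rows = [
    ("GPT-5.5 (xhigh)", 59.1),
    ("GPT-5.5 (high)", 58.5),
    ("GPT-5.5 (medium)", 56.2),
    ("GPT-5.5 (low)", 52.1),
    ("GLM-5.2 (max)", 50.7),
    ("DeepSeek V4 Flash (Reasoning, High Effort)", 39.8),
]
for name, score in curate(rows):
    print(f"{score:5.1f}  {name}")
```

On this sample the medium and low GPT-5.5 entries drop out, which matches their absence from the ranked list above.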

bel85d ago

you left some models out like DeepSeek and Kimi, for example.

alecco5d ago

Consider using decrementing score order (best on top)

kristopolous5d ago

then I'd have to scroll up over 500 lines after running it every time to see what I care about.

But if that's your thing, here you go: https://github.com/day50-dev/aa-eval-email/commit/1853be6461...

add an argument (any argument) and it will be sorted as you specified. It just works as a toggle flipping the order ... so literally any string will do.

The original link has been updated accordingly with the new code.
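The "any argument flips the order" behaviour amounts to checking whether an argument is present and ignoring its value. A minimal sketch of that toggle, not the code from the commit; the function names are made up.

```python
import sys


def descending_requested(argv):
    """Any extra argument, whatever its value, flips the order."""
    return len(argv) > 1


def order(scores, argv):
    """Ascending by default (best at the bottom), descending when toggled."""
    return sorted(scores, reverse=descending_requested(argv))


if __name__ == "__main__":
    print(order([50.7, 62.0, 47.1], sys.argv))
```

Running it bare prints the scores lowest first; running it with `desc`, `x`, or any other string prints them highest first.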

sosodev5d ago

Note that AA's coding index is only made up of two benchmarks: Terminal-Bench Hard and SciCode. I'm skeptical that it makes a good coding index. It ranks Gemma 4 31B above Deepseek V4 Flash. Having used both of those models for a broad variety of coding tasks I would choose Deepseek every day.

bodhi_mind5d ago

Cool project! Side note: Kind of a bad practice imo to ask people to blindly execute bash from an unknown source.

slig5d ago

Thanks for sharing. I'm curious: why didn't you sort with the score descending?

kristopolous5d ago

Because it's currently 511 lines. Why would I want to scroll up to see the stuff I care about? Don't you want the relevant stuff to be right there in front of you?

fridder5d ago

Not OP but if you run this from the CLI it does make the ordering make a little more sense

snsnbsne5d ago

Because programmers can’t figure out how to have a CLI that prints in a normal order, with the newest stuff on top instead of on the bottom.

Set up a fresh new large monitor. Open CLI. Run command. Watch output at the bottom of your screen. Keep watching the bottom of your screen for the rest of the day.

Sure you can tile windows and it helps but come on. Just have the command/input section in the bottom and the “output” on top. Keep the command bit on the bottom.

jarjoura5d ago

Seems legit. My experiments with GLM-5.2 so far have resulted in strange hallucinations in the tiniest of places. Like a wrong variable name.

It seems like it's up for the task of complex code, but those little paper-cuts are scary to me. I wouldn't trust this model for anything remotely serious.

scrollop5d ago

Would be interesting to see where gpt 5.5 pro extended is.

drob5185d ago

Maybe your script could sort based on score.

mrngld5d ago· 10 in thread

Artificial Analysis coding benchmark shows GLM5.1 on high pretty close to GPT5.5 xhigh in cost to run, with GPT5.5 on medium significantly less expensive. Compared to GPT5.5 medium, GLM5.1 xhigh is twice the cost and half the intelligence. They don't have GLM5.2 on there yet, but that'd be a big gap to bridge.

https://artificialanalysis.ai/agents/coding-agents?coding-ag...

I thought I was "holding it wrong" until DeepSWE came along -- personally it seems to match my own experiences pretty well. Really makes me wonder how legitimate some of the internet noise is about open models. There's surely some use cases for them, not everything needs the absolute frontier (GPT5.5 on low is awesome), but if you want to be near the frontier everyone needs to be honest about the fact that we're only talking about Opus, Fable, GPT5.5.

undecidabot5d ago

It got 46.2 on DeepSWE in Z.ai's own run[1]. That would put it between Opus 4.7 xhigh and Opus 4.8 medium.

[1] https://z.ai/blog/glm-5.2

mrngld5d ago

If that ends up being true, GPT5.5 at 70 (and presumably Fable a bit ahead of that) is still in a different league, which was partly my point. To listen to online chatter, GLM5.2 is a tectonic shift in the landscape. In reality, it's just interesting. Probably safe to bet once the DeepSWE benches all get fully updated it won't even be on the pareto frontier.

I'm not accusing anyone specifically, but I've noticed Chinese bots swamping certain YouTube channels that, for example, cover US defense industry news. They'll downplay any and all technical advances, play up China's dominance, US cowardice, etc. All very transparent. I suspect some of the online conversation about open Chinese models is driven by that. How often do you see people talking about Mistral or Trinity? Never. Because they don't play that game.

cmrdporcupine5d ago

I gave GLM 5.2 a spin on openrouter yesterday and it was mostly fine but it racked up $5 in token use in 30 minutes of (relatively slow) work.

It's easily 4x the cost of DeepSeek V4 but I didn't actually feel the results were that much better. I had GPT 5.5 in Codex review it after it was done and there was plenty of slop to go around.

Having better luck with MiniMax M3, from a cost/benefit ratio.

pjerem5d ago

I really like DeepSeek V4 Pro. It's pretty smart and I get so much usage out of it on a $20 Ollama cloud plan.

With a good harness, that's my favorite model for any personal project. I use Opus 4.8 at work because i don't have to pay for it and of course I love it, but DeepSeek is like 80% there for one tenth of the price.

zooming5d ago

Try MiMo-2.5, I'm having astonishing success with it in opencode for cents per day. Not even the pro model.

re-thc5d ago

> I had GPT 5.5 in Codex review it after it was done and there was plenty of slop to go around.

GPT can find fault in everything and anything including its own work.

lukewarm7075d ago

with open models you can get a subscription with privacy, at the same cost as codex.

openai, google and anthropic subscriptions are not available with privacy.

looking at the link there it's interesting that going from cursor cli to codex cli takes gpt 5.5 from 7th to 3rd. but they didn't do open models in codex.

so, hard to say it's for sure a model benchmark. maybe open models are just shit at swe agent harness...it's not the most parsimonious explanation though.

vadansky5d ago

> with open models you can get a subscription with privacy

Unless you're running it locally, aren't you just trusting some other entity?

ttul5d ago

DeepSWE “feels” like the right benchmark in comparison to Artificial Analysis indices and other coding benchmarks. And by their metrics, GPT-5.5 is still king in token efficiency, speed, and overall intelligence per dollar.

https://deepswe.datacurve.ai/

Fable 5 is cool and all, but we have not yet seen GPT-5.6.

slagfart4d ago

GLM5.2 isn't even on this benchmark

CubsFan10605d ago· 10 in thread

Knowing very little about how to run these, how close are we to medium or larger businesses starting to buy hardware to run models like this to keep the models local?

It’s expensive, and not as capable as the frontier models, but would have some pretty big benefits around privacy and agency.

wongarsu5d ago

I know of multiple businesses in Europe that have been doing that for a while with 70B models, and are upgrading hardware to run the new crop of 700B-1T models (really started around Kimi K2, but buying and hosting that kind of hardware takes time)

Not everyone is willing (or even legally able) to send their trade secrets to OpenAI or Anthropic

user439285d ago

While certainly there are such cases with trade secrets, it's worth noting that even large banks typically have a provider like Azure or AWS onboarded.

There they can deploy these models while using the existing legal frameworks.

CubsFan10605d ago

What kind of hardware/price does it take to run those?

moffkalast5d ago

So far there seems to be one major use-case for complete privacy, and that is legal work. You don't need top of the line models to search vast amounts of text in discovery and it needs to be completely confidential. There's quite a few lawyers over on r/localllama showing off their multi-GPU builds. Coincidentally they also have the vast funding required for it.

MikhailTal5d ago

This is not a new situation. This was also happening when good vision models like AlexNet were coming through, especially for OCR. Companies had a choice between cloud and self-hosting with GPUs. But it turns out the problem is usage patterns.

Your usage will peak during certain timezone work hours (even if you are a huge multinational company, most of your engineers/users tend to be from only a few locations), so then you have a bunch of GPUs doing nothing the rest of the day. Especially with latency-sensitive stuff, this is a decades-old tradeoff problem; it's not unique to LLMs.
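The idle-GPU point is simple arithmetic: hardware costs the same whether it is busy or not, so the cost per useful hour scales with the inverse of utilization. A sketch with made-up numbers; the $2/hour amortized cost and the 8 busy hours per day are assumptions for illustration only.

```python
def utilization(busy_hours_per_day):
    """Fraction of the day the hardware is doing useful work."""
    return busy_hours_per_day / 24


def cost_per_busy_hour(hourly_cost, busy_hours_per_day):
    """Spread a full day's cost over only the hours that are actually used."""
    return hourly_cost * 24 / busy_hours_per_day


# Assumed: $2.00/hour amortized hardware + power, busy 8 hours a day.
print(f"utilization: {utilization(8):.0%}")
print(f"effective cost: ${cost_per_busy_hour(2.0, 8):.2f} per busy hour")
```

At one-third utilization, every busy hour effectively costs three times the nominal rate, which is the gap a cloud provider closes by selling the idle hours to someone in another timezone.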

petesergeant5d ago

Unless you have genuine national security concerns, you’d be better off just negotiating a commercial agreement with privacy protections with a couple of existing vendors.

CubsFan10605d ago

I think that's true until it isn't, which may end up being the problem. Fable/Mythos doesn't fall under the ZDR agreements with Anthropic. And I'm curious if others will follow suit.

tancop5d ago

if you can afford the investment you get stable low costs for years with better security (at least if your cyber team is good). it's even better in regulated industries where some vendors might add a premium for HIPAA/SOC/PCI DSS compliance to the point it's a lot cheaper to self host. for a smaller business it's not worth it and you should just use a hosted open model.

Havoc5d ago

It’s a ~750B model so still a hell of a lot of vram

Would need to be a pretty determined medium biz
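
Rough back-of-envelope for the weights alone, as a sketch: it assumes the 753B total parameter figure quoted elsewhere in this thread and nominal bytes per parameter. Real quantized formats carry extra overhead for scales, and KV cache and activations come on top, so treat these as lower bounds.

```python
# Back-of-envelope weight memory for a ~753B-parameter model.
# Assumptions: nominal bytes per parameter, no quantization overhead,
# no KV cache or activation memory.
TOTAL_PARAMS = 753e9

BYTES_PER_PARAM = {"bf16": 2.0, "8-bit": 1.0, "4-bit": 0.5}

def weight_gb(params: float, bytes_per_param: float) -> float:
    """Return weight storage in decimal gigabytes."""
    return params * bytes_per_param / 1e9

def weight_table(params: float = TOTAL_PARAMS) -> dict:
    """Map each precision name to its approximate weight size in GB."""
    return {name: weight_gb(params, bpp) for name, bpp in BYTES_PER_PARAM.items()}
```

That works out to roughly 1.5 TB at bf16 and about 750 GB at 8-bit, before any runtime memory.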

re-thc5d ago

> how close are we to medium or larger businesses starting to buy hardware to run models like this to keep the models local?

Years.

Even Microsoft said they don't have enough for Github and need to call Amazon.

Getting a few even at decent prices is hard. Unless the shortages ease...

simonw5d ago· 9 in thread

I was surprised that GLM 5.1/5.2 are not vision models - they are text input only.

That's actually pretty uncommon these days. All of the OpenAI/Anthropic/Gemini models accept images, and so do the other leading open weight families - Gemma 4, Qwen 3.6, Kimi 2.x.

In GLM's case image input would be useful because it's a model that scores very highly for tasks like web design, but without image input it can't take a screenshot and output HTML+CSS.

Don't get me wrong, GLM is a phenomenal model, but the image thing is a bit of a gap.

0xbadcafebee5d ago

Configure a subagent in your coding harness to spin up a new sub-session with any vision model for those tasks and feed the result back to the main model. No need for "one model that does everything"

ricardobeat5d ago

That doesn’t work well in a lot of scenarios. The text LLM doesn’t know what to look for in an image before it sees a description, you might need multiple rounds of back and forth.

WASDx5d ago

Are you suggesting it should summarize the image in text or generate it in HTML or something else?

_pdp_5d ago

I don't see this being such a big gap. There are some use-cases for sure but apart from UX/UI work it is not really needed. Besides, none of the frontier models can replicate actual images; they can approximate, at least in my own experience.

simonw5d ago

One of my tests for a new model is dumping in a screenshot of a web page and seeing if it can recreate it from scratch in HTML and CSS.

Even the local models I run on my Mac are getting surprisingly good at that now.

tiahura5d ago

Using llms to generate docx. Being able to rasterize and review is an important part of the process.

x3cca5d ago

I've been using Google AI Studio as a free vision bridge. Gemma 31B is dummy capable at vision and at 1500 rpd it's basically unlimited.

abby30104d ago

Agreed, that's actually one step that will make people adopt it widely for customer facing AI Agent!

ashenke5d ago

I had the same reaction with Deepseek V4 ! It would be more useful as a vision model

rahidz5d ago· 8 in thread

Correct me if I'm wrong, but neither DeepSeek nor GLM have image input modality. This makes them less useful when looking at UIs, photos, screenshots, etc. doesn't it? Or do they have alternate ways of doing so?

segmondy5d ago

DeepSeekv4+ will have image capability, they said so in their paper. GLM whenever they decide to. Both companies have the tech and for whatever reason haven't decided to prioritize it. Both of their OCR models are SOTA among all OCR models closed or open. GLM demonstrated they know how to do this, with GLM-4.6V.

dryarzeg5d ago

Yes, you are right (as far as I'm aware). For things where you need the LLM to look at screenshots, photos or other images you can use Kimi-K2.6/K2.7 - comparable pricing, somewhat comparable performance and quality. You can even probably combine two models (e.g. Kimi and GLM) in one agent, using Kimi for multimodal inputs and GLM for everything else, although 1) I'm not sure if this will not cause some kind of context poisoning with low-quality patterns for the better performing model (e.g. in some cases Kimi may be worse than GLM, but GLM, when following up, may adopt the same reasoning patterns as Kimi, undermining its own performance), and 2) I'm not quite sure if it's possible with the tools currently available (I'm not really into agentic or chatbots stuff to be honest).

mordae5d ago

They do not and it sucks for certain tasks.

It also means that if they actually trained with vision, they'd be on par with Anthropic models as vision seems to improve model performance across the board even for non-vision tasks.

osti5d ago

Many other open source models have vision but they don't compare to GLM in terms of coding quality. So I don't think it's because of vision that the frontier models are better, it's more that they are probably just much bigger models.

freigeist795d ago

it helps to give them a CLI vision tool (curl to an OpenRouter vision model, for example)

adrian_b5d ago

That's right, but there are other recent open weights and relatively big LLMs that are multimodal, e.g. MiniMax-M3.

With open weights LLMs, it is affordable to use many different models, each for whatever it is better.

Moreover, for analyzing "UIs, photos, screenshots, etc." there are small models that can be run locally on smartphones or laptops, e.g. IBM granite-vision-4.1-4B, certain Google Gemma 4 variants and certain Qwen variants, whose output you can use as input for a big LLM, in order to accomplish some more complex task.

0xbadcafebee5d ago

Configure a subagent in your coding harness for vision, add a prompt about the vision use, configure a vision model for it, modify your main agent's prompt to use the vision subagent for vision tasks. Now your non-vision model has vision support.
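
A minimal sketch of that routing, independent of any particular harness. `describe_image` and `ask_text_model` are hypothetical callables standing in for whatever your setup exposes for the vision model and the main text model; nothing here is a real harness API.

```python
from typing import Callable, Optional

# Hypothetical stand-ins for the two models; not a real harness API.
VisionFn = Callable[[bytes, str], str]  # (image bytes, instruction) -> description
TextFn = Callable[[str], str]           # prompt -> answer

def answer_with_vision_bridge(
    question: str,
    image: Optional[bytes],
    describe_image: VisionFn,
    ask_text_model: TextFn,
) -> str:
    """Route image input through a vision model, then hand plain text to the main model."""
    if image is None:
        return ask_text_model(question)
    # Pass the task along so the vision model knows what to look for.
    description = describe_image(image, f"Describe what matters for: {question}")
    prompt = (
        "[Image description from vision subagent]\n"
        f"{description}\n\n"
        "[Task]\n"
        f"{question}"
    )
    return ask_text_model(prompt)
```

Passing the task into the vision call is a partial answer to the "it doesn't know what to look for" objection raised elsewhere in the thread; it is still one round trip, so follow-up questions about the image would need another call.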

Havoc5d ago

They have a separate VL model but never tried it

CuriouslyC5d ago· 5 in thread

I've been playing with this model a fair amount over the last 24 hours, and I can confirm it's quite capable, while being a little bit verbose (I've seen it reconsider things 3-4 times in thinking traces before deciding on a path forward), and not being quite as good as GPT5.5 at working through complex abstract requirements.

Honestly it's good enough that I feel comfortable recommending a Z.AI sub + a $20/mo OpenAI sub for all but the most AI pilled multi-orchestrators, or the die hard Claude fans. GLM writing + GPT reviewing/debugging feels pretty unlimited and minimally worse than just doing everything in GPT with the $200/mo plan.

Havoc5d ago

> while being a little bit verbose

Discovered today that they set reasoning effort to max by default. So that’s probably why

igravious5d ago

After having got a taste of Fable 5 for me Opus 4.8 doesn't cut it any more -- and I don't know how to put this, I don't know if it's just me, but its rhetorical flourishes are starting to really grate on me, never mind that it is at times deliberately weasel-wordy and economical with the truth until pressed. Opus 4.8 is definitely a stronger coding agent than DeepSeek 4.0 or Kimi 2.7, succeeding where they flounder and fail, but its way of expressing itself conversationally is making me reconsider my subscription …

elwebmaster5d ago

You are not alone. How about GPT 5.5? Does it come close to Fable 5?

sdesol5d ago

> GLM writing

This is honestly what I care about the most now, which is how well they can write. I think we have reached a point now where, if you know how to program, you can provide enough information for the models to pretty much do what you need.

What they still struggle immensely with is the writing which has too many nuances but they are truly getting better.

andai5d ago

This is my workflow. And then once a day I copy paste the code into the free Claude Sonnet so it comes out actually readable.

tensegrist5d ago· 4 in thread

> On the Intelligence vs. Cost per Task Pareto Frontier: GLM-5.2 is on the Pareto frontier of the Intelligence vs Cost per Task chart, with the lowest cost per task among models at its intelligence level. GLM-5.2 costs ~$0.46 per task, compared to GLM-5.1 ($0.25), Kimi K2.6 ($0.31), MiniMax-M3 ($0.18) and DeepSeek V4 Pro (max, $0.05)

am i missing something?

OtherShrezzing5d ago

I think they’ve just picked poor peer examples. Instead of choosing other models near 5.2 on the intelligence scale, they’ve picked some open models from further down the scale.

acchow5d ago

pareto frontier does not mean cheapest.
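
To make that concrete, a small sketch: a model is on the frontier if no other model is at least as cheap and at least as smart while being strictly better on one of the two. The costs are the ones quoted above; the intelligence scores are invented placeholders purely so the example runs.

```python
from typing import List, NamedTuple

class Model(NamedTuple):
    name: str
    cost: float          # dollars per task, lower is better
    intelligence: float  # benchmark score, higher is better

def pareto_frontier(models: List[Model]) -> List[Model]:
    """Keep the models that no other model dominates on both axes."""
    frontier = []
    for m in models:
        dominated = any(
            o.cost <= m.cost
            and o.intelligence >= m.intelligence
            and (o.cost < m.cost or o.intelligence > m.intelligence)
            for o in models
        )
        if not dominated:
            frontier.append(m)
    return frontier

# Costs come from the quoted text; intelligence values are made up.
MODELS = [
    Model("GLM-5.2", 0.46, 60),
    Model("GLM-5.1", 0.25, 52),
    Model("Kimi K2.6", 0.31, 50),
    Model("MiniMax-M3", 0.18, 48),
    Model("DeepSeek V4 Pro (max)", 0.05, 45),
]
```

With these placeholder scores the most expensive model is still on the frontier, because nothing cheaper matches its score, while a mid-priced model drops off once something cheaper scores higher.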

xiaoyu20065d ago

Some models are heavily subsidized. Total params & active params are better measurement of inference cost.

simianwords5d ago

No models are subsidised -- there are lots of third party hosting services that will still run at breakeven/profit. (except Deepseek after discount)

Pragmata5d ago· 4 in thread

So this basically means we will have a near opus level model able to be run locally in the next couple of months right?

QWEN 3.6 27b is already pretty good, but it should be possible to get a better option now that runs in the same hardware, right?

CamperBob25d ago

So much depends on the thinking effort, it's almost meaningless to compare these models without specifying it. GLM 5.2 needs to run with max thinking effort to be competitive with the leading-edge models from OpenAI and Anthropic. That slows it down quite a bit in my experience. Meanwhile, those models have thinking-effort knobs of their own that make a big difference, especially in GPT 5.5's case.

I have been messing with an early NVFP4 quant of GLM 5.2 and so far, that model in its Max setting outperforms GPT 5.5 on its default setting. But GPT 5.5 still pulls ahead once I crank up its own reasoning effort. I imagine the same is true of Opus 4.x but haven't pitted them against each other yet.

XCSme5d ago

Which Opus?

GLM-5.2 is already close to Opus-4.7 level:

https://aibenchy.com/compare/anthropic-claude-opus-4-7-mediu...

XCSme5d ago

Oh, or you meant a smaller model than GLM-5.2 with similar capabilities?

segmondy5d ago

Why wait for the next few months? There are plenty of better models that you can run today locally. Qwen3.5-397B beats Qwen3.6-27B. MiniMax2.7 is a longrun horizon monster. (I haven't given 3 much of a try yet). KimiK2.6/2.7, MiMoV2.5/MiMoV2.5-Pro and GLM5.1 will wreck Qwen3.6-27B any day on any task.

ponyous5d ago· 4 in thread

Just ran and scored 63 3d model generations (via code) across high and no reasoning. 3D Modeling benchmark quickly shows spatial, logic and code performance of the model so I think it's a very good indicator of the quality.

Here are the results compared to Gemini 3.5 Flash:

    Model + config          CodeErr/gen   Cost/gen   Median time   Quality
    gemini-3.5-flash, low      0.71        $0.18        68s       baseline
    GLM 5.2, reasoning high    0.61        $0.18       289s         -6.0%
    GLM 5.2, reasoning off     1.52        $0.10       126s        -13.6%

Although it is cheaper, it is significantly slower, and results are worse overall. Surprisingly, high reasoning produces fewer code errors than gemini 3.5 flash, but when I actually look at the models they are worse.

Edit: I recently ran evals with Kimi 2.7 and MiniMax-M3 and this is clearly open source SOTA model, by far.

NiloCK5d ago

Very interested in this! Can you share more about the modelling method (eg, three js?), the task list, and outputs here?

I think there's probably some good juice to squeeze in terms of spatial awareness by doing a benchmark something like

- give 3d modelling task

- render and snapshot from a variety of angles

- feed to third-party vision model for a "what is this" type query

- grade on end-to-end accuracy

Bonus points for asking the vision model something like "how beautiful is this 1-10".

ponyous5d ago

I don't have the eval results live yet, so I cannot share them yet.

I was benchmarking using a soon to be released new version of my AI CAD modeling software[0]. It's basically an agent that has access to tools that can execute build123d scripts, get sculpted models, blender to combine sculpts + parametric models, tools to inspect the model (visually and with code), search datasheets, ...

I tried what you recommend a while ago (asking an AI to evaluate using different angles) and the AI evaluations were extremely bad - barely any correlation to what I scored. Things have gotten better, but I don't trust it enough yet.

Here is how I score adherence (and how AI did as well, but I tried methods where it would just give back a boolean "pass" or not):

    <0.2 → Poor – Misses core intent; largely irrelevant or incorrect.
    <0.4 → Weak – Partially relevant; significant omissions or errors.
    <0.6 → Fair – Covers main points but lacks completeness or precision.
    <0.8 → Good – Mostly accurate; minor gaps or deviations.
    <=1.0 → Excellent – Fully aligned; precise, comprehensive, and faithful to intent.
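
The bands above can be written as a small lookup. This is only a sketch: the name `adherence_label` is made up for illustration, and it assumes scores lie in [0, 1] with each upper bound exclusive except the last.

```python
import bisect

# Upper bounds of the first four bands. A score equal to a bound falls into
# the next band up; 1.0 itself is included in "Excellent".
BOUNDS = [0.2, 0.4, 0.6, 0.8]
LABELS = ["Poor", "Weak", "Fair", "Good", "Excellent"]

def adherence_label(score: float) -> str:
    """Map an adherence score in [0, 1] to its rubric label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    return LABELS[bisect.bisect_right(BOUNDS, score)]
```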

Here is the scenario list (prompts are much more detailed):

    dragon-bottle-stopper
    editing-param-mid-conv
    editing-parametric-enclosure
    editing-swap-material-param
    editing-text-edit-cube
    multi-turn-bird-house
    multi-turn-dice-tower
    multi-turn-modular-planter
    multi-turn-phone-stand
    multi-turn-shelf
    one-shot-bookend
    one-shot-cable-clip
    one-shot-chess-queen
    one-shot-coaster
    one-shot-coffee-cup
    one-shot-dog-tag
    one-shot-dragon-figurine
    one-shot-hex-bracket
    one-shot-keychain-fob
    one-shot-low-poly-tree
    one-shot-pegboard-hook
    one-shot-pi4-case
    one-shot-threaded-jar

[0]: https://grandpacad.com

ComputerGuru5d ago

Would you be able to run it against Gemini Flash (not Lite) 3.0, high thinking?

ponyous5d ago

Absolutely. Running it now, will update this comment in about 30 mins.

Edit: Surprisingly very good results with 3.0 flash with high thinking.

Cost: $0.06

Duration: 3.22 min

Code Errors: 1.3 per attempt (meaning on average it had to retry 1.3 times)

Adherence was on par with 3.5 flash Low thinking

JustSkyfall5d ago· 4 in thread

The problem with these benchmarks is that the Chinese models tend to be incredible on paper, and absolutely terrible in practice :/

CuriouslyC5d ago

This was a problem with older Qwen/MiMo/Kimi models mostly. GLM has always been on the more robust side, and newer iterations from all those labs have improved as well. The only lab I've seen regressing this way is DeepSeek, 3.2 was fairly robust but 4.0 feels more benchmaxxed.

Mashimo5d ago

I have used GLM since version 4.8 I think and do enjoy using them. More than other models like Kimi or Deepseek. Though only tested them on smaller private projects.

bel85d ago

I beg to differ. I replaced a $40/mo GitHub Copilot subscription where I used Opus 4.6 and GPT 5.5 with a $10/mo opencode Go plan where I use mostly DeepSeek V4 Flash and testing MiMo 2.5.

I work on mid-sized projects currently (200k to 1kk lines of code).

segmondy5d ago

You are obviously lying because it shows you have no experience with them. GLM models since 4.5 have been crushing it. All their models since then haven't skipped a beat: 4.5/4.5-air, 4.6, 4.7, 4.8, 5, 5.1. That aside, MiMoV2.5, MiniMax from 2.0, DeepSeek from V3, Kimi since V2, Qwen since 3, Hy3 have all been amazing models. All from China, we need to get over it. China is not losing yet as far as the AI race is concerned.

kissgyorgy5d ago· 4 in thread

I tried it today through Openrouter and the API is atrocious. I got multiple rate limit and random errors every turn.

Somebody wrote [1]: "I am never touching Minimax or GLM again. Their APIs had constant outages and I had to restart my runs multiple times — after burning money on the runs that failed midway." and I 100% agree.

The model might be good, but if the API is so bad, it's effectively useless.

[1]: https://kasra.blog/blog/i-spent-1500-seeing-if-llms-could-ha...

segmondy5d ago

The entire point of this post is that it's open weights, you can run it yourself and don't have to deal with the API issues. You really do have that choice.

Havoc5d ago

That’s what happens when you offer something decent at a fraction of the price of opus - more demand than you can serve

ComputerGuru5d ago

Give it a few days and additional providers will be up and available on OpenRouter. Then the game of figuring out who’s not nuking the weights and neutering the quantization begins.

osti5d ago

I indeed got a few timeouts yesterday using the official API, I imagine for the coding plan users it'll be even worse.

XCSme5d ago· 3 in thread

In my tests[0] GLM-5.2 is not much better than GLM-5, and overall DeepSeek V4 Flash seems to be the better/more cost-effective choice:

[0]: https://aibenchy.com/compare/deepseek-deepseek-v4-flash-high...

XCSme5d ago

I think the problem, as can also be seen on other benchmarks, is that most models nowadays are focused more and more purely on tool calling and coding.

This means, that models are losing more and more general and domain-specific knowledge.

Look at those graphs on Artificial Analysis, GLM-5.1 still performs similarly or better:

AA-Omniscience Accuracy: https://i.snipboard.io/5DYmpx.jpg

IFBench: https://i.snipboard.io/74kg0R.jpg

I still feel like models are not getting any smarter for a few months already, they just changed their training to be focused more on some areas than others, so shifting the intelligence from one place to another, not necessarily increasing the overall intelligence or "AGI" score.

HDBaseT5d ago

Well, in that example it still seems the big players are increasing overall "intelligence" as Fable tops the list.

OpenAI has big incentives to improve general intelligence as a large percentage of users use ChatGPT for support, finances, questions, etc. Not just coding.

sourcecodeplz5d ago

man, i love dsv4-flash but i found its weaknesses in complex projects with multiple moving parts. tried kimi 2.6 and it understood and could work on the task. bigger is better..

hereme8885d ago· 3 in thread

Hmmm... GLM insists it's Gemini.

https://github.com/zai-org/GLM-5/issues/79

coder5435d ago

Claude Sonnet 4.6 identified itself as DeepSeek repeatedly: https://www.reddit.com/r/DeepSeek/comments/1rd5jw7/claude_so...

I tested this myself a few months ago, and confirmed that it was really happening.

LLMs don't know who they are unless the system prompt tells them, and as all of them are trained on model responses that exist on the web that end up being scraped, the weights may predict a certain incorrect response. LLMs have no ability to introspect, and do not know anything about themselves, so they will hallucinate in response to that question unless they are carefully trained on that exact, pointless question.

bityard5d ago

It's a surprisingly common misconception that models contain any metadata at all about themselves in their weights. If you ask them, "What model are you?" they either retrieve the answer from the system prompt, or they hallucinate an answer. Same goes for questions about knowledge cut-off, how many parameters they have, the source of their training data, etc.

adastra225d ago

Then why does it score better than any Gemini model?

gertlabs5d ago· 2 in thread

GLM 5.2 is the first model we've tested that is unambiguously on par with, or better than Opus 4.6 (although as usual, we have GLM 5.2 and most other Chinese models a bit below most other benchmarks with more vulnerable test methodologies).

Data at https://gertlabs.com/rankings

nsoonhui5d ago

I really have to take your score with a grain of salt because Opus 4.5 does better than Opus 4.6

gertlabs4d ago

They're within confidence intervals of each other, but remember how much discussion there was that Opus 4.6 had been nerfed in March. We averaged samples over the entire lifetime of Opus 4.6, which likely served many different underlying checkpoints. Even the best version of Opus 4.6 was hardly an upgrade.

We find a lot of interesting anomalies with our benchmark that hold up under large sample sizes.

kingstnap5d ago· 2 in thread

According to many benchmarks this model is straight up frontier level and Zai seriously cooked. Some of these numbers are incredible.

Excited to see if this turns out to be a Open Weight Opus 4.5 or better.

andai5d ago

The only benchmark that matters is your actual task.

I've had models that benched poorly but performed great. And I constantly see models at near the top of AA, which are terrible.

There doesn't necessarily seem to be a lot of overlap between benchmarks and real world usage. (Let alone common sense!)

As far as they go, though, these harder benchmarks match my experience more closely:

https://deepswe.datacurve.ai/

and https://cognition.ai/blog/frontier-code

Where we see "top" models drop way down in score when given longer tasks.

That being said, I've had a reasonably pleasant time with GLM-5.2 so far. (And have had an OK time with DeepSeek as well.)

By the time I'm done testing all the Chinese models, they'll be obsolete :)

adastra225d ago

According to reports in this thread it is somewhere between Opus 4.7 and 4.8. This is effectively frontier.

_pdp_5d ago· 2 in thread

I am helpful.

DeepSeek V4 has been quite amazing in our workloads and it operates at a fraction of the cost. I have not tried GLM 5.2 but it seems that it hits a sweet spot.

LUmBULtERA5d ago

Your system prompt is showing.

kreddor5d ago

Maybe he meant "hopeful"...

daniban5d ago· 2 in thread

I'm curious what harness everyone is using for these? I want to start to test some of these open models but don't know what tools people use to get these working "agentically".

gorbypark5d ago

I am using OpenCode with the DeepSeek API with some pretty good results.

zackify5d ago

pi.dev and ask ai to add features you miss from claude or codex. i configure keyboard shortcuts and swap models easily

piterrro5d ago· 2 in thread

DeepSeek v4 pro is still 10x cheaper than GLM-5.2 and the quality is still enough for 95% of coding tasks.

enraged_camel5d ago

People always say stuff like this, but it is misleading. The reason it's misleading is because that remaining 5% makes a huge difference, and is where most of the value of using AI agents lies.

I'm not interested in using AI to write code that would have taken me 5-10 minutes to write myself. I use AI to debug complex bugs and develop large features that span multiple domains - stuff that normally takes hours, if not days/weeks. A model that is "enough for 95%" does not cut it for that, because the failures compound during long-horizon tasks and the thing becomes a mess.

0xbadcafebee5d ago

....so use DeepSeek v4 Pro for 95% of your coding tasks, and GLM 5.2 for the other 5%? You don't need to stick to one model.

eckelhesten5d ago· 2 in thread

Sure, but whatever you do, don't buy their (Z.ai) lite plan.

I feel like i threw 15 dollars in the sea. I'm getting rate limited after 3-4 prompts. You get way less value than just paying 25 dollars for Claude or OpenAI models.

granra5d ago

How are you using it? I have the lite plan and I've only ever maxed my weekly usage a few hours before reset. I will concede that I'm not a super heavy LLM user but it's been really good for me.

My workflow is usually:

- read file. I want to achieve X, how do? Do not implement anything.

- I would do a, b and c

- sketch a brief implementation of your suggestion

- <code> (not writing files yet)

- instead of your approach x, wouldn't it make sense to instead do z? What would that look like?

- <code>

- nice, implement this

- starts writing files, run tests, etc.

Alifatisk5d ago

Did you consider their peak hours and model usage multiplier? Read the green box https://docs.z.ai/devpack/overview#usage-instruction

I had the Lite plan, I NEVER maxed out the quota because I considered these things. If I, for example, switched over to GLM-5-Turbo, then I could've easily burned through quota.

adithyaharish5d ago· 2 in thread

why don't all open source LLMs have open weights like this model?

bigyabai5d ago

https://en.wikipedia.org/wiki/Artificial_scarcity

Retro_Dev5d ago

"open source" means that the code itself (for LLMs - this is training code) is available to the general public. "open weights" means that the weights (trained over time) are available publicly, rather than locked behind a paywalled chat. I do not know of an open source LLM that is not also open weights (unless they never bothered training it). Models like Claude and Gemini are neither open source, nor are they open weights.

m-dot-reviews5d ago· 1 in thread

For anyone who's interested, I've put together a simple site for sharing ratings/opinions on models at a task-specific granularity. https://model.reviews/

The idea is that benchmark score comparisons are useful for a large cross-product comparison across models + their settings, but less useful if you're looking for the best model for <your-specific-task>. So I thought having a place to review and comment could be beneficial to people.

I'm not sure how best to get the corpus bootstrapped (i.e. people will likely only visit/post on the site if there's already activity), so posting it here for anyone who'd like to contribute.

swingboy5d ago

I get a 500 when clicking “Explore the Models”

Imustaskforhelp5d ago· 1 in thread

I have been trying out GLM 5.2 and I am really impressed by it for the most part.

To all people on Hackernews, I am curious as to what agent harness are you using it with.

Previously I was using opencode and then I switched to using Opencode + obra/superpowers and creating custom skill.md themselves for it. I found things to take more time and intervene more but the result of it has been that I have found it to work better.

Now I have also started using oh-my-pi as well and I found it to be faster compared to Opencode.

I am unsure how much of there is a difference to it and how much of things are placebo but what is your opinion regarding the best Agent harness for GLM 5.2?

Alifatisk5d ago

I just used CC with GLM, I was satisfied.

dizhn5d ago· 1 in thread

FYI.. This is coming with 3mil GLM 5.2 tokens right now. (Needs login. Google SSO fine) https://zcode.z.ai/en

Alifatisk5d ago

Where can I read more about the coming 3mil GLM 5.2?

guybedo5d ago· 1 in thread

It's probably a good model but they used GLM 5.1 to code their infra.

I signed up to their max plan yesterday, did some light coding work, and i'm at 180M tokens used and 40% weekly quota gone.

Even when tokenmaxxing on the Claude Max or GPT $200 plan, i couldn't get more than 20% quota gone per day.

bigyabai5d ago

Are you using it for long context windows? I burn through my 5hr quota with GLM almost instantly on 200k+ contexts, but if I reset every ~100k or so it's much more manageable.
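
A toy model of why resetting helps, under a loud assumption: every turn re-sends the full history and each re-sent token counts at full weight against the quota. How a given provider actually meters cached prefix tokens is not something this sketch knows.

```python
from typing import Optional

def total_input_tokens(tokens_per_turn: int, turns: int, reset_at: Optional[int] = None) -> int:
    """Sum the context sizes sent over `turns` turns.

    Assumes each turn adds `tokens_per_turn` to the history and re-sends all
    of it. If `reset_at` is given, the context is cleared once it reaches
    that size.
    """
    total = 0
    context = 0
    for _ in range(turns):
        context += tokens_per_turn
        total += context  # the whole history is sent again this turn
        if reset_at is not None and context >= reset_at:
            context = 0   # start a fresh session
    return total
```

With 10k tokens added per turn over 20 turns, letting the context grow to 200k sends about 2.1M input tokens in total, while resetting at 100k sends about 1.1M. The growth is roughly quadratic in session length, which is why long sessions eat quota so fast.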

RDTvlokip5d ago· 1 in thread

I have a question, as it happens: Do you think the models were trained on benchmark datasets to skew the results, even though in real-world applications we realize they're not that great?

sinuhe695d ago

Recent incident with the Rio 3.5 model clearly shows that many coding models are specifically trained/fine tuned for the benchmarks.

lousken5d ago· 1 in thread

Cerebras really needs to have this on their API list (if they even still exist).

Marciplan5d ago

they went public a few weeks ago

sourcecodeplz5d ago· 1 in thread

1m context btw.

Alifatisk5d ago

And apparently, actual support for 1M context window, not just theoretical.

wongarsu5d ago

It's also third best overall on "AA-Omniscience Non-Hallucination Rate", far higher than DeepSeek, GPT 5.5 or Fable.

That's the one benchmark that allows LLMs to answer "I don't know" and punishes them for trying to bullshit their way through the questions
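
As an illustration of that idea (not Artificial Analysis's actual formula, which I am not reproducing here), a scoring rule where abstaining is neutral and a wrong answer costs as much as a right answer earns could look like this:

```python
from collections import Counter
from typing import List

ALLOWED = {"correct", "wrong", "abstain"}

def abstention_aware_score(outcomes: List[str]) -> float:
    """Score in [-1, 1]: +1 per 'correct', -1 per 'wrong', 0 per 'abstain'.

    Hypothetical rule for illustration only.
    """
    if not outcomes or any(o not in ALLOWED for o in outcomes):
        raise ValueError("outcomes must be a non-empty list of: correct, wrong, abstain")
    counts = Counter(outcomes)
    return (counts["correct"] - counts["wrong"]) / len(outcomes)
```

Under this rule a model that gets half right and guesses wrongly on the rest scores 0, while one that gets the same half right and says "I don't know" on the rest scores 0.5.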

SwellJoe5d ago

I added it to my benchmark based on Mythos-reported bugs, and it's better than GLM 5.1, but still behind several other models, maybe most directly comparable to Qwen 3.7 Max. But, several other open models, including small self-hostable ones (Gemma 4 and Qwen 3.6), found the same number of bugs, 3 of 9. Though it also gets partial credit for reporting one bug in the right spot, but kinda misunderstanding the bug. I also added Kimi K2.7-code in the same run, and it did poorly, consistent with 2.6 performance. Anyway, there are better, cheaper, models on this particular benchmark.

https://swelljoe.com/post/will-it-mythos/

(This small benchmark doesn't prove anything. It's a limited data set and each model only gets one shot at each file in the corpus. But, I find it useful for quickly sussing out if a model can reason about pretty complicated problems in code.)

xiaoyu20065d ago

This open source model is quite near SOTA with only 700B/40B MoE. Truly efficient.

osti5d ago

Fun fact: Zhipu (aka Z.ai, Knowledge Atlas, etc.), the company that made GLM, is listed on the Hong Kong stock exchange and is up over 10x since the IPO at the beginning of this year.

davidwritesbugs5d ago

I like their models, super cheap - I'm a Lite plan subscriber, and subjective performance seems to be the same as lower Anthropic models, useful for lots of grunt work. The problem is that Zhipu really __really__ struggles with capacity - everyone is complaining of timeouts or very slow speeds. I can't get direct access to the model though I see it is in OpenRouter so I may play. But the capacity issues mean DeepSeek is my main provider these days.

leemoore5d ago

GLM 5.2 feels like Opus 4.6 level. I actually think 4.6 and GLM work better in practice than opus 4.7 or 4.8 as I find both of those more erratic and seem to randomly have a super dumb turn. That random bad turn I see doesn't seem to be hitting the benchmark scores but they make 4.7 and 4.8 very hard to use for me. GLM is more stable like opus 4.6

ramon1565d ago

I've made a comment before that 5.1 will sometimes get stuck looping over a simple decision or statement. It will basically contradict and then not realize that one option is the definite option. Sometimes it's two statements that aren't even exclusive. Nonetheless, a lot of tokens that get wasted from this.

I haven't extensively used 5.2 yet, but it seems a lot better.

tomerbd4d ago

I code daily with AI - real programming tasks, professional, real work, real customers. I use the 4 below:

- codex 5.5 medium - best results less hand holding medium speed

- opus 4.8 max - mediocre with hand holding medium speed

- glm 5.2 max - mediocre with hand holding and super slow

- composer 2.5 - mediocre with hand holding and super fast

I use all of them, since I run multiple coding agents in parallel. Disclosure: I use rexide, which we created for all these agents to run in parallel with good visibility and feedback.

bizer4d ago

Z-ai/GLM’s KV caching technology is truly impressive; the implicit cache hit rate of its official API exceeds 95%, far surpassing other APIs that support implicit caching, such as Gemini and Qwen. I’ve been pondering the architectural design behind this, though I haven't yet formed a fully coherent theory.

redbell5d ago

Launch announcement from four days ago: https://news.ycombinator.com/item?id=48518684

The requirements to run this model locally: https://www.reddit.com/r/LocalLLaMA/comments/1u8ai2a/glm52_i...

mesmertech5d ago

Seems really good at frontend work, and as a result on remotion programmatic videos. Not the best yet, that's still Gemini 3.1 pro (trained on actual videos) or Fable, but often better than what Opus can come up with

https://mesmer.tools/benchmarks/ai-video-generation

gauravvij1374d ago

They've come along pretty far now.

I remember when there was hype around GLM 5 reaching great heights on benchmarks but eventually failing on practical coding and reasoning tasks. I guess this time the hype is real.

jauntywundrkind5d ago

Also so wild that it's relatively compact. 753B-40A is so reasonable, shows incredible scaling in what the model can do, without just throwing heaps of new parameters in.

This is silly but I dig how 753 is very close to 745, which is the watts in a HP. 1bHP parameter model. Silly, but I enjoy it.

alansaber5d ago

These open source models need better multi-turn capabilities. They are always lacklustre in "agent mode". Whether it's just less RL, whatever, it's a worse "product". Whereas it feels like the frontier labs have been all-in on "agentic" multi-turn reasoning for a long time now.

aunty_helen4d ago

Before you go and sign up to the max plan like I did, they are obviously struggling for capacity. I'm getting API rate limited and 429'd on a simple "hello"

robertwt75d ago

what is that moodboard and chart of hypertension in the middle of the article that isn't explained?

This is a great step up in open models however the pricing to support z.ai is not far cheaper than Claude / OpenAI subscription

jayess5d ago

I asked z.ai what z.ai is, and it said "It seems you might be referring to xAI, as "z.ai" isn't a widely known or major AI company or platform at this time."

creamyhorror5d ago

It's a real step forward, getting closer to SOTA. It seems to be very epistemically cautious in its reasoning. I hope Deepseek and the other open-weights labs stay in the game and catch up too.

KaoruAoiShiho5d ago

This is really held back by one bench (omniscience accuracy) where it's very far behind; otherwise I think it would score at least a couple of points higher.

hit8run5d ago

Ok, it is nice to see another great open source model. Not sure what to think of all these benchmarks but GLM was already quite strong before so an update is very welcome.

Computer05d ago

Regrettably I haven’t tried 5.2 yet but 5.1 I did not see as anything special. In practice I found it to be ~70% as good as Claude sonnet.

PetrBrzyBrzek5d ago

I'm a bit shocked that GLM 5.2 is not multimodal. Like, how should I use it? I use images all the time.

Havoc5d ago

It’s pretty good. More talkative than 5.1. Reminds me of deepseek 4

Their servers are melting though - getting more timeouts etc

zftnb6665d ago

Open-weight models are winning. The gap with closed models is now measured in months, not years.

nh43215rgb5d ago

> GLM-5.2 sits off the most attractive quadrant on the Intelligence vs Output Tokens chart.

That is unfortunate...

blt5d ago

There's only one GLM in my heart: the one that includes vec3.hpp

casey23d ago

Mark my words, by the end of 2027, there will be an open weights model that is better than anything OpenAI and Anthropic are capable of making. They will lose at inference scaling too.

hyqzz85d ago

It is a very useful model

catigula4d ago

Which American model did they distill this one from?

dsrtslnd235d ago

looks like I need a GB300 workstation

unrvl225d ago· 29 in thread

This is a huge blow to Anthropic/OpenAI/Google and a massive win for the rest of the world. The official API prices and speeds mean nothing for open source models.

CuriouslyC5d ago

thehamkercat5d ago

Kimi K2 had a vendor verifier: https://github.com/MoonshotAI/K2-Vendor-Verifier

(there's a table which shows comparison between vendors)

Also, it seems there's a general one as well (for all kimi models?): https://github.com/MoonshotAI/Kimi-Vendor-Verifier

cedws5d ago

OpenRouter should be penalising or banning for this.

3 more replies

unrvl225d ago

the 2 I mentioned both have a fairly large following, who run benchmarks and absolutely will spot issues.

stanac5d ago

> Some are even offering API rates at 3x lower than the official ZAI api rates

edit:

I see, croft [2] 8bit for $0.50/$0.08/$2.20

[1]: https://openrouter.ai/z-ai/glm-5.2

[2]: https://ai.nahcrof.com/pricing

benjiro295d ago

Neuralwatt ... When you reverse calculate the actual energy usage / price on a token basis, the gap is large.

Yes, it's FP8, but let's be honest: do we know for sure that even zai runs at FP16? I learned a long time ago with Claude and Codex how much cheating happens at the model level, even from the big boys.

1 more reply

scrlk5d ago

IME, unquantised -> FP8 is pretty much lossless. What matters more is having an unquantized KV cache - using an FP8 KV cache can result in a significant drop in quality.

3 more replies

Schiendelman5d ago

I've tried a number of these, and the learning curve is very steep compared to "install Claude Code and pay $100/mo". There is no way saving me $50/month matters compared to figuring that out.

andai5d ago

But it just works with Claude Code? They have a guide on their website.

https://docs.z.ai/devpack/tool/claude

Here's my setup. I add this to my .bashrc

export ZAI_API_KEY="your_key_here"

Then I just run claudez

pro tip the same thing works with deepseek https://api-docs.deepseek.com/guides/anthropic_api

Even more pro tip: Claude Code can set this up for you haha
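A rough sketch of what a `claudez`-style launcher could look like, written in Python. This is not the definition from the comment above (that part isn't shown). It assumes Claude Code reads `ANTHROPIC_BASE_URL` and `ANTHROPIC_AUTH_TOKEN` from the environment and that the base URL below matches the linked Z.ai guide, so check both against the docs. `build_env` and `main` are hypothetical helper names.

```python
import os
import shutil
import subprocess
import sys

# Assumption: Anthropic-compatible endpoint as described in the Z.ai guide linked above.
ZAI_BASE_URL = "https://api.z.ai/api/anthropic"


def build_env(base_env, api_key, base_url=ZAI_BASE_URL):
    """Copy base_env and point Claude Code at an Anthropic-compatible endpoint."""
    env = dict(base_env)
    env["ANTHROPIC_BASE_URL"] = base_url
    env["ANTHROPIC_AUTH_TOKEN"] = api_key
    return env


def main(argv):
    """Launch the `claude` CLI with the overridden environment; returns its exit code."""
    api_key = os.environ.get("ZAI_API_KEY")
    if not api_key:
        print("ZAI_API_KEY is not set", file=sys.stderr)
        return 1
    claude = shutil.which("claude")
    if claude is None:
        print("claude CLI not found on PATH", file=sys.stderr)
        return 1
    # Extra arguments are passed straight through to the CLI.
    return subprocess.call([claude, *argv], env=build_env(os.environ, api_key))
```

Saved as a script that ends with `raise SystemExit(main(sys.argv[1:]))`, this would behave much like a shell alias. Passing a different `base_url` to `build_env` is how the same shape might be reused for another Anthropic-compatible endpoint.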

2 more replies

re-thc5d ago

> There's no installer.

There's ZCode (https://zcode.z.ai). Which is like the Codex App.

That's as "easy" as it is for non-devs that you're complaining about.

2 more replies

CamperBob25d ago

> It's also very hard to figure out how to run a model like this. There's no installer.

Yes, there is. It's called Claude Code. Point it at the HuggingFace URL and say "Download these weights and build whatever is needed to run them, then test the model."

1 more reply

chillfox5d ago

install opencode, then either pay $10 for their plan, or add an openrouter api key.

gerryf25d ago

I agree with this.

I'd pay for an out of the box solution. i.e. an Installer with updates

cedws5d ago

In my org everyone is extremely Claude-pilled to the point you’d think it’s the only LLM that exists, purely because it caters to non-engineers within enterprises.

unrvl225d ago

I cancelled my claude sub after realizing I can burn 300m tokens a day of this quality, for $50 a month.

spelk5d ago

Which coding plan are you using? How are you finding it?

embedding-shape5d ago

> Why aren't more people talking about this?

sinatra5d ago

I've tried Chinese open models few times before. They were fine, but they didn't come close to the benchmarks they were claiming.

enraged_camel4d ago

shostack5d ago

Which of those providers are:

1. Keeping your data private and in the US

2. Not training on it

3. Not quantizing the model

4. Offering reasonable latency and rate limits

SyneRyder4d ago

OpenRouter has a list of providers, looks like NovitaAI would meet those criteria. Though not for $50/mth for 80/M tokens, which I assume is the Z.ai subscription pricing.

https://openrouter.ai/z-ai/glm-5.2

https://novita.ai/models/model-detail/zai-org-glm-5.2

knollimar5d ago

Isn't it closer to sonnet?

RussianCow5d ago

With that said, I'm excited to try GLM 5.2 because I still end up reaching for Opus and GPT 5.5 for many tasks because the open models tend to get stuck more often on complex problems.

1 more reply

redox995d ago

Definitely opus level for coding.

2 more replies

Hamuko5d ago

I’m not that interested in models that I can’t run on my desktop for ~0€, which is my AI budget.

andai5d ago

Electricity cost seems to be about $30/month for a 32B model on a GPU. It's probably better on Apple hardware.

https://github.com/QuantiusBenignus/Zshelf/discussions/2

Not accounting for hardware, of course :)

2 more replies

igravious5d ago

Cool beans. You're not the target audience then.

1 more reply

anuramat5d ago

> unlimited tokens for $50 a month

link?

> Why

more labs need to figure out whatever anthropic did to destroy everybody else on frontiercode bench

CuriouslyC5d ago

1 more reply

Tiberium5d ago· 28 in thread

I know it's hard to improve on that, but now that their models are good enough at raw intelligence, I think this should become a higher priority task.

Of course if you convert those values to actual request cost, GLM 5.2 will probably beat GPT 5.5/Opus 4.8, but speed matters for a lot of people, I think.

benjiro295d ago

GLM 5.2 Max = Opus 4.8 Max in thinking behavior. The thinking chain is so similar, and so is the amount of token usage on the output.

In essence, GLM 5.2 is Opus 4.8's little brother, at a way, WAY cheaper price.

There has been really no training on Opus models going on, really, none i tell you! /sarcasm

matheusmoreira5d ago

> GLM 5.2 Max = Opus 4.8 Max in thinking behavior

This is insane! I can't wait until technology progresses to the point we can run these things on consumer hardware!

3 more replies

FooBarWidget5d ago

maxdo5d ago

looking at the score this is rather a gemini 3.5 flash competitor, yes, for cheaper, but distance to opus and fable is as big as their price diff.

vitalyan1235d ago

11 more replies

alexjplant5d ago

> It seems to really be a nice step-up and is getting quite close to the frontier.

Maybe it's the harness and I'd have even greater success with OpenCode and Anthropic, but I think it safe to say that Anthropic's moat is evaporating.

carter20994d ago

vorticalbox5d ago

This is a problem I find with Opus: it will spend so long thinking and then go “but wait, what if”

To the point where I stop it and simply tell it to “start writing code, you can work it out as you go along”

Seems writer's block also affects LLMs

robertkarl5d ago

https://arxiv.org/abs/2606.00206

3 more replies

giancarlostoro5d ago

Another thing I tell Claude to do is to not guess, but look at documentation, it messes up a lot less, might use some tokens reading docs, but at least it has a higher success rate code wise.

1 more reply

mikeocool5d ago

Seriously. Whenever I read the thinking output I get mad and turn down effort to medium or low.

Just output the code and we’ll work through it!

1 more reply

epolanski5d ago

Fable was 20 times worse on that.

It's clear it was the vibe coding model: like no other model before, it fully turned you into its assistant instead of the other way around.

2 more replies

drob5185d ago

Qwen is notorious for this, too. It’ll sometimes spin in a long loop of “But wait…” paragraphs.

thinkingtoilet5d ago

h14h5d ago

Hopefully the recent work Moonshot did with Kimi K2.7 Code trickles in to the other open-model labs.

Per AA, while K2.7 Code is roughly on par w/ K2.6 in terms of intelligence, it uses half the output tokens to get there.

h14h3d ago

I've been doing some testing with GLM 5.2 on Fireworks and it looks like the "High" reasoning level uses fewer tokens than even K2.7 Code by a considerable margin (roughly half).

bertili5d ago

This is GLM 5.2 Max. GLM 5.2 High uses less than half[1] the tokens.

[1] https://z.ai/blog/glm-5.2

Tiberium5d ago

Yes, but the Artificial Analysis result is also from GLM 5.2 (max), not high.

1 more reply

cmrdporcupine5d ago

> Of course if you convert those values to actual request cost, GLM 5.2 will probably beat GPT 5.5/Opus 4.8, but speed matters for a lot of people, I think.

GLM5.2 ends up being far more expensive than I thought it would be when I tried it on openrouter. I ground through $5 USD worth of tokens quite quickly.

And this was high, not max.

guelo5d ago

Using these open models really makes you realize how subsidized Anthropic and OpenAI's subscription plans are.

1 more reply

esafak5d ago

abgruszecki4d ago

robmccoll5d ago

gbingles5d ago

2 more replies

rdsubhas5d ago

As per stats in other comments, it is frontier, not close to frontier.

xyzsparetimexyz5d ago

Reminiscent of https://en.wikipedia.org/wiki/Portia_(spider)

HWR_144d ago

I thought you could not compare tokens across models because their cost and speed were so different.

nurumaik5d ago

You asked for maximum effort, you got maximum effort

kristopolous5d ago· 16 in thread

I have a script that ranks these based on codingindex from Artificial Analysis.

All it does is pull a JSON from their main table page and parse out the fields I care about (coding).

There used to be a mailing list associated with it but eh ... there wasn't much interest. I use the script every day though.

Current partial output

  score  age  size name
  47.1   58  large Kimi K2.6
  47.5   54  large DeepSeek V4 Pro (Reasoning, Max Effort)
  47.5   70    -   Muse Spark
  47.6   132   -   Claude Opus 4.6 (Non-reasoning, High Effort)
  47.8   205   -   Claude Opus 4.5 (Reasoning)
  48.1   132   -   Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
  48.6   55    -   GPT-5.5 (Non-reasoning)
  48.7   188   -   GPT-5.2 (xhigh)
  50.1   29    -   Qwen3.7 Max
  50.7   1   large GLM-5.2 (max)
  50.9   120   -   Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
  51.5   92    -   GPT-5.4 mini (xhigh)
  52.1   55    -   GPT-5.5 (low)
  52.5   62    -   Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
  53.1   132   -   GPT-5.3 Codex (xhigh)
  53.1   62    -   Claude Opus 4.7 (Non-reasoning, High Effort)
  55.5   118   -   Gemini 3.1 Pro Preview
  56.2   55    -   GPT-5.5 (medium)
  56.7   20    -   Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
  57.2   104   -   GPT-5.4 (xhigh)
  58.5   55    -   GPT-5.5 (high)
  59.1   55    -   GPT-5.5 (xhigh)
  62     8     -   Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)

To see everything, run it like so

  $ curl day50.dev/art-analysis.sh | bash

The repo: https://github.com/day50-dev/aa-eval-email
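For readers who would rather not pipe a remote script into bash, the ranking step described above can be sketched in a few lines of Python. This is not the script from the repo: the field names `coding_index` and `age_days` are hypothetical (the real JSON schema isn't shown here), and the sample rows are copied from the partial output above.

```python
def rank_by_coding_index(models, descending=False):
    """Drop entries without a score and sort so the best model lands at the bottom by default."""
    scored = [m for m in models if m.get("coding_index") is not None]
    return sorted(scored, key=lambda m: m["coding_index"], reverse=descending)


def format_rows(models):
    """Render a plain-text table similar to the output shown above."""
    lines = [f"{'score':>6} {'age':>4}  name"]
    for m in models:
        lines.append(f"{m['coding_index']:>6.1f} {m['age_days']:>4}  {m['name']}")
    return "\n".join(lines)


# Hypothetical records; only the numbers and names come from the table above.
sample = [
    {"name": "GLM-5.2 (max)", "coding_index": 50.7, "age_days": 1},
    {"name": "Claude Opus 4.8 (Adaptive Reasoning, Max Effort)", "coding_index": 56.7, "age_days": 20},
    {"name": "Kimi K2.6", "coding_index": 47.1, "age_days": 58},
    {"name": "Unscored model", "coding_index": None, "age_days": 3},
]

print(format_rows(rank_by_coding_index(sample)))
```

Ascending order is the default here so that, in a terminal, the highest-scoring rows end up next to the prompt.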

some key takeaways:

* open models are on about a 4-7 month lag right now depending on how you want to measure it

* if this keeps up, you might see an open-weights model doing claude fable 5 level work before the new year.

if people sign up for the free mailing list (that just does this) I'll go and put it back on ... emails when new model evals drop - it was pretty useful.

papersail5d ago

  score  age  size   name
  62.0   8    -      Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
  59.1   55   -      GPT-5.5 (xhigh)
  58.5   55   -      GPT-5.5 (high)
  57.2   104  -      GPT-5.4 (xhigh)
  56.7   20   -      Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
  56.2   55   -      GPT-5.5 (medium)
  55.5   118  -      Gemini 3.1 Pro Preview
  53.1   132  -      GPT-5.3 Codex (xhigh)
  53.1   62   -      Claude Opus 4.7 (Non-reasoning, High Effort)
  52.5   62   -      Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
  52.1   55   -      GPT-5.5 (low)
  51.5   92   -      GPT-5.4 mini (xhigh)
  50.9   120  -      Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
  50.7   1    large  GLM-5.2 (max)
  50.1   29   -      Qwen3.7 Max
  48.7   188  -      GPT-5.2 (xhigh)
  48.6   55   -      GPT-5.5 (Non-reasoning)
  48.1   132  -      Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
  47.8   205  -      Claude Opus 4.5 (Reasoning)

tcp_handshaker5d ago

Short comments...

- GPT 5.5 is consistently the best, an opinion that gets me constant downvotes here from the Anthropic Marketeer strike force...

- China is going to eat the US lunch on AI

- If Zuckerberg could be fired, after spending a total of $235 billion on AI and having NOTHING to show for it... should he be fired?

9 more replies

christoff125d ago

Lol thank you for sorting.

Are the scores here normalized such that each point difference is equidistant?

papersail5d ago

  rank  score  age  size   name
  1     62.0   8    -      Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
  2     59.1   55   -      GPT-5.5 (xhigh)
  3     58.5   55   -      GPT-5.5 (high)
  4     57.2   104  -      GPT-5.4 (xhigh)
  5     56.7   20   -      Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
  6     55.5   118  -      Gemini 3.1 Pro Preview
  7     53.1   62   -      Claude Opus 4.7 (Non-reasoning, High Effort)
  8     53.1   132  -      GPT-5.3 Codex (xhigh)
  9     52.5   62   -      Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
  10    51.5   92   -      GPT-5.4 mini (xhigh)
  11    50.9   120  -      Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
  12    50.7   1    large  GLM-5.2 (max)
  13    50.1   29   -      Qwen3.7 Max
  14    48.7   188  -      GPT-5.2 (xhigh)
  15    48.1   132  -      Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
  16    47.8   205  -      Claude Opus 4.5 (Reasoning)
  17    47.6   132  -      Claude Opus 4.6 (Non-reasoning, High Effort)
  18    47.5   70   -      Muse Spark
  19    47.5   54   large  DeepSeek V4 Pro (Reasoning, Max Effort)
  20    47.1   58   large  Kimi K2.6
  21    47.1   29   -      Gemini 3.5 Flash (minimal)
  22    46.7   449  -      Gemini 2.5 Pro Preview (Mar' 25)
  23    46.5   211  -      Gemini 3 Pro Preview (high)
  24    46.5   16   -      Qwen3.7 Plus
  25    46.4   120  -      Claude Sonnet 4.6 (Non-reasoning, High Effort)
  26    45.6   5    large  Kimi K2.7 Code
  27    45.6   104  -      GPT-5.4 (low)
  28    45.5   56   large  MiMo-V2.5-Pro
  29    45.1   43   -      GPT-5.5 Instant (May 2026)
  30    45.0   29   -      Gemini 3.5 Flash (high)
  31    44.9   58   -      Qwen3.6 Max Preview
  32    44.7   216  -      GPT-5.1 (high)
  33    44.2   188  -      GPT-5.2 (medium)
  34    44.2   126  large  GLM-5 (Reasoning)
  35    43.9   92   -      GPT-5.4 nano (xhigh)
  36    43.4   71   large  GLM-5.1 (Reasoning)
  37    43.4   16   large  MiniMax-M3
  38    43.2   54   large  DeepSeek V4 Pro (Reasoning, High Effort)
  39    43.0   188  -      GPT-5.2 Codex (xhigh)
  40    42.9   76   -      Qwen3.6 Plus
  41    42.9   205  -      Claude Opus 4.5 (Non-reasoning)
  42    42.6   182  -      Gemini 3 Flash Preview (Reasoning)
  43    42.2   99   -      Grok 4.20 0309 (Reasoning)
  44    42.1   56   large  MiMo-V2.5
  45    41.9   91   large  MiniMax-M2.7
  46    41.4   91   -      MiMo-V2-Pro
  47    41.3   121  large  Qwen3.5 397B A17B (Reasoning)
  48    41.0   48   -      Grok 4.3 (high)
  49    40.5   71   -      Grok 4.20 0309 v2 (Reasoning)
  50    40.5   342  -      Grok 4
  51    39.8   54   large  DeepSeek V4 Flash (Reasoning, High Effort)

2 more replies

bel85d ago

you left some models out like DeepSeek and Kimi, for example.

2 more replies

alecco5d ago

Consider using descending score order (best on top)

kristopolous5d ago

then I'd have to scroll up over 500 lines after running it every time to see what I care about.

But if that's your thing, here you go: https://github.com/day50-dev/aa-eval-email/commit/1853be6461...

Add an argument (any argument) and it will be sorted as you specified. It just works as a toggle flipping the order ... so literally any string will do.

The original link has been updated accordingly with the new code.

1 more reply

sosodev5d ago

bodhi_mind5d ago

Cool project! Side note: Kind of a bad practice imo to ask people to blindly execute bash from an unknown source.

slig5d ago

Thanks for sharing. I'm curious: why didn't you sort with the score descending?

kristopolous5d ago

Because it's currently 511 lines. Why would I want to scroll up to see the stuff I care about? Don't you want the relevant stuff to be right there in front of you?

2 more replies

fridder5d ago

Not OP but if you run this from the CLI it does make the ordering make a little more sense

snsnbsne5d ago

Because programmers can’t figure out how to have a CLI that prints in a normal order, with the newest stuff on top instead of on the bottom.

Setup a fresh new large monitor. Open CLI. Run command. Watch output at the bottom of your screen. Keep watching the bottom of your screen for the rest of the day.

Sure you can tile windows and it helps but come on. Just have the command/input section in the bottom and the “output” on top. Keep the command bit on the bottom.

jarjoura5d ago

Seems legit. My experiments with GLM-5.2 so far have resulted in strange hallucinations in the tiniest of places. Like a wrong variable name.

It seems like it's up for the task of complex code, but those little paper-cuts are scary to me. I wouldn't trust this model for anything remotely serious.

scrollop5d ago

Would be interesting to see where gpt 5.5 pro extended is.

drob5185d ago

Maybe your script could sort based on score.

mrngld5d ago· 10 in thread

https://artificialanalysis.ai/agents/coding-agents?coding-ag...

undecidabot5d ago

It got 46.2 on DeepSWE in Z.ai's own run[1]. That would put it between Opus 4.7 xhigh and Opus 4.8 medium.

[1] https://z.ai/blog/glm-5.2

mrngld5d ago

1 more reply

cmrdporcupine5d ago

I gave GLM 5.2 a spin on openrouter yesterday and it was mostly fine but it racked up $5 in token use in 30 minutes of (relatively slow) work.

It's easily 4x the cost of DeepSeek V4 but I didn't actually feel the results were that much better. I had GPT 5.5 in Codex review it after it was done and there was plenty of slop to go around.

Having better luck with MiniMax M3, from a cost/benefit ratio.

pjerem5d ago

I really like DeepSeek V4 Pro. It's pretty smart and I get so much usage out of it on a $20 Ollama cloud plan.

zooming5d ago

Try MiMo-2.5, I'm having astonishing success with it in opencode for cents per day. Not even the pro model.

1 more reply

re-thc5d ago

> I had GPT 5.5 in Codex review it after it was done and there was plenty of slop to go around.

GPT can find fault in everything and anything including its own work.

2 more replies

lukewarm7075d ago

with open models you can get a subscription with privacy, at the same cost as codex.

openai, google and anthropic subscriptions are not available with privacy.

looking at the link there it's interesting that going from cursor cli to codex cli takes gpt 5.5 from 7th to 3rd. but they didn't run an open model in codex.

so, hard to say it's for sure a model benchmark. maybe open models are just shit at swe agent harness...it's not the most parsimonious explanation though.

vadansky5d ago

> with open models you can get a subscription with privacy

Unless you're running it locally, aren't you just trusting some other entity?

3 more replies

ttul5d ago

https://deepswe.datacurve.ai/

Fable 5 is cool and all, but we have not yet seen GPT-5.6.

slagfart4d ago

GLM5.2 isn't even on this benchmark

1 more reply

CubsFan10605d ago· 10 in thread

Knowing very little about how to run these, how close are we to medium or larger businesses starting to buy hardware to run models like this to keep the models local?

It’s expensive, and not as capable as the frontier models, but would have some pretty big benefits around privacy and agency.

wongarsu5d ago

Not everyone is willing (or even legally able) to send their trade secrets to OpenAI or Anthropic

user439285d ago

While certainly there are such cases with trade secrets, it's worth noting that even large banks typically have a provider like Azure or AWS onboarded.

There they can deploy these models while using the existing legal frameworks.

CubsFan10605d ago

What kind of hardware/price does it take to run those?

2 more replies

moffkalast5d ago

MikhailTal5d ago

petesergeant5d ago

Unless you have genuine national security concerns, you’d be better off just negotiating a commercial agreement with privacy protections with a couple of existing vendors.

CubsFan10605d ago

I think that's true until it isn't, which may end up being the problem. Fable/Mythos doesn't fall under the ZDR agreements with Anthropic. And I'm curious if others will follow suit.

tancop5d ago

1 more reply

Havoc5d ago

It’s a ~750B model so still a hell of a lot of vram

Would need to be a pretty determined medium biz
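To put a rough number on "a hell of a lot of vram", here is a back-of-the-envelope estimate in Python. It assumes the 753B total parameter count quoted elsewhere in this thread and counts the weights only; KV cache, activations, and runtime overhead come on top, so real requirements would be higher. `weight_memory_gb` is a hypothetical helper name.

```python
def weight_memory_gb(total_params_billion, bits_per_weight):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = total_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Assumption: 753B total parameters, taken from the figure quoted in this thread.
for label, bits in [("FP16", 16), ("FP8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: ~{weight_memory_gb(753, bits):,.0f} GB for weights alone")
```

Even at FP8 that is roughly 753 GB before any cache, which is why this lands in multi-GPU server territory rather than on a single workstation card.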

re-thc5d ago

> how close are we to medium or larger businesses starting to buy hardware to run models like this to keep the models local?

Years.

Even Microsoft said they don't have enough for Github and need to call Amazon.

Getting a few even at decent prices is hard. Unless the shortage eases...

simonw5d ago· 9 in thread

I was surprised that GLM 5.1/5.2 are not vision models - they are text input only.

That's actually pretty uncommon these days. All of the OpenAI/Anthropic/Gemini models accept images, and so do the other leading open weight families - Gemma 4, Qwen 3.6, Kimi 2.x.

In GLM's case image input would be useful because it's a model that scores very highly for tasks like web design, but without image input it can't take a screenshot and output HTML+CSS.

Don't get me wrong, GLM is a phenomenal model, but the image thing is a bit of a gap.

0xbadcafebee5d ago

Configure a subagent in your coding harness to spin up a new sub-session with any vision model for those tasks and feed the result back to the main model. No need for "one model that does everything"
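The idea can be sketched as a small bridge function in Python. Nothing here is a real harness API: `describe_image` and `ask_text_model` are hypothetical callables standing in for whatever vision and text model clients are in use, and the stubs below exist only so the example runs on its own.

```python
def answer_with_vision_bridge(task, image_path, describe_image, ask_text_model):
    """Route an image through a vision model, then hand its description to a text-only model."""
    description = describe_image(
        image_path, f"Describe everything relevant to this task: {task}"
    )
    prompt = (
        "A separate vision model described the attached image as follows:\n"
        f"{description}\n\n"
        f"Task: {task}"
    )
    return ask_text_model(prompt)


# Stub callables standing in for real model clients (hypothetical).
def fake_vision(path, instruction):
    return f"[description of {path}]"


def fake_text_model(prompt):
    return f"received {len(prompt)} chars"


print(answer_with_vision_bridge(
    "Recreate this page in HTML+CSS", "shot.png", fake_vision, fake_text_model
))
```

This is a single pass. A task where the text model needs to ask follow-up questions about the image would need a loop around `describe_image`.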

ricardobeat5d ago

That doesn’t work well in a lot of scenarios. The text LLM doesn’t know what to look for in an image before it sees a description, you might need multiple rounds of back and forth.

1 more reply

WASDx5d ago

Are you suggesting it should summarize the image in text or generate it in HTML or something else?

_pdp_5d ago

simonw5d ago

One of my tests for a new model is dumping in a screenshot of a web page and seeing if it can recreate it from scratch in HTML and CSS.

Even the local models I run on my Mac are getting surprisingly good at that now.

1 more reply

tiahura5d ago

Using llms to generate docx. Being able to rasterize and review is an important part of the process.

x3cca5d ago

I've been using Google AI Studio as a free vision bridge. Gemma 31B is dummy capable at vision and at 1500 rpd it's basically unlimited.

abby30104d ago

Agreed, that's actually one step that will make people adopt it widely for customer facing AI Agent!

ashenke5d ago

I had the same reaction with Deepseek V4! It would be more useful as a vision model.

rahidz5d ago· 8 in thread

segmondy5d ago

dryarzeg5d ago

mordae5d ago

They do not and it sucks for certain tasks.

It also means that if they actually trained with vision, they'd be on par with Anthropic models as vision seems to improve model performance across the board even for non-vision tasks.

osti5d ago

freigeist795d ago

it helps giving them a cli vision tool (curl to openrouter vision model for example)

adrian_b5d ago

That's right, but there are other recent open weights and relatively big LLMs that are multimodal, e.g. MiniMax-M3.

With open weights LLMs, it is affordable to use many different models, each for whatever it is better.

0xbadcafebee5d ago

Havoc5d ago

They have a separate VL model but never tried it

CuriouslyC5d ago· 5 in thread

Havoc5d ago

> while being a little bit verbose

Discovered today that they set reasoning effort to max by default. So that’s probably why

igravious5d ago

elwebmaster5d ago

You are not alone. How about GPT 5.5? Does it come close to Fable 5?

2 more replies

sdesol5d ago

> GLM writing

What they still struggle immensely with is the writing which has too many nuances but they are truly getting better.

andai5d ago

This is my workflow. And then once a day I copy paste the code into the free Claude Sonnet so it comes out actually readable.

tensegrist5d ago· 4 in thread

am i missing something?

OtherShrezzing5d ago

I think they’ve just picked poor peer examples. Instead of choosing other models near 5.2 on the intelligence scale, they’ve picked some open models from further down the scale.

acchow5d ago

pareto frontier does not mean cheapest.

xiaoyu20065d ago

Some models are heavily subsidized. Total params & active params are better measurement of inference cost.

simianwords5d ago

No models are subsidised -- there are lots of third party hosting services that will still run at breakeven/profit. (except Deepseek after discount)

1 more reply

Pragmata5d ago· 4 in thread

So this basically means we will have a near opus level model able to be run locally in the next couple of months right?

QWEN 3.6 27b is already pretty good, but it should be possible to get a better option now that runs on the same hardware, right?

CamperBob25d ago

XCSme5d ago

Which Opus?

GLM-5.2 is already close to Opus-4.7 level:

https://aibenchy.com/compare/anthropic-claude-opus-4-7-mediu...

XCSme5d ago

Oh, or you meant a smaller model than GLM-5.2 with similar capabilities?

2 more replies

segmondy5d ago

1 more reply

ponyous5d ago· 4 in thread

Here are the results compared to Gemini 3.5 Flash:

    Model + config          CodeErr/gen   Cost/gen   Median time   Quality
    gemini-3.5-flash, low      0.71        $0.18        68s       baseline
    GLM 5.2, reasoning high    0.61        $0.18       289s         -6.0%
    GLM 5.2, reasoning off     1.52        $0.10       126s        -13.6%

Edit: I recently ran evals with Kimi 2.7 and MiniMax-M3, and this is clearly the open source SOTA model, by far.

NiloCK5d ago

Very interested in this! Can you share more about the modelling method (eg, three js?), the task list, and outputs here?

I think there's probably some good juice to squeeze in terms of spatial awareness by doing a benchmark something like

- give 3d modelling task

- render and snapshot from a variety of angles

- feed to third-party vision model for a "what is this" type query

- grade on end-to-end accuracy

Bonus points for asking the vision model something like "how beautiful is this 1-10".
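The steps above could be wired together roughly like this in Python. It is only a sketch of the proposed loop: `generate_model`, `render_views`, and `identify` are hypothetical callables for the model under test, the renderer, and the third-party vision model, and exact string matching on the label is a deliberately crude grading rule.

```python
def score_task(prompt, expected_label, generate_model, render_views, identify):
    """Fraction of rendered views that the vision model identifies as the expected object."""
    model = generate_model(prompt)   # step 1: the 3D modelling task
    views = render_views(model)      # step 2: snapshots from several angles
    if not views:
        return 0.0
    hits = sum(
        1 for view in views
        if identify(view).strip().lower() == expected_label.lower()  # step 3: "what is this"
    )
    return hits / len(views)         # step 4: end-to-end accuracy
```

A real grader would probably need fuzzier matching (synonyms, or a judge model), since a vision model might answer "mug" where the task said "coffee cup".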

ponyous5d ago

I don't have the eval results live yet, so I cannot share them yet.

Here is how I score adherence (and how AI did as well, but I tried methods where it would just give back a boolean "pass" or not):

    <0.2 → Poor – Misses core intent; largely irrelevant or incorrect.
    <0.4 → Weak – Partially relevant; significant omissions or errors.
    <0.6 → Fair – Covers main points but lacks completeness or precision.
    <0.8 → Good – Mostly accurate; minor gaps or deviations.
    <=1.0 → Excellent – Fully aligned; precise, comprehensive, and faithful to intent.
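The rubric above maps directly onto a small lookup. A minimal Python sketch, assuming scores are floats in the range 0 to 1 and that each band's upper bound is exclusive except the last (`adherence_label` is a hypothetical name, not part of the eval harness):

```python
# Upper bounds are exclusive, matching the "<0.2", "<0.4", ... bands above.
BANDS = [(0.2, "Poor"), (0.4, "Weak"), (0.6, "Fair"), (0.8, "Good")]


def adherence_label(score):
    """Map an adherence score in [0, 1] to its rubric label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    for upper, label in BANDS:
        if score < upper:
            return label
    return "Excellent"  # everything from 0.8 up to and including 1.0
```

Read this way, a score of exactly 0.8 counts as Excellent rather than Good, since the Good band is written as strictly below 0.8.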

Here is the scenario list (prompts are much more detailed):

    dragon-bottle-stopper
    editing-param-mid-conv
    editing-parametric-enclosure
    editing-swap-material-param
    editing-text-edit-cube
    multi-turn-bird-house
    multi-turn-dice-tower
    multi-turn-modular-planter
    multi-turn-phone-stand
    multi-turn-shelf
    one-shot-bookend
    one-shot-cable-clip
    one-shot-chess-queen
    one-shot-coaster
    one-shot-coffee-cup
    one-shot-dog-tag
    one-shot-dragon-figurine
    one-shot-hex-bracket
    one-shot-keychain-fob
    one-shot-low-poly-tree
    one-shot-pegboard-hook
    one-shot-pi4-case
    one-shot-threaded-jar

[0]: https://grandpacad.com

1 more reply

ComputerGuru5d ago

Would you be able to run it against Gemini Flash (not Lite) 3.0, high thinking?

ponyous5d ago

Absolutely. Running it now, will update this comment in about 30 mins.

Edit: Surprisingly very good results with 3.0 flash with high thinking.

Cost: $0.06

Duration: 3.22 min

Code Errors: 1.3 per attempt (meaning on average it had to retry 1.3 times)

Adherence was on par with 3.5 flash Low thinking

1 more reply

JustSkyfall5d ago· 4 in thread

The problem with these benchmarks is that the Chinese models tend to be incredible on paper, and absolutely terrible in practice :/

CuriouslyC5d ago

Mashimo5d ago

I have used GLM since version 4.8 I think and do enjoy using them. More than other models like Kimi or Deepseek. Though only tested them on smaller private projects.

1 more reply

bel85d ago

I beg to differ. I replaced a $40/mo GitHub Copilot subscription where I used Opus 4.6 and GPT 5.5 with a $10/mo opencode Go plan where I use mostly DeepSeek V4 Flash and testing MiMo 2.5.

I work on mid-sized projects currently (200k to 1M lines of code).

1 more reply

segmondy5d ago

1 more reply

kissgyorgy5d ago· 4 in thread

I tried it today through Openrouter and the API is atrocious. I got multiple rate limit and random errors every turn.

The model might be good, but if the API is so bad, it's effectively useless.

[1]: https://kasra.blog/blog/i-spent-1500-seeing-if-llms-could-ha...

segmondy5d ago

The entire point of this post is that it's open weights, you can run it yourself and don't have to deal with the API issues. You really do have that choice.

Havoc5d ago

That’s what happens when you offer something decent at a fraction of the price of opus - more demand than you can serve

ComputerGuru5d ago

Give it a few days and additional providers will be up and available on OpenRouter. Then the game of figuring out who’s not nuking the weights and neutering the quantization begins.

osti5d ago

I indeed got a few timeouts yesterday using the official API, I imagine for the coding plan users it'll be even worse.

XCSme5d ago· 3 in thread

In my tests[0] GLM-5.2 is not much better than GLM-5, and overall DeepSeek V4 Flash seems to be the better/more cost-effective choice:

[0]: https://aibenchy.com/compare/deepseek-deepseek-v4-flash-high...

XCSme5d ago

I think the problem, as can also be seen on other benchmarks, is that most models nowadays are focused more and more purely on tool calling and coding.

This means that models are losing more and more general and domain-specific knowledge.

Look at these graphs on Artificial Analysis; GLM-5.1 still performs similarly or better:

AA-Omniscience Accuracy: https://i.snipboard.io/5DYmpx.jpg

IFBench: https://i.snipboard.io/74kg0R.jpg

HDBaseT5d ago

Well, in that example it still seems the big players are increasing overall "intelligence" as Fable tops the list.

OpenAI has big incentives to improve general intelligence, as a large percentage of users use ChatGPT for support, finances, questions, etc., not just coding.

sourcecodeplz5d ago

man, i love dsv4-flash but i found its weaknesses in complex projects with multiple moving parts. tried kimi 2.6 and it understood and could work on the task. bigger is better..

hereme8885d ago· 3 in thread

Hmmm... GLM insists it's Gemini.

https://github.com/zai-org/GLM-5/issues/79

coder5435d ago

Claude Sonnet 4.6 identified itself as DeepSeek repeatedly: https://www.reddit.com/r/DeepSeek/comments/1rd5jw7/claude_so...

I tested this myself a few months ago, and confirmed that it was really happening.

bityard5d ago

adastra225d ago

Then why does it score better than any Gemini model?

gertlabs5d ago· 2 in thread

Data at https://gertlabs.com/rankings

nsoonhui5d ago

I really have to take your score with a grain of salt because Opus 4.5 does better than Opus 4.6

gertlabs4d ago

We find a lot of interesting anomalies with our benchmark that hold up under large sample sizes.

kingstnap5d ago· 2 in thread

According to many benchmarks this model is straight up frontier level and Zai seriously cooked. Some of these numbers are incredible.

Excited to see if this turns out to be an open-weight Opus 4.5 or better.

andai5d ago

The only benchmark that matters is your actual task.

I've had models that benched poorly but performed great. And I constantly see models near the top of AA that are terrible.

There doesn't necessarily seem to be a lot of overlap between benchmarks and real world usage. (Let alone common sense!)

As far as they go, though, these harder benchmarks match my experience more closely:

https://deepswe.datacurve.ai/

and https://cognition.ai/blog/frontier-code

Where we see "top" models drop way down in score when given longer tasks.

That being said, I've had a reasonably pleasant time with GLM-5.2 so far. (And have had an OK time with DeepSeek as well.)

By the time I'm done testing all the Chinese models, they'll be obsolete :)

adastra225d ago

According to reports in this thread it is somewhere between Opus 4.7 and 4.8. This is effectively frontier.

_pdp_5d ago· 2 in thread

I am helpful.

DeepSeek V4 has been quite amazing in our workloads and it operates at a fraction of the cost. I have not tried GLM 5.2 but it seems that it hits a sweet spot.

LUmBULtERA5d ago

Your system prompt is showing.

kreddor5d ago

Maybe he meant "hopeful"...

daniban5d ago· 2 in thread

I'm curious what harness everyone is using for these? I want to start testing some of these open models but don't know what tools people use to get them working "agentically".

gorbypark5d ago

I am using OpenCode with the DeepSeek API with some pretty good results.

zackify5d ago

pi.dev and ask ai to add features you miss from claude or codex. i configure keyboard shortcuts and swap models easily

piterrro5d ago· 2 in thread

DeepSeek v4 pro is still 10x cheaper than GLM-5.2 and the quality is still enough for 95% of coding tasks.

enraged_camel5d ago

People always say stuff like this, but it is misleading. The reason it's misleading is because that remaining 5% makes a huge difference, and is where most of the value of using AI agents lies.

0xbadcafebee5d ago

....so use DeepSeek v4 Pro for 95% of your coding tasks, and GLM 5.2 for the other 5%? You don't need to stick to one model.
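
The arithmetic behind this split can be sketched roughly as follows. This is a minimal sketch, not a measured result: it takes the "10x cheaper" and "95%" figures from the comments above at face value and assumes every task consumes the same number of tokens on either model. The function name `blended_cost` is made up for illustration.

```python
def blended_cost(cheap_share: float, cheap_price_ratio: float) -> float:
    """Cost of a two-model split, relative to sending everything to the pricier model.

    cheap_share:       fraction of tasks routed to the cheaper model (0..1)
    cheap_price_ratio: cheaper model's price divided by the pricier model's price
    Assumption: every task uses the same number of tokens on either model.
    """
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    expensive_share = 1.0 - cheap_share
    return cheap_share * cheap_price_ratio + expensive_share * 1.0


# 95% of tasks on a model that costs one tenth as much, 5% on the pricier one.
relative = blended_cost(cheap_share=0.95, cheap_price_ratio=0.1)
print(f"{relative:.3f}")  # 0.145
```

Under these assumptions the split costs about 14.5% of running everything on the pricier model, and the 5% of hard tasks account for roughly a third of the total spend (0.05 / 0.145).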

eckelhesten5d ago· 2 in thread

Sure, but whatever you do, don't buy their (Z.ai) lite plan.

I feel like i threw 15 dollars in the sea. I'm getting rate limited after 3-4 prompts. You get way less value than just paying 25 dollars for Claude or OpenAI models.

granra5d ago

How are you using it? I have the lite plan and I've only ever maxed my weekly usage a few hours before reset. I will concede that I'm not a super heavy LLM user but it's been really good for me.

My workflow is usually:

- read file. I want to achieve X, how do? Do not implement anything.

- I would do a, b and c

- sketch a brief implementation of your suggestion

- <code> (not writing files yet)

- instead of your approach x, wouldn't it make sense to instead do z? What would that look like?

- <code>

- nice, implement this

- starts writing files, run tests, etc.

Alifatisk5d ago

Did you consider their peak hours and model usage multiplier? Read the green box https://docs.z.ai/devpack/overview#usage-instruction

I had the Lite plan, I NEVER maxed out the quota because I considered these things. If I, for example, switched over to GLM-5-Turbo, then I could've easily burned through quota.

adithyaharish5d ago· 2 in thread

Why don't all open source LLMs have open weights like this model?

bigyabai5d ago

https://en.wikipedia.org/wiki/Artificial_scarcity

Retro_Dev5d ago

m-dot-reviews5d ago· 1 in thread

For anyone who's interested, I've put together a simple site for sharing ratings/opinions on models at a task-specific granularity. https://model.reviews/

I'm not sure how best to get the corpus bootstrapped (i.e. people will likely only visit/post on the site if there's already activity), so posting it here for anyone who'd like to contribute.

swingboy5d ago

I get a 500 when clicking “Explore the Models”

Imustaskforhelp5d ago· 1 in thread

I have been trying out GLM 5.2 and I am really impressed by it for the most part.

To everyone on Hacker News: I am curious which agent harness you are using it with.

Now I have also started using oh-my-pi and I found it to be faster compared to OpenCode.

I am unsure how much of the difference is real and how much is placebo, but what is your opinion regarding the best agent harness for GLM 5.2?

Alifatisk5d ago

I just used CC with GLM, I was satisfied.

dizhn5d ago· 1 in thread

FYI.. This is coming with 3mil GLM 5.2 tokens right now. (Needs login. Google SSO fine) https://zcode.z.ai/en

Alifatisk5d ago

Where can I read more about the coming 3mil GLM 5.2?

guybedo5d ago· 1 in thread

It's probably a good model but they used GLM 5.1 to code their infra.

I signed up to their max plan yesterday, did some light coding work, and i'm at 180M tokens used and 40% weekly quota gone.

Even when tokenmaxxing on the Claude Max or GPT $200 plan, i couldn't get more than 20% quota gone per day.

bigyabai5d ago

Are you using it for long context windows? I burn through my 5hr quota with GLM almost instantly on 200k+ contexts, but if I reset every ~100k or so it's much more manageable.
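
One way to see why resetting helps is a back-of-the-envelope count of input tokens. This is only a sketch under loud assumptions: the whole context is re-sent as input on every turn, each turn adds a fixed number of tokens, there is no prompt-caching discount, and a reset starts from an empty context. How Z.ai actually meters quota is not stated in the thread, and `total_input_tokens` is a made-up helper.

```python
from typing import Optional


def total_input_tokens(turns: int, tokens_per_turn: int,
                       reset_at: Optional[int] = None) -> int:
    """Sum the input tokens sent over a session.

    Assumptions (not from any provider documentation): the full context is
    re-sent every turn, and a reset starts again from an empty context.
    """
    context = 0
    total = 0
    for _ in range(turns):
        # Start a fresh session if this turn would push past the reset limit.
        if reset_at is not None and context + tokens_per_turn > reset_at:
            context = 0
        context += tokens_per_turn
        total += context  # the whole context counts as input for this turn
    return total


no_reset = total_input_tokens(turns=40, tokens_per_turn=10_000)
with_reset = total_input_tokens(turns=40, tokens_per_turn=10_000, reset_at=100_000)
print(no_reset, with_reset)  # 8200000 2200000
```

In this toy model, 40 turns of 10k tokens each cost about 3.7x more input tokens when the context is allowed to grow to 400k than when it is reset every 100k, because the per-turn cost grows with the context length.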

RDTvlokip5d ago· 1 in thread

sinuhe695d ago

Recent incident with the Rio 3.5 model clearly shows that many coding models are specifically trained/fine tuned for the benchmarks.

lousken5d ago· 1 in thread

Cerebras really needs to have this on their API list (if they even still exist).

Marciplan5d ago

they went public a few weeks ago

sourcecodeplz5d ago· 1 in thread

1m context btw.

Alifatisk5d ago

And apparently, actual support for 1M context window, not just theoretical.

wongarsu5d ago

It's also third best overall on "AA-Omniscience Non-Hallucination Rate", far higher than DeepSeek, GPT 5.5 or Fable.

That's the one benchmark that allows LLMs to answer "I don't know" and punishes them for trying to bullshit their way through the questions

SwellJoe5d ago

https://swelljoe.com/post/will-it-mythos/

xiaoyu20065d ago

This open source model is quite near SOTA with only 700B/40B MoE. Truly efficient.

osti5d ago

Fun fact: Zhipu (aka Z.ai, Knowledge Atlas, etc.), the company that made GLM, is listed on the Hong Kong stock exchange and is up over 10x since its IPO at the beginning of this year.

davidwritesbugs5d ago

leemoore5d ago

ramon1565d ago

I haven't extensively used 5.2 yet, but it seems a lot better.

tomerbd4d ago

I code daily with AI: real programming tasks, professional work, real customers. I use the following:

- codex 5.5 medium - best results less hand holding medium speed

- opus 4.8 max - mediocre with hand holding medium speed

- glm 5.2 max - mediocre with hand holding and super slow

- composer 2.5 - mediocre with hand holding and super fast

I use all of them, since I run multiple coding agents in parallel. Disclosure: I use rexide, which we created to run all these agents in parallel with good visibility and feedback.

bizer4d ago

redbell5d ago

Launch announcement from four days ago: https://news.ycombinator.com/item?id=48518684

The requirements to run this model locally: https://www.reddit.com/r/LocalLLaMA/comments/1u8ai2a/glm52_i...

mesmertech5d ago

https://mesmer.tools/benchmarks/ai-video-generation

gauravvij1374d ago

They've come along pretty far now.

I remember when there was hype around GLM 5 reaching great heights on benchmarks but eventually failing on practical coding and reasoning tasks. I guess this time the hype is real.

jauntywundrkind5d ago

Also so wild that it's relatively compact. 753B-40A is so reasonable, shows incredible scaling in what the model can do, without just throwing heaps of new parameters in.

This is silly but I dig how 753 is very close to 745, which is the watts in a HP. 1bHP parameter model. Silly, but I enjoy it.
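
For a sense of scale, here is a rough weight-only memory estimate. It is a sketch that takes the 753B total / 40B active figures from the comment above at face value and counts only the weights at a given precision; KV cache, activations, and runtime overhead are ignored, so real requirements will be higher. The helper names are made up for illustration.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def active_fraction(total_billion: float, active_billion: float) -> float:
    """Share of parameters used per token in a mixture-of-experts model."""
    return active_billion / total_billion


# Parameter counts as quoted in the thread, not independently verified.
TOTAL_B, ACTIVE_B = 753, 40

for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(TOTAL_B, bits):.1f} GB")
# 16-bit: 1506.0 GB
# 8-bit: 753.0 GB
# 4-bit: 376.5 GB

print(f"active per token: {active_fraction(TOTAL_B, ACTIVE_B):.1%}")  # active per token: 5.3%
```

All of the weights still have to be stored somewhere even though only about 5% are used per token, which is why the storage requirement tracks the total parameter count rather than the active one.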

alansaber5d ago

aunty_helen4d ago

Before you go and sign up to the max plan like I did, they are obviously struggling for capacity. I'm getting API rate limited and 429'd on a simple "hello"
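
A common client-side mitigation for 429s is retrying with exponential backoff and jitter. The sketch below is generic and not tied to any real SDK: `RateLimited` is a hypothetical exception that your own client wrapper would raise on an HTTP 429, and `request` is any zero-argument callable that performs the call. It will not help if the provider is simply out of capacity.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error raised by your own client wrapper on HTTP 429."""


def call_with_backoff(request, max_attempts=5, base_delay=1.0,
                      max_delay=30.0, sleep=time.sleep):
    """Call request(), retrying on RateLimited with exponential backoff.

    Waits a random time in [0, min(max_delay, base_delay * 2**attempt)]
    between attempts ("full jitter") and re-raises after the last attempt.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))


# Demo with a fake request that is rate limited twice, then succeeds.
state = {"calls": 0}


def fake_request():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RateLimited()
    return "hello"


# A no-op sleep keeps the demo instant; drop the argument for real waits.
print(call_with_backoff(fake_request, sleep=lambda seconds: None))  # hello
```

The `sleep` parameter exists only so the wait can be stubbed out in tests and demos.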

robertwt75d ago

what is that moodboard and chart of hypertension in the middle of the article that isn't explained?

This is a great step up in open models; however, the pricing to support z.ai is not much cheaper than a Claude / OpenAI subscription.

jayess5d ago

I asked z.ai what z.ai is, and it said "It seems you might be referring to xAI, as "z.ai" isn't a widely known or major AI company or platform at this time."

creamyhorror5d ago

It's a real step forward, getting closer to SOTA. It seems to be very epistemically cautious in its reasoning. I hope Deepseek and the other open-weights labs stay in the game and catch up too.

KaoruAoiShiho5d ago

This is really held back by one bench (omniscience accuracy) where it's very far behind; otherwise I think it would score at least a couple of points higher.

hit8run5d ago

Ok, it is nice to see another great open source model. Not sure what to think of all these benchmarks but GLM was already quite strong before so an update is very welcome.

Computer05d ago

Regrettably I haven’t tried 5.2 yet but 5.1 I did not see as anything special. In practice I found it to be ~70% as good as Claude sonnet.

PetrBrzyBrzek5d ago

I'm a bit shocked that GLM 5.2 is not multimodal. Like, how should I use it? I use images all the time.

Havoc5d ago

It’s pretty good. More talkative than 5.1. Reminds me of deepseek 4

Their servers are melting though - getting more timeouts etc

zftnb6665d ago

Open-weight models are winning. The gap with closed models is now measured in months, not years.

nh43215rgb5d ago

> GLM-5.2 sits off the most attractive quadrant on the Intelligence vs Output Tokens chart.

That is unfortunate...

blt5d ago

There's only one GLM in my heart: the one that includes vec3.hpp

casey23d ago

Mark my words, by the end of 2027, there will be an open weights model that is better than anything OpenAI and Anthropic are capable of making. They will lose at inference scaling too.

hyqzz85d ago

It is a very useful model

catigula4d ago

Which American model did they distill this one from?

dsrtslnd235d ago

looks like I need a GB300 workstation
