GLM-4.7-Flash (opens in new tab)

(huggingface.co)

378 pointsscrlk5mo ago135 comments

135 comments

99 comments · 26 top-level

vessenes5mo ago· 16 in thread

Looks like solid incremental improvements. The UI oneshot demos are a big improvement over 4.6. Open models continue to lag roughly a year on benchmarks; pretty exciting over the long term. As always, GLM is really big - 355B parameters with 31B active, so it’s a tough one to self-host. It’s a good candidate for a cerebras endpoint in my mind - getting sonnet 4.x (x<5) quality with ultra low latency seems appealing.

HumanOstrich5mo ago

I tried Cerebras with GLM-4.7 (not Flash) yesterday using paid API credits ($10). They have rate limits per-minute and it counts cached tokens against it so you'll get limited in the first few seconds of every minute, then you have to wait the rest of the minute. So they're "fast" at 1000 tok/sec - but not really for practical usage. You effectively get <50 tok/sec with rate limits and being penalized for cached tokens.

They also charge full price for the same cached tokens on every request/response, so I burned through $4 for 1 relatively simple coding task - would've cost <$0.50 using GPT-5.2-Codex or any other model besides Opus and maybe Sonnet that supports caching. And it would've been much faster.

twalla5mo ago

I hope cerebras figures out a way to be worth the premium - seeing two pages of written content output in the literal blink of an eye is magical.

mlyle5mo ago

The pay-per-use API sucks. If you end up on the $50/mo plan, it's better, with caveats:

1 million tokens per minute, 24 million tokens per day. BUT: cached tokens count full, so if you have 100,000 tokens of context you can burn a minute of tokens in a few requests.

2 more replies

Miraste5mo ago

I wonder why they chose per minute? That method of rate limiting would seem to defeat their entire value proposition.

1 more reply

cmrdporcupine5mo ago

I use GLM 4.7 with DeepInfra.com and it's extremely reasonable, though maybe a bit on the slower side. But faster than DeepSeek 3.2 and about the same quality.

It's even cheaper to just use it through z.ai themselves I think.

Imustaskforhelp5mo ago

I know this might not be the most effective use case but I had ended up using the try AI feature in cerebras which opens up a window in browser

Yes, it has some restrictions as well but it still works for free. I have a private repository where I ended up creating a puppeteer instance where I can just input something in a cli and then get output in cli back as well.

With current agents. I don't see how I cannot just expand that with a cheap model like (think minimax2.1 is pretty good for agents) and get the agent to write the files and do the things and a loop.

I think the repository might have gotten deleted after I resetted my old system or similar but I can look out for it if this interests you.

Cerebras is such a good company. I talked to their CEO on discord once and have following it for >1-2 years now. I hope that they don't get enshittified with openAI deal recently & they improve their developer experience because people wish to pay them but now I had to do a shenanigan which was for free (but also its just that I was curious about how puppeteer works so I wanted to find if such idea was possible itself or not & I really didn't use it that much after building it)

pseudony5mo ago

I hear this said, but never substantiated. Indeed, I think our big issue right now is making actual benchmarks relevant to our own workloads.

Due to US foreign policy, I quit claude yesterday and picked up minimax m2.1 We wrote a whole design spec for a project I’ve previously written a spec for with claude (but some changes to architecture this time, adjacent, not same).

My gut feel ? I prefer minimax m2.1 with open code to claude. Easiest boycot ever.

(I even picked the 10usd plan, it was fine for now).

Workaccount25mo ago

Unless one of the open model labs has a breakthrough, they will always lag. Their main trick is distilling the SOTA models.

People talk about these models like they are "catching up", they don't see that they are just trailers hooked up to a truck, pulling them along.

runako5mo ago

FWIW this is what Linux and the early open-source databases (e.g. PostgreSQL and MySQL) did.

They usually lagged for large sets of users: Linux was not as advanced as Solaris, PostgreSQL lacked important features contained in Oracle. The practical effect of this is that it puts the proprietary implementation on a treadmill of improvement where there are two likely outcomes: 1) the rate of improvement slows enough to let the OSS catch up or 2) improvement continues, but smaller subsets of people need the further improvements so the OSS becomes "good enough." (This is similar to how most people now do not pay attention to CPU speeds because they got "fast enough" for most people well over a decade ago.)

1 more reply

irthomasthomas5mo ago

Deepseek 3.2 scores gold at IMO and others. Google had to use parallel reasoning to do that with gemini, and the public version still only achieves silver.

skrebbel5mo ago

How does this work? Do they buy lots of openai credits and then hit their api billions of times and somehow try to train on the results?

2 more replies

behnamoh5mo ago

> The UI oneshot demos are a big improvement over 4.6.

This is a terrible "test" of model quality. All these models fail when your UI is out of distribution; Codex gets close but still fails.

mckirk5mo ago

Note that this is the Flash variant, which is only 31B parameters in total.

And yet, in terms of coding performance (at least as measured by SWE-Bench Verified), it seems to be roughly on par with o3/GPT-5 mini, which would be pretty impressive if it translated to real-world usage, for something you can realistically run at home.

1 more reply

ttoinou5mo ago

Sonnet was already very good a year ago, do open weights model right are as good ?

jasonjmcghee5mo ago

Fwiw Sonnet 4.5 is very far ahead of where sonnet was a year ago

cmrdporcupine5mo ago

From my experience, Kimi K2, GLM 4.7 (not flash, full), Mistral Large 3, and DeepSeek are all about Sonnet 4 level. I prefer GLM of the bunch.

If you were happy with Claude at its Sonnet 3.7 & 4 levels 6 months ago, you'll be fine with them as a substitute.

But they're nowhere near Opus 4.5

epolanski5mo ago· 8 in thread

Any cloud vendor offering this model? I would like to try it.

PhilippGille5mo ago

z.ai itself, or Novita fow now, but others will follow soon probably

https://openrouter.ai/z-ai/glm-4.7-flash/providers

sdrinf5mo ago

Note: I strongly recommend against using Novita -their main gig is serving quantized versions of the model to offer it for cheaper / at better latency; but if you ran an eval against other providers vs novita, you can spot the quality degradation. This is nowhere marked, or displayed in their offering.

Tolerating this is very bad form from openrouter, as they default-select lowest price -meaning people who just jump into using openrouter and do not know about this fuckery get facepalm'd by perceived model quality.

epolanski5mo ago

Interesting, it costs less than a tenth than Haiku.

1 more reply

latchkey5mo ago

We don't have lot of GPUs available right now, but it is not crazy hard to get it running on our MI300x. Depending on your quant, you probably want a 4x.

ssh admin.hotaisle.app

Yes, this should be made easier to just get a VM with it pre-installed. Working on that.

omneity5mo ago

Unless using docker, if vllm is not provided and built against ROCm dependencies it’s going to be time consuming.

It took me quite some time to figure the magic combination of versions and commits, and to build each dependency successfully to run on an MI325x.

1 more reply

dvs135mo ago

https://huggingface.co/inference/models?model=zai-org%2FGLM-... :)

xena5mo ago

The model literally came out less than a couple hours ago, it's going to take people a while in order to tool it for their inference platforms.

idiliv5mo ago

Sometimes model developers coordinate with inference platforms to time releases in sync.

dajonker5mo ago· 6 in thread

Great, I've been experimenting with OpenCode and running local 30B-A3B models on llama.cpp (4 bit) on a 32 GB GPU so there's plenty of VRAM left for 128k context. So far Qwen3-coder gives the me best results. Nemotron 3 Nano is supposed to benchmark better but it doesn't really show for the kind of work I throw at it, mostly "write tests for this and that method which are not covered yet". Will give this a try once someone has quantized it in ~4 bit GGUF.

Codex is notably higher quality but also has me waiting forever. Hopefully these small models get better and better, not just at benchmarks.

dajonker5mo ago

Update: I'm experiencing issues with OpenCode and this model. I have built the latest llama.cpp and followed the Unsloth guide, but it's not usable at the moment because of:

- Tool calling doesn't work properly with OpenCode

- It repeats itself very quickly. This is addressed in the Unsloth guide and can be "fixed" by setting --dry-multiplier to 1.1 or higher

- It makes a lot of spelling errors such as replacing class/file name characters with "1". Or when I asked it to check AGENTS.md it tried to open AGANTS.md

I tried both the Q4_K_XL and Q5_K_XL quantizations and they both suffer from these issues.

eblanshey5mo ago

There is a new update on HF:

> Jan 21 update: llama.cpp fixed a bug that caused looping and poor outputs. We updated the GGUFs - please re-download the model for much better outputs.

2 more replies

latchkey5mo ago

https://huggingface.co/unsloth/GLM-4.7-GGUF

This user has also done a bunch of good quants:

https://huggingface.co/0xSero

WanderPanda5mo ago

I find it hard to trust post training quantizations. Why don't they run benchmarks to see the degradation in performance? It sketches me out because it should be the easiest thing to automatically run a suite of benchmarks

1 more reply

dajonker5mo ago

Yes I usually run Unsloth models, however you are linking to the big model now (355B-A32B), which I can't run on my consumer hardware.

The flash model in this thread is more than 10x smaller (30B).

2 more replies

behnamoh5mo ago

> Codex is notably higher quality but also has me waiting forever.

And while it usually leads to higher quality output, sometimes it doesn't, and I'm left with a bs AI slop that would have taken Opus just a couple of minutes to generate anyway.

baranmelik5mo ago· 6 in thread

For anyone who’s already running this locally: what’s the simplest setup right now (tooling + quant format)? If you have a working command, would love to see it.

johndough5mo ago

I've been running it with llama-server from llama.cpp (compiled for CUDA backend, but there are also prebuilt binaries and instructions for other backends in the README) using the Q4_K_M quant from ngxson on Lubuntu with an RTX 3090:

https://github.com/ggml-org/llama.cpp/releases

https://huggingface.co/ngxson/GLM-4.7-Flash-GGUF/blob/main/G...

https://github.com/ggml-org/llama.cpp?tab=readme-ov-file#sup...

    llama-server -ngl 999 --ctx-size 32768 -m GLM-4.7-Flash-Q4_K_M.gguf

You can then chat with it at http://127.0.0.1:8080 or use the OpenAI-compatible API at http://127.0.0.1:8080/v1/chat/completions

Seems to work okay, but there usually are subtle bugs in the implementation or chat template when a new model is released, so it might be worthwhile to update both model and server in a few days.

mistercheph5mo ago

I think the recently introduced -fit option which is on by default means it's no longer necesary to -ngl, can also probably drop -c which is "0" by default and reads metadata from the gguf to get the model's advertised context size

1 more reply

ljouhet5mo ago

Something like

    ollama run hf.co/ngxson/GLM-4.7-Flash-GGUF:Q4_K_M

It's really fast! But, for now it outputs garbage because there is no (good) template. So I'll wait for a model/template on ollama.com

jmorgan5mo ago

It's available (with tool parsing, etc.): https://ollama.com/library/glm-4.7-flash but requires 0.14.3 which is in pre-release (and available on Ollama's GitHub repo)

zackify5mo ago

LM Studio Search for 4.7-flash and install from mlx community

pixelmelt5mo ago

I would look into running a 4 bit quant using llama cpp (or any of its wrappers)

polyrand5mo ago· 5 in thread

I've been using z.ai models through their coding plan (incredible price/performance ratio), and since GLM-4.7 I'm even more confident with the results it gives me. I use it both with regular claude-code and opencode (more opencode lately, since claude-code is obviously designed to work much better with Anthropic models).

Also notice that this is the "-Flash" version. They were previously at 4.5-Flash (they skipped 4.6-Flash). This is supposed to be equivalent to Haiku. Even on their coding plan docs, they mention this model is supposed to be used for `ANTHROPIC_DEFAULT_HAIKU_MODEL`.

RickHull5mo ago

Same, I got 12 months of subscription for $28 total (promo offer), with 5x the usage limits of the $20/month Claude Pro plan. I have only used it with claude code so far.

theshrike795mo ago

This offer was so stupid cheap there was no point in NOT getting :D

stogot5mo ago

Do they still have that promo offer?

2 more replies

victorbjorklund5mo ago

How has the performance been lately? I heard some people say that they change their limits likely making it almost not useable

chewz5mo ago

Never had any problems with Z.ai models.

However they are using more thinking internally and that makes them seem slow.

montroser5mo ago· 5 in thread

> SWE-bench Verified 59.2

This seems pretty darn good for a 30B model. That's significantly better than the full Qwen3-Coder 480B model at 55.4.

achierius5mo ago

I think most have moved past SWE-Bench Verified as a benchmark worth tracking -- it only tracks a few repos, contains only a small number of languages, and probably more importantly papers have come out showing a significant degree of memorization in current models, e.g. models knowing the filepath of the file containing the bug when prompted only with the issue description and without having access to the actual filesystem. SWE-Bench Pro seems much more promising though doesn't avoid all of the problems with the above.

robbies5mo ago

What do you like to use instead? I’ve used the aider leaderboard a couple times, but it didn’t really stick with me

2 more replies

primaprashant5mo ago

You should check out Devstral 2 Small [1]. It's 24B and scores 68.0% on SWE-bench Verified.

[1]: https://mistral.ai/news/devstral-2-vibe-cli

Palmik5mo ago

To be clear, GLM 4.7 Flash is MoE with 30B total params but <4B active params. While Devstral Small is 24B dense (all params active, all the time). GLM 4.7 Flash is much much cheaper, inference wise.

dajonker5mo ago

I don't know whether it just doesn't work well in GGUF / llama.cpp + OpenCode but I can't get anything useful out of Devstal 2 24B running locally. Probably a skill issue on my end, but I'm not very impressed. Benchmarks are nice but they don't always translate to real life usefulness.

veselin5mo ago· 4 in thread

What is the state of using quants? For chat models, a few errors or lost intelligence may matter a little. But what is happening to tool calling in coding agents? Does it fail catastrophically after a few steps in the agent?

I am interesting if I can run it on a 24GB RTX 4090.

Also, would vllm be a good option?

tgtweak5mo ago

I like the byteshape quantizations - they are dynamic variable quantization weights that are tuned for quality vs overall size. They seem to make less errors at lower "average" quantizations than the unsloth 4 bit quants. I think this is similar to variable bitrate video compression where you can keep higher bits where it helps overall model accuracy.

Should be able to run this in 22GB vram so your 4090 (and a 3090) would be safe. This model also uses MLA so you can run pretty large context windows without eating up a ton of extra vram.

edit: 19GB vram for a Q4_K_M - MLX4 is around 21GB so you should be clear to run a lower quant version on the 4090. Full BF16 is close to 60GB so probably not viable.

omgwin5mo ago

It's been mentioned that this model is MLA capable, but it seems like the default vLLM params don't use MLA. Seeing ~0.91MB KV Footprint per token right now. Are you getting MLA to work?

regularfry5mo ago

It's in the ollama library at q4_K_M, which doesn't quite fit on my 4090 with the default context length. But it only offloads 8 layers to the CPU for me. I'm getting usable enough token rates. That's probably the easiest way to get it. Not tried it with vllm but if it proves good enough to stick with then I might give it a try.

regularfry5mo ago

Oh, and on agents: I did give it a go in opencode last night and it seemed to get a bit stuck but I think I probably pushed it too far. I asked it to explain TinyRecursiveModels and pointed it at the git repo URL. It got very confused by the returned HTML and went into a loop. But actually getting to the point of getting content back from a tool call? Absolutely fine.

I'm thinking of giving it a go with aider, but using something like gemma3:27b as the architect. I don't think you can have different models for different skills in opencode, but with smaller local models I suspect it's unavoidable for now.

twelvechess5mo ago· 4 in thread

Excited to test this out. We need a SOTA 8B model bad though!

cipehr5mo ago

Is essentialai/rnj-1 not the latest attempt at that?

https://huggingface.co/EssentialAI/rnj-1

metalliqaz5mo ago

I tried this model and if I recall correctly it was horribly over-trained on Python test questions to the point that if you asked for C code it would say something like "you asked for C code but specified answer must be in Python , so here is the Python ", even though I never once mentioned Python.

piyh5mo ago

https://docs.mistral.ai/models/ministral-3-8b-25-12

twelvechess5mo ago

thanks I will try this out

jcuenod5mo ago· 3 in thread

Comparison to GPT-OSS-20B (irrespective of how you feel that model actually performs) doesn't fill me with confidence. Given GLM 4.7 seems like it could be competitive with Sonnet 4/4.5, I would have hoped that their flash model would run circles around GPT-OSS-120B. I do wish they would provide an Aider result for comparison. Aider may be saturated among SotA models, but it's not at this size.

syntaxing5mo ago

Hoping a 30-A3B runs circles around a 117-A5.1B is a bit hopeful thinking, especially when you’re testing embedded knowledge. From the numbers, I think this model excels at agent calls compared to GPT-20B. The rest are about the same in terms of performance

victorbjorklund5mo ago

The benchmarks lie. I've been using using glm 4.7 and it's pretty okay with simple tasks but it's nowhere even near Sonnet. Still useful and good value but it's not even close.

unsupp0rted5mo ago

> Given GLM 4.7 seems like it could be competitive with Sonnet 4/4.5

Not for code. The quality is so low, it's roughly on par with Sonnet 3.5

XCSme5mo ago· 3 in thread

Seems to be marginally better than gpt-20b, but this is 30b?

strangescript5mo ago

I find gpt-oss 20b very benchmaxxed and as soon as a solution isn't clear it will hallucinate.

blurbleblurble5mo ago

Every time I've tried to actually use gpt-oss 20b it's just gotten stuck in weird feedback loops reminiscent of the time when HAL got shut down back in the year 2001. And these are very simple tests e.g. I try and get it to check today's date from the time tool to get more recent search results from the arxiv tool.

lostmsu5mo ago

It actually seems worse. gpt-20b is only 11 GB because it is prequantized in mxfp4. GLM-4.7-Flash is 62 GB. In that sense GLM is closer to and actually is slightly larger than gpt-120b which is 59 GB.

Also, according to the gpt-oss model card 20b is 60.7 (GLM claims they got 34 for that model) and 120b is 62.7 on SWE-Bench Verified vs GLM reports 59.7

linolevan5mo ago· 2 in thread

Tried it within LMStudio on my m4 macbook pro – it feels dramatically worse than gpt-oss-20b. Of the two (code) prompts I've tried so far, it started spitting out invalid code and got stuck in a repeating loop for both. It's possible that LMStudio quantizes the model in such a manner that it explodes, but so far not a great first impression.

tgtweak5mo ago

Are you using the full BF16 model or a quantized mlx4?

linolevan5mo ago

Not sure what the default is – whatever that was. It's probably the quantized mlx4 if I had to guess.

infocollector5mo ago· 2 in thread

Maybe someone here has tackled this before. I’m trying to connect Antigravity or Cursor with GLM/Qwen coding models, but haven’t had any luck so far. I can easily run Open-WebUI + LLaMA on my 5090 Ubuntu box without issues. However, when I try to point Antigravity or Cursor to those models, they don’t seem to recognize or access them. Has anyone successfully set this up?

yowlingcat5mo ago

I don't believe Antigravity or Cursor work well with pluggable models. It seems to be impossible with Antigravity and with Cursor while you can change the OAI compatible API endpoint to one of your choice, not all features may work as expected.

My recommendation would be to use other tools built to support pluggable model backends better. If you're looking for a Claude Code alternative, I've been liking OpenCode so far lately, and if you're looking for a Cursor alternative, I've heard great things about Roo/Cline/KiloCode although I personally still just use Continue out of habit.

foobar100005mo ago

Claude code router

arbuge5mo ago· 2 in thread

Perhaps somebody more familiar with HF can explain this to me... I'm not too sure what's going on here:

https://huggingface.co/inference/models?model=zai-org%2FGLM-...

Mattwmaster585mo ago

I assume you're talking about 50t/s? My guess is that providers are poorly managing resources.

Slow inference is also present on z.ai, eyeballing it the 4.7 flash model was twice as slow as regular 4.7 right now.

arbuge5mo ago

None of it makes much sense. The model labelled as fastest has much higher latency. The one labelled as cheapest costs something, whereas the other one appears to be free (price is blank). Context on that one is blank and also unclear.

karmakaze5mo ago· 2 in thread

Not much info than being a 31B model. Here's info on GLM-4.7[0] in general.

I suppose Flash is merely a distillation of that. Filed under mildly interesting for now.

[0] https://z.ai/blog/glm-4.7

lordofgibbons5mo ago

How interesting it is depends purely on your use-case. For me this is the perfect size for running fine-tuning experiments.

redrove5mo ago

A3.9B MoE apparently

bilsbie5mo ago· 1 in thread

What’s the significance of this for someone out of the loop?

epolanski5mo ago

You can run gpt 5 mini level ai on your MacBook with 32 gb ram.

You can get LLM as a service for cheaper.

E.g. This model costs less than a tenth of Haiku 4.5.

montroser5mo ago· 1 in thread

This is their blurb about the release:

    We’ve launched GLM-4.7-Flash, a lightweight and efficient model designed as the free-tier version of GLM-4.7, delivering strong performance across coding, reasoning, and generative tasks with low latency and high throughput.

    The update brings competitive coding capabilities at its scale, offering best-in-class general abilities in writing, translation, long-form content, role play, and aesthetic outputs for high-frequency and real-time use cases.

https://docs.z.ai/release-notes/new-released

z25mo ago

The two notes from this year are accidentally marked as 2025, the website posts may actually be hand-crafted.

esafak5mo ago· 1 in thread

When I want fast I reach for Gemini, or Cerebras: https://www.cerebras.ai/blog/glm-4-7

GLM 4.7 is good enough to be a daily driver but it does frustrate me at times with poor instruction following.

mgambati5mo ago

Good instruction following is the number one reason for me that makes opus 4.5 so good. Hope next release improve this.

kylehotchkiss5mo ago· 1 in thread

What's the minimum hardware you need to run this at a reasonable speed?

My Mac Mini probably isn't up for the task, but in the future I might be interested in a Mac Studio just to churn at long-running data enrichment types of projects

metalliqaz5mo ago

I haven't tried it, but just based on the size (30B-A3B), you probably can get by with 32GB RAM and 8GB VRAM.

aziis985mo ago· 1 in thread

I hope we get to good A1B models as I'm currently GPU poor and can only do inference on CPU for now

yowlingcat5mo ago

It may be worth taking a look at LFM [1]. I haven't had the need to use it so far (running on Apple silicon on a day to day basis so my dailies are usually the 30B+ MoEs) but I've heard good things from the internet from folks using it as a daily on their phones. YMMV.

[1] https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct

cmrdporcupine5mo ago

On my ASUS GB10 (like NVIDIA Spark) with Q8_0 quantization, prompt to write a fibonacci in Scala:

HEAD of ollama with Q8_0 vs vLLM with BF16 and FP8 after.

BF16 predictably bad. Surprised FP8 performed so poorly, but I might not have things tuned that well. New at this.

  ┌─────────┬───────────┬──────────┬───────────┐
  │         │ vLLM BF16 │ vLLM FP8 │ Ollama Q8 │
  ├─────────┼───────────┼──────────┼───────────┤
  │ Tok/sec │ 13-17     │ 11-19    │ 32        │
  ├─────────┼───────────┼──────────┼───────────┤
  │ Memory  │ ~62GB     │ ~28GB    │ ~32GB     │
  └─────────┴───────────┴──────────┴───────────┘

Most importantly, it actually worked nice in opencode, which I couldn't get Nemotron to do.

1 more reply

river_otter5mo ago

Excited to finally be able to give this a try today. I'm documenting my experience using aoe + OpenCode + LM Studio + GLM-4.7 Flash + Mac Mini M4Pro 64GB Mem on this thread if anyone wants to follow along and or give me advice about how badly I'm messing up the settings

https://x.com/natebrake/status/2013978241573204246

Thus far, the 6-bit quant MLX weights were too much and crashed LMS with OOM

dfajgljsldkjag5mo ago

Interesting they are releasing a tiny (30B) variant, unlike the 4.5-air distill which was 106B parameters. It must be competing with gpt mini and nano models, which personally I have found to be pretty weak. But this could be perfect for local LLM use cases.

In my ime small tier models are good for simple tasks like translation and trivia answering, but are useless for anything more complex. 70B class and above is where models really start to shine.

syntaxing5mo ago

I find GLM models so good. Better than Qwen IMO. I wish they released a new GLM air so I can run on my framework desktop

eurekin5mo ago

I'm trying to run it, but getting odd errors. Has anybody managed to run it locally and can share the command?

andhuman5mo ago

Gave it four of my vibe questions around general knowledge and it didn’t do great. Maybe expected with a model as small as this one. Once support in llama.cpp is out I will take it for a spin.

pixelmelt5mo ago

I'm glad they're still releasing models dispite going public

j / k navigate · click thread line to collapse

135 comments

99 comments · 26 top-level

vessenes5mo ago· 16 in thread

HumanOstrich5mo ago

twalla5mo ago

I hope cerebras figures out a way to be worth the premium - seeing two pages of written content output in the literal blink of an eye is magical.

mlyle5mo ago

The pay-per-use API sucks. If you end up on the $50/mo plan, it's better, with caveats:

1 million tokens per minute, 24 million tokens per day. BUT: cached tokens count full, so if you have 100,000 tokens of context you can burn a minute of tokens in a few requests.

2 more replies

Miraste5mo ago

I wonder why they chose per minute? That method of rate limiting would seem to defeat their entire value proposition.

1 more reply

cmrdporcupine5mo ago

I use GLM 4.7 with DeepInfra.com and it's extremely reasonable, though maybe a bit on the slower side. But faster than DeepSeek 3.2 and about the same quality.

It's even cheaper to just use it through z.ai themselves I think.

Imustaskforhelp5mo ago

I know this might not be the most effective use case but I had ended up using the try AI feature in cerebras which opens up a window in browser

With current agents. I don't see how I cannot just expand that with a cheap model like (think minimax2.1 is pretty good for agents) and get the agent to write the files and do the things and a loop.

I think the repository might have gotten deleted after I resetted my old system or similar but I can look out for it if this interests you.

pseudony5mo ago

I hear this said, but never substantiated. Indeed, I think our big issue right now is making actual benchmarks relevant to our own workloads.

My gut feel ? I prefer minimax m2.1 with open code to claude. Easiest boycot ever.

(I even picked the 10usd plan, it was fine for now).

Workaccount25mo ago

Unless one of the open model labs has a breakthrough, they will always lag. Their main trick is distilling the SOTA models.

People talk about these models like they are "catching up", they don't see that they are just trailers hooked up to a truck, pulling them along.

runako5mo ago

FWIW this is what Linux and the early open-source databases (e.g. PostgreSQL and MySQL) did.

1 more reply

irthomasthomas5mo ago

Deepseek 3.2 scores gold at IMO and others. Google had to use parallel reasoning to do that with gemini, and the public version still only achieves silver.

skrebbel5mo ago

How does this work? Do they buy lots of openai credits and then hit their api billions of times and somehow try to train on the results?

2 more replies

behnamoh5mo ago

> The UI oneshot demos are a big improvement over 4.6.

This is a terrible "test" of model quality. All these models fail when your UI is out of distribution; Codex gets close but still fails.

mckirk5mo ago

Note that this is the Flash variant, which is only 31B parameters in total.

1 more reply

ttoinou5mo ago

Sonnet was already very good a year ago, do open weights model right are as good ?

jasonjmcghee5mo ago

Fwiw Sonnet 4.5 is very far ahead of where sonnet was a year ago

cmrdporcupine5mo ago

From my experience, Kimi K2, GLM 4.7 (not flash, full), Mistral Large 3, and DeepSeek are all about Sonnet 4 level. I prefer GLM of the bunch.

If you were happy with Claude at its Sonnet 3.7 & 4 levels 6 months ago, you'll be fine with them as a substitute.

But they're nowhere near Opus 4.5

epolanski5mo ago· 8 in thread

Any cloud vendor offering this model? I would like to try it.

PhilippGille5mo ago

z.ai itself, or Novita fow now, but others will follow soon probably

https://openrouter.ai/z-ai/glm-4.7-flash/providers

sdrinf5mo ago

epolanski5mo ago

Interesting, it costs less than a tenth than Haiku.

1 more reply

latchkey5mo ago

We don't have lot of GPUs available right now, but it is not crazy hard to get it running on our MI300x. Depending on your quant, you probably want a 4x.

ssh admin.hotaisle.app

Yes, this should be made easier to just get a VM with it pre-installed. Working on that.

omneity5mo ago

Unless using docker, if vllm is not provided and built against ROCm dependencies it’s going to be time consuming.

It took me quite some time to figure the magic combination of versions and commits, and to build each dependency successfully to run on an MI325x.

1 more reply

dvs135mo ago

https://huggingface.co/inference/models?model=zai-org%2FGLM-... :)

xena5mo ago

The model literally came out less than a couple hours ago, it's going to take people a while in order to tool it for their inference platforms.

idiliv5mo ago

Sometimes model developers coordinate with inference platforms to time releases in sync.

dajonker5mo ago· 6 in thread

Codex is notably higher quality but also has me waiting forever. Hopefully these small models get better and better, not just at benchmarks.

dajonker5mo ago

Update: I'm experiencing issues with OpenCode and this model. I have built the latest llama.cpp and followed the Unsloth guide, but it's not usable at the moment because of:

- Tool calling doesn't work properly with OpenCode

- It repeats itself very quickly. This is addressed in the Unsloth guide and can be "fixed" by setting --dry-multiplier to 1.1 or higher

- It makes a lot of spelling errors such as replacing class/file name characters with "1". Or when I asked it to check AGENTS.md it tried to open AGANTS.md

I tried both the Q4_K_XL and Q5_K_XL quantizations and they both suffer from these issues.

eblanshey5mo ago

There is a new update on HF:

> Jan 21 update: llama.cpp fixed a bug that caused looping and poor outputs. We updated the GGUFs - please re-download the model for much better outputs.

2 more replies

latchkey5mo ago

https://huggingface.co/unsloth/GLM-4.7-GGUF

This user has also done a bunch of good quants:

https://huggingface.co/0xSero

WanderPanda5mo ago

1 more reply

dajonker5mo ago

Yes I usually run Unsloth models, however you are linking to the big model now (355B-A32B), which I can't run on my consumer hardware.

The flash model in this thread is more than 10x smaller (30B).

2 more replies

behnamoh5mo ago

> Codex is notably higher quality but also has me waiting forever.

And while it usually leads to higher quality output, sometimes it doesn't, and I'm left with a bs AI slop that would have taken Opus just a couple of minutes to generate anyway.

baranmelik5mo ago· 6 in thread

For anyone who’s already running this locally: what’s the simplest setup right now (tooling + quant format)? If you have a working command, would love to see it.

johndough5mo ago

https://github.com/ggml-org/llama.cpp/releases

https://huggingface.co/ngxson/GLM-4.7-Flash-GGUF/blob/main/G...

https://github.com/ggml-org/llama.cpp?tab=readme-ov-file#sup...

    llama-server -ngl 999 --ctx-size 32768 -m GLM-4.7-Flash-Q4_K_M.gguf

You can then chat with it at http://127.0.0.1:8080 or use the OpenAI-compatible API at http://127.0.0.1:8080/v1/chat/completions

Seems to work okay, but there usually are subtle bugs in the implementation or chat template when a new model is released, so it might be worthwhile to update both model and server in a few days.

mistercheph5mo ago

1 more reply

ljouhet5mo ago

Something like

    ollama run hf.co/ngxson/GLM-4.7-Flash-GGUF:Q4_K_M

It's really fast! But, for now it outputs garbage because there is no (good) template. So I'll wait for a model/template on ollama.com

jmorgan5mo ago

It's available (with tool parsing, etc.): https://ollama.com/library/glm-4.7-flash but requires 0.14.3 which is in pre-release (and available on Ollama's GitHub repo)

zackify5mo ago

LM Studio Search for 4.7-flash and install from mlx community

pixelmelt5mo ago

I would look into running a 4 bit quant using llama cpp (or any of its wrappers)

polyrand5mo ago· 5 in thread

RickHull5mo ago

Same, I got 12 months of subscription for $28 total (promo offer), with 5x the usage limits of the $20/month Claude Pro plan. I have only used it with claude code so far.

theshrike795mo ago

This offer was so stupid cheap there was no point in NOT getting :D

stogot5mo ago

Do they still have that promo offer?

2 more replies

victorbjorklund5mo ago

How has the performance been lately? I heard some people say that they change their limits likely making it almost not useable

chewz5mo ago

Never had any problems with Z.ai models.

However they are using more thinking internally and that makes them seem slow.

montroser5mo ago· 5 in thread

> SWE-bench Verified 59.2

This seems pretty darn good for a 30B model. That's significantly better than the full Qwen3-Coder 480B model at 55.4.

achierius5mo ago

robbies5mo ago

What do you like to use instead? I’ve used the aider leaderboard a couple times, but it didn’t really stick with me

2 more replies

primaprashant5mo ago

You should check out Devstral 2 Small [1]. It's 24B and scores 68.0% on SWE-bench Verified.

[1]: https://mistral.ai/news/devstral-2-vibe-cli

Palmik5mo ago

To be clear, GLM 4.7 Flash is MoE with 30B total params but <4B active params. While Devstral Small is 24B dense (all params active, all the time). GLM 4.7 Flash is much much cheaper, inference wise.

dajonker5mo ago

veselin5mo ago· 4 in thread

I am interesting if I can run it on a 24GB RTX 4090.

Also, would vllm be a good option?

tgtweak5mo ago

Should be able to run this in 22GB vram so your 4090 (and a 3090) would be safe. This model also uses MLA so you can run pretty large context windows without eating up a ton of extra vram.

edit: 19GB vram for a Q4_K_M - MLX4 is around 21GB so you should be clear to run a lower quant version on the 4090. Full BF16 is close to 60GB so probably not viable.

omgwin5mo ago

It's been mentioned that this model is MLA capable, but it seems like the default vLLM params don't use MLA. Seeing ~0.91MB KV Footprint per token right now. Are you getting MLA to work?

regularfry5mo ago

twelvechess5mo ago· 4 in thread

Excited to test this out. We need a SOTA 8B model bad though!

cipehr5mo ago

Is essentialai/rnj-1 not the latest attempt at that?

https://huggingface.co/EssentialAI/rnj-1

metalliqaz5mo ago

piyh5mo ago

https://docs.mistral.ai/models/ministral-3-8b-25-12

twelvechess5mo ago

thanks I will try this out

jcuenod5mo ago· 3 in thread

syntaxing5mo ago

victorbjorklund5mo ago

The benchmarks lie. I've been using using glm 4.7 and it's pretty okay with simple tasks but it's nowhere even near Sonnet. Still useful and good value but it's not even close.

unsupp0rted5mo ago

> Given GLM 4.7 seems like it could be competitive with Sonnet 4/4.5

Not for code. The quality is so low, it's roughly on par with Sonnet 3.5

XCSme5mo ago· 3 in thread

Seems to be marginally better than gpt-20b, but this is 30b?

strangescript5mo ago

I find gpt-oss 20b very benchmaxxed and as soon as a solution isn't clear it will hallucinate.

blurbleblurble5mo ago

lostmsu5mo ago

Also, according to the gpt-oss model card 20b is 60.7 (GLM claims they got 34 for that model) and 120b is 62.7 on SWE-Bench Verified vs GLM reports 59.7

linolevan5mo ago· 2 in thread

tgtweak5mo ago

Are you using the full BF16 model or a quantized mlx4?

linolevan5mo ago

Not sure what the default is – whatever that was. It's probably the quantized mlx4 if I had to guess.

infocollector5mo ago· 2 in thread

yowlingcat5mo ago

foobar100005mo ago

Claude code router

arbuge5mo ago· 2 in thread

Perhaps somebody more familiar with HF can explain this to me... I'm not too sure what's going on here:

https://huggingface.co/inference/models?model=zai-org%2FGLM-...

Mattwmaster585mo ago

I assume you're talking about 50t/s? My guess is that providers are poorly managing resources.

Slow inference is also present on z.ai, eyeballing it the 4.7 flash model was twice as slow as regular 4.7 right now.

arbuge5mo ago

karmakaze5mo ago· 2 in thread

Not much info than being a 31B model. Here's info on GLM-4.7[0] in general.

I suppose Flash is merely a distillation of that. Filed under mildly interesting for now.

[0] https://z.ai/blog/glm-4.7

lordofgibbons5mo ago

How interesting it is depends purely on your use-case. For me this is the perfect size for running fine-tuning experiments.

redrove5mo ago

A3.9B MoE apparently

bilsbie5mo ago· 1 in thread

What’s the significance of this for someone out of the loop?

epolanski5mo ago

You can run gpt 5 mini level ai on your MacBook with 32 gb ram.

You can get LLM as a service for cheaper.

E.g. This model costs less than a tenth of Haiku 4.5.

montroser5mo ago· 1 in thread

This is their blurb about the release:

    We’ve launched GLM-4.7-Flash, a lightweight and efficient model designed as the free-tier version of GLM-4.7, delivering strong performance across coding, reasoning, and generative tasks with low latency and high throughput.

    The update brings competitive coding capabilities at its scale, offering best-in-class general abilities in writing, translation, long-form content, role play, and aesthetic outputs for high-frequency and real-time use cases.

https://docs.z.ai/release-notes/new-released

z25mo ago

The two notes from this year are accidentally marked as 2025, the website posts may actually be hand-crafted.

esafak5mo ago· 1 in thread

When I want fast I reach for Gemini, or Cerebras: https://www.cerebras.ai/blog/glm-4-7

GLM 4.7 is good enough to be a daily driver but it does frustrate me at times with poor instruction following.

mgambati5mo ago

Good instruction following is the number one reason for me that makes opus 4.5 so good. Hope next release improve this.

kylehotchkiss5mo ago· 1 in thread

What's the minimum hardware you need to run this at a reasonable speed?

My Mac Mini probably isn't up for the task, but in the future I might be interested in a Mac Studio just to churn at long-running data enrichment types of projects

metalliqaz5mo ago

I haven't tried it, but just based on the size (30B-A3B), you probably can get by with 32GB RAM and 8GB VRAM.

aziis985mo ago· 1 in thread

I hope we get to good A1B models as I'm currently GPU poor and can only do inference on CPU for now

yowlingcat5mo ago

[1] https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct

cmrdporcupine5mo ago

On my ASUS GB10 (like NVIDIA Spark) with Q8_0 quantization, prompt to write a fibonacci in Scala:

HEAD of ollama with Q8_0 vs vLLM with BF16 and FP8 after.

BF16 predictably bad. Surprised FP8 performed so poorly, but I might not have things tuned that well. New at this.

  ┌─────────┬───────────┬──────────┬───────────┐
  │         │ vLLM BF16 │ vLLM FP8 │ Ollama Q8 │
  ├─────────┼───────────┼──────────┼───────────┤
  │ Tok/sec │ 13-17     │ 11-19    │ 32        │
  ├─────────┼───────────┼──────────┼───────────┤
  │ Memory  │ ~62GB     │ ~28GB    │ ~32GB     │
  └─────────┴───────────┴──────────┴───────────┘

Most importantly, it actually worked nice in opencode, which I couldn't get Nemotron to do.

1 more reply

river_otter5mo ago

https://x.com/natebrake/status/2013978241573204246

Thus far, the 6-bit quant MLX weights were too much and crashed LMS with OOM

dfajgljsldkjag5mo ago

In my ime small tier models are good for simple tasks like translation and trivia answering, but are useless for anything more complex. 70B class and above is where models really start to shine.

syntaxing5mo ago

I find GLM models so good. Better than Qwen IMO. I wish they released a new GLM air so I can run on my framework desktop

eurekin5mo ago

I'm trying to run it, but getting odd errors. Has anybody managed to run it locally and can share the command?

andhuman5mo ago

Gave it four of my vibe questions around general knowledge and it didn’t do great. Maybe expected with a model as small as this one. Once support in llama.cpp is out I will take it for a spin.

pixelmelt5mo ago

I'm glad they're still releasing models dispite going public

j / k navigate · click thread line to collapse