Granite 4.1: IBM's 8B Model Matching 32B MoE (opens in new tab)

(firethering.com)

312 pointssteveharing110d ago202 comments

https://research.ibm.com/blog/granite-4-1-ai-foundation-mode...

202 comments

I test drove it yesterday. It's pretty impressive at 8b. Runs on commodity hardware quickly.

Qwen3.6 35b a3b is still my local champion but I may use this for auto complete and small tasks. Granite has recent training data which is nice. If the other small models got fine tuned on recent data I don't know if I would use this at all, but that alone makes it pretty decent.

The 4b they released was not good for my needs but could probably handle tool calls or something

vessenes10d ago

Have you tried the Gemma 4 series, out of curiosity? I haven’t run a local model in a while, but the benchmarks look good. I’d take a free local tool-use model if it was relatively consistent.

v3ss0n9d ago

Qwen 3.6 burns it to the ground. it was not even a challenge. Gemma4 seriously fails at toolcalls and agentic works. It got all messed up after 2-3 turns of Vibecoding.

xrd9d ago

How do you run it? vllm? llama.cpp?

Can you share some parameters you enable tool calling and agentic usage?

Or, higher level, some philosophies on what approaches you are using for tuning to get better tool calling and/or agentic usage?

I'm having surprisingly good success with unsloth/Qwen3.6-27B-GGUF:Q4_K_M (love unsloth guys) on my RTX3090/24GB using opencode as the orchestrator.

It concocts some misleading paths, but the code often compiles, and I consider that a victory.

You have to watch it like you would watch a 14 year old boy who says he is doing his homework but you hear the sound effects of explosions.

59nadir9d ago

Counter-point: I built an agent that can only interface with Kakoune, a much less common and more challenging situation for an LLM to find itself in, and Gemma4-A4B 8bit quantized does remarkably better in actually figuring out how to get text in buffers than Qwen3.6-35B-A3B in a similar class as Gemma4 A4B.

Now, is this the usual use case? No, it's a benchmark I created specifically in order to put LLMs in situations where they can't just blast out their bash commands without having to interface with something else and adapt.

2ndorderthought9d ago

Gemma4 is definitely not used for vibe/agentic coding. Not even worth trying. But its a different weight class.

lambda9d ago

Gemma 4 31b was working ok for me; but it was consuming tons of memory on SWA checkpoints, I had to turn them way down, and as a 31b dense model is fairly slow on a Strix Halo. I did have a lot of tool calling issues on 26b-a4b, though.

The Qwen models are quite solid though.

zkmon9d ago

I have tested Gemma4-26B against Qwen3.6-35B. Gemma beats Qwen on structured data extraction and instruction following. Gemma is far more precise than Qwen in these tasks, while Qwen gets a bit more creative, verbose, and imprecise. However Qwen has far more general smartness, high token throughput. Qwen could precisely pinpoint the issues in data quality and code, while Gemma had no clue. On the coding skills, Qwen appears to have edge over Gemma, but this could depend on the agent you use. For direct chat (llama_cpp UI), bot models show same skills for coding.

2ndorderthought10d ago

I tried the Gemma 4 I think 2 and 4b. The 2b was not useful for me at all. A little too weak for my use cases

The 4b was okay. It didn't get all of my small math questions right, it didn't know about some of the libraries I use, but it was able to do some basic auto complete type stuff. For microscopic models I like the llama 3.2 3b more right now for what I do, it's a little faster and seems a little stronger for what I do. But everyone is different and I don't think I'll use it anymore this past month has been crazy for local model releases.

throwaw129d ago

can you share your use cases for 2b and 4b models?

curious how people are leveraging these models

1 more reply

steveharing1OP10d ago

Yea, No doubt Qwen 3.6 open weights are far more strong

rnadomvirlabe10d ago

Why no doubt?

captainbland10d ago

No comparison with competitor models other than the previous granite version strongly implies that it does not compete well with other comparable models. At least this is the most reasonable assumption until data comes out to the contrary

2ndorderthought10d ago

Qwen 36 is effectively a pocket sized frontier model. It's really surprising for me anyway

steveharing1OP10d ago

Because Qwen 3.6 pushes way above its weight. Granite 8B is impressive, but Qwen still wins on raw capability, especially for coding.

3 more replies

cyanydeez9d ago

Qwen3-Coder-Next seems to be perfect sized for coding. I tried the new and just found the verbosity not really useful for coding. But probably for more analytical tasks or writing docs.

m3at9d ago

https://research.ibm.com/blog/granite-4-1-ai-foundation-mode...

Original article on IBM research

Hugging face weights: https://huggingface.co/collections/ibm-granite/granite-41-la...

cbg010d ago

The real "sleeper" might be https://huggingface.co/ibm-granite/granite-vision-4.1-4b if the benchmarks hold up for such a small model against frontier models for table & semantic k:v extraction.

uf00lme9d ago

Woah, is this part of the future of models? Basically little models you can use as tools.

tonyarkles9d ago

https://www.docling.ai/

I don’t know how many difference little models this uses under the hood, but I was shocked at how good it was at the couple document extraction tasks I threw it at.

2ndorderthought9d ago

It's looking like running your own mini ecosystem is the way of the future to me. No data centers, just a decent GPU 16-24gb of VRAM, CPU, and 32gb of RAM.

Lalabadie9d ago

This is Apple's bet, among others.

Training purpose-specific miniature models lets you have a lot of tasks you can run with high confidence on consumer hardware.

1 more reply

SecretDreams9d ago

Eventually we'll have models small enough to do a single thing really well and we'll call them functions.

hathym9d ago

True if you can write a function that summerize an article for example

cyanydeez9d ago

I'm pretty sure there's someone somewhere who'll create a proper harness that's equivalent to one giant model. The difficulty is mostly local hardware has lot of memory constraints. Targeting 128GB would seem to be the current sweet spot. If we could get out of the corporate market movers of buying up all the memory, we could maybe have more.

Regardless, the people in the 80s capable of pruning programs to fit on small devices is likely happening now. I'd bet most of the Chinese firms are doing it because of the US's silly GPU games among other constraints.

smj-edison9d ago

On the topic of local models, is there a good equivalent to something like Claude's chat interface? I've recently started transitioning to open models after getting fed up with Claude's usage limits (I'm not in a position to drop $200/month), and for coding tasks Kimi 2.6 has been about the same as Sonnet in my experience. The only thing I've found myself missing is a nice interface to ask it questions and have it help me with my math assignments.

0xbadcafebee9d ago

Yes but not exactly.

- A lot of people suggesting llama-server's web ui, but that requires you use local AI (llama.cpp), it's persisting content into your browser rather than the server (so you can lose your chats), and it doesn't support much functionality.

- There are some pure-browser chat interfaces that are like llama-server but you can use remote LLMs. This is closer to what you want, but everything is stored in the browser, so backup is harder.

- There's LocalAI, which is like the llama-server option, but more stuff is built in and it persists data to disk. It's flashy and very easy if all you want to do is local AI.

- There's LM Studio, which is another thing like LocalAI, but a desktop app.

- There's OpenWebUI, where it's like LocalAI, except you don't do local inference, you use remote LLMs. It sucks to be honest, just stops working a lot of the time, UX is terrible, lots of weird bugs.

- There's OpenHands, which is more like Codex/Claude Code web UI. You run it locally and connect to remote LLMs. Kinda clunky, limited, poor design. Like most coding agents, it doesn't support all the features you would want, like LocalAI/OpenWebUI do.

- There's OpenCode's web UI, which is like OpenHands, but less crappy.

- There's Jan, which is probably what you want. It's a desktop app rather than a web UI.

lostmsu9d ago

I started using https://github.com/milisp/codexia/ (which is a desktop app or a web server) that wraps your regular codex-cli or Claude Code CLI. So you can see Codex/Claude threads in your web UI and access it remotely. I love it because you can do Web UI or terminal and all conversations are preserved.

Unfortunately it is pretty buggy, so I am maintaining a fork matching my personal needs with bugfixes and a few extra features.

SwellJoe9d ago

Most of the common ways to run local LLMs include a chat interface. llama.cpp's `llama-server` stands up a chat interface on 8080, as well as an OpenAI compatible API. LM Studio is a desktop app with a chat interface and API, as well. unsloth Studio, too.

LM Studio is nice in that it makes it easy to add tools, like search. Qwen 3.6 is such a small model that it lacks a lot of knowledge of the world (so it can hallucinate at an uncomfortable rate, which is a common failure mode of very small models), but it can use tools, so being able to search lets it research before answering. It has pretty good reasoning and tool calling, so it's actually pretty effective. I've been comparing Gemma 4 (31B at 8-bits, also very good with tools and reasoning for its size), Qwen 3.6 (27B at 8-bits), against Claude Opus and Gemini Pro lately. And, obviously the frontiers are better, but most of the time, I find the tiny models are fine. I'm still not quite at the point where I'd be willing to code with local models, as the time wasted on hallucinations and logic bugs and sloppy coding practices are much higher, as is the cost of security bugs that make it past review.

Svoka9d ago

With Ollama* you can use Claude Code with `ollama launch claude`

* https://docs.ollama.com/integrations/claude-code

rglullis9d ago

Open WebUI or Jan (https://www.jan.ai/). Work well with Ollama.

camdv9d ago

Ollama does this, as does llama-server from llama.cpp

steveharing1OP9d ago

You can try Open WebUI. Its genuinely useful when it comes to running open models locally with a clean interface

RationPhantoms9d ago

Yep, couple Open WebUI for general chats and OpenCode for software-specific tasks and it feels close to Claude Desktop and Claude Code.

mudkipdev9d ago

I re-created Claude's interface closely here, feel free to fork https://github.com/mudkipdev/chat

simonw9d ago

I've been mostly using LM Studio for this recently. Ollama has an OK chat UI now too. 'brew install llama.cpp' gets you 'llama-server' which provides quite a good web UI.

blurbleblurble9d ago

Codex cli is open source

lostmsu9d ago

v0.125.0 finally broke open models including their own gpt-oss over llama.cpp or vllm. I don't think they will fix it.

rangerelf9d ago

llama-server from the llama.cpp package has a local web interface.

steveharing1OP9d ago

yes. I've used it a lot. its very simple and good

Havoc10d ago

Interesting to see a pivot away from MoE by both IBM and mistral while the larger classes of SOTA of models all seem to be sticking to it.

Quick vibe check of it- 8B @ Q6 - seems promising. Bit of a clinical tone, but can see that being useful for data processing and similar. You don't really want a LLM that spams you with emojis sometimes...

embedding-shape10d ago

Makes sense, dense for small models, dense or MoE for larger ones, end up fitting various hardware setups pretty neatly, no need for MoE at smaller scale and dense too heavy at large scale.

npodbielski9d ago

I never want LLM to span me with emojis. What is the use case for that? I find it highly annoying.

Havoc9d ago

Think it can be a plus in moderation. eg in openclaw it can add some character

But yea dislike that style where each heading and bullet point gets an emoji

2ndorderthought9d ago

Shh people are paying for each token. Don't get them asking too many questions

0xbadcafebee9d ago

People complain a lot about LLM-written articles, but the human comments here on HN are far worse. Mostly a bunch of people extremely proud of themselves for not reading an LLM-written article, and then a bunch of people who take it at face value and make the model seem almost useful, and one comment that actually looked at other benchmarks. Good 'ol humanity, good at.. being emotional... and not doing analysis.....

The article makes some good points about model design (how different size models within a family can get similar results, how to filter out hallucination, math result reinforcement), so that's worth understanding. It's analyzing a paper, which only discussed 3 sizes of the same model family. But what the article doesn't say is, compared to other model families, Granite 4.1 8B sucks. The only benchmark it does well at compared to other models is non-hallucination and instruction following. Qwen 3.5 4B (among other models) easily outclass it on every other metric.

This article teaches a valuable lesson about reading articles in general. You can take useful information away from them (yes, despite being written by LLM). But you should also use critical thinking skills and be proactive to see if the article missed anything you might find relevant.

sureMan69d ago

The pro LLM rant is weird, LLMs "hallucinate" in creating detailed elaborate lies, the frontier models still do this egregiously, an LLM written article by default has 0 value since every single line could be true or it could be a convincingly crafted lie, every line has to be fact checked

I'm using Gemini 3.1 pro to help me research my thesis, it still with search enabled and on pro mode, invents entire papers that don't exist, and lies about the contents of existing papers to relate them to the context or to appease me, if I submitted an LLM written article based on the results its given me 80% of the article would be lies

Commenting to complain that the article is LLM written is helpful too since some people aren't able to distinguish

halJordan9d ago

No, you're being weird (why are you calling people weird anyway, not helpful).

You're complaining about facts that have been true since words have been written on paper. If you read the article with the same criticality you read any other article you wont have the problem you complain about.

The reality is, you're only complaining because you hate ai. Cool, but dont dress it up and resort to name calling to browbeat the other guy

kevin429d ago

If they can't distinguish LLM text, then why should they care?

Anti-AI people like to bring up hallucination as if everything AI generates is false.

I can write pages of text, with my own content, and then use AI to improve my writing and clarity. Then I review and edit. It might have some LLM markers in there, which I remove sometimes because it's distracting. But the final, AI assisted writing is easier to read and better organized. But all the ideas are mine. Hallucinations are not remotely a problem in this case.

Forgeties799d ago

If you can’t distinguish between fake images and real ones why should you care?

1 more reply

phkahler9d ago

>> The only benchmark it does well at compared to other models is non-hallucination and instruction following.

I think instruction following is going to be the most useful thing these models do. Add a voice interface and access to a bunch of simple, straight-forward devices or APIs and you have a mildly useful assistant. If that can be done in 8B parameters it will soon run on edge devices. That's solid usefulness.

encrux9d ago

Anything that beats alexa-level intelligence on an edge-device is what I'd call useful as well, which shouldn't be too hard.

It's mind-boggling how bad current voice assistants sometimes are when you prompt them some fairly easy questions.

haolez9d ago

The problem is the signal/noise ratio in these articles. If the AI has written the article, then this same info could have been generated by my own AI, but tailored to my needs. So what, exactly, is the new info that this article is generating that I can use to consult with my AI? That's what I want to get out of this interaction.

Maybe my point is something on the lines of "Just send me the prompt"[0]

[0] https://blog.gpkb.org/posts/just-send-me-the-prompt/

danielbln9d ago

prompt + all other bits of information the context has been seeded with before the output was created (documents, web searches, other sources) in which case it might be more efficient to just consume the final deliverable (yourself or via LLM).

simonw9d ago

"The article makes some good points about model design"

But how can I tell if those are good points or not?

I don't want to invest time in reading something if the presence of those "good points" depends on a roll of the dice.

geraneum9d ago

> the human comments here on HN are far worse

I already assume some comments here are LLM written.

mkovach9d ago

I just wait until I'm hallucinating, then I comment. Keeps the classifiers honest.

elxr9d ago

I mean, obviously.

I assume some people here have never programmed a single useful thing even once in their lives.

drob5189d ago

> But what the article doesn't say is, compared to other model families, Granite 4.1 8B sucks.

Right. This just says that Granite 4.1 8B is better than a previous version, Granite 4.0-H-Small, which has 32B, 9B active.

So, they made a less bad model than before. But that doesn't tell you anything about how it compares with other models.

whalesalad9d ago

The thing is it's just a bunch of other original content that has been chewed up and regurgitated into something "new". Just show us the original content instead. This is by definition, slop. https://huggingface.co/blog/ibm-granite/granite-4-1

100ms10d ago

> Full stop.

Why people don't edit out obvious sloppification and expect to still have readers left

wewewedxfgdf10d ago

Third line in to the article: "But there’s one result in the benchmarks I keep coming back to."

I hear this sort of thing all the time now on YouTube from media/news personalities:

“And that’s the part nobody seems to be talking about.”

"And here's what keeps me up at night."

“This is where the story gets complicated.”

“Here’s the piece that doesn’t quite fit.”

“And this is where the usual explanation starts to break down.”

“Here’s what I can’t stop thinking about.”

“The part that should worry us is not the obvious one.”

“And that’s where the real problem begins.”

“But the more interesting question is the one no one is asking.”

“And this is where things stop being simple.”

It doesn't really worry me but I think its interesting that LLM speak sounds so distinctive, and how willing these media personalities are to be so obvious in reading out on TV what the LLM spat out.

I've never studied what LLMs say in depth is it is interesting that my brain recognises the speech pattern so easily.

frereubu10d ago

I think this kind of language predates widespread LLM use, and has been picked up from that kind of writing. It's a "and here's where it gets interesting" pattern that people like Malcolm Gladwell and Freakonomics have used, even if the same thing could be said in a way that makes it sound much less intriguing.

cwillu10d ago

There's even a word for it: “cliché”

1 more reply

helsinkiandrew9d ago

Isn't this the format of "hook-driven media" a constant stream of "second-act pivots" - where some new twist is added to a story to re-engage the reader and keep them reading.

BuzzFeed and Upworthy etc pioneered this for web 'news stories', then it got used in linkedin, twitter, and everywhere where views are more important than the content.

jmbwell10d ago

The language of drama and import without meaningful substance. Words statistically likely to be used in a segue, regardless of the preceding or subsequent point. Particularly effective when it seems like you’re getting let in on a secret. Really fatiguing to read

A writing teacher once excoriated me for saying that something was important. “Don’t tell me it’s important, show me, and let me decide, and if you do your job I’ll agree”

I don’t know how a completion can tell when it needs to do this. Mostly so far it doesn’t seem capable

MarsIronPI9d ago

Maybe the solution is to cull the bad, cliché writing from the training data.

1 more reply

MarsIronPI9d ago

Ugh, you're making me remember the last time I listened to NPR. It's so bad.

stuff4ben9d ago

I listen to NPR daily and I don't think I've ever heard any of them use that phrasing.

bambax10d ago

I notice this very often in LinkedIn posts, and it's annoying, but I had not realized it was LLM-speak? Isn't it possible that people write like this naturally?

wewewedxfgdf10d ago

I think LLM's have that sort of "summarise, wrap it in a bow tie, give a little dramatic punch as a preview to the next few points".

1 more reply

spicyusername10d ago

Arguably it's exactly because it was used naturally so often that the LLMs parrot it so frequently.

trvz10d ago

Yes. Some people are very trigger happy in attributing human slop to LLMs.

nwatson9d ago

Nate B Jones videos ... YouTube channel "AI News and Strategy Daily" channel uses all of these. Every video.

bityard9d ago

I listened to a lot of NPR podcasts before LLM were around, and most of them are full of these kinds of filler phrases.

Lerc10d ago

Apparently John Oliver was an LLM before they were even invented.

cbg010d ago

So are we saying it's fine that the article is written by an LLM as long as it doesn't have the tell-tale signs of LLMs?

ramon15610d ago

It's more about curating the things you're publishing. Why would I bother reading what you couldn't bother to read?

alienbaby9d ago

They could easily have read it, and thought , that communicates the information that it needs to.

No point creating busywork for yourself just shuffling words around when the information is there, no?

I guess it depends on what you want out of the article. Substance, or style?

100ms10d ago

I don't really see reason to complain about tool use, so long as the result is cohesive, accurate and that ultimately means a human has at least read their own output before publishing. It's a bit like receiving a supposedly personal letter that starts "Dear [INSERT_FIRST_NAME_FIELD]," are you really going to read such a thing?

HighGoldstein10d ago

An article without telltale signs of an LLM is indistinguishable from an article written by a human, so yes.

spicyusername10d ago

My opinion is that literature and art will continue pushing the envelope in the places they always pushed the envelope. LLMs will not change this, humans love making art, and they love doing it in new ways.

Corporate announcements were never the places that literature and art were pushing the envelope. They were slop before, and they're slop now.

crunis10d ago

Are you referring to the literal use of the expression "full stop"? I don't see it anymore in the article, maybe they edited it out?

nielsbot9d ago

Very much an aside, but I'm struck by IBM's consistent iconic design language. For me it harkens all the way back to the futuristic design in 2001: A Space Odyssey from 1968. But you can also see it in their old mainframe hardware designs and other places.

simonw9d ago

The Granite 4.1 3B model is only 2GB from Unsloth: https://huggingface.co/unsloth/granite-4.1-3b-GGUF

I ran it in LM Studio and got a pleasingly abstract pelican on a bicycle (genuinely not bad for a tiny 3B model - it can at least output valid SVG): https://gist.github.com/simonw/5f2df6093885a04c9573cf5756d34...

tredre39d ago

Do you have any reasons to believe that granite is more immune to the effects of quantization than other tiny models? Otherwise it seems odd to judge a tiny model true capabilities by using its 4bit quant.

simonw9d ago

This model is small enough that it might be sensible to try the same prompts against all of the quant sizes to try and spot any differences.

simonw9d ago

This inspired me to give that a go: https://simonw.github.io/granite-4.1-3b-gguf-pelicans/

1 more reply

pjmalandrino9d ago

Very impressive series of SLM by IBM here.

I have been using it with their Chunkless RAG concept and it is fitting very well! (for curious https://github.com/scub-france/Docling-Studio)

I convinced that SLM are a real parto of solution for true integrated AI in process...

dash29d ago

Nah, I ain't reading that. If they can't be bothered to get a human to write it, it can't be that important. I'm glad for them though. Or sorry that happened.

osener9d ago

This is the official announcement: https://research.ibm.com/blog/granite-4-1-ai-foundation-mode...

It is not the researchers' fault that some slop got posted here instead.

tosh10d ago

IBM announcement: https://research.ibm.com/blog/granite-4-1-ai-foundation-mode...

dimitrismrtzs9d ago

The 8B class closing the gap with 32B is the real story of 2026 for anyone running models locally. I've been using smaller models for agent tool-use and the progress this year is real.

The gap that still matters most isn't intelligence — it's consistency on structured output. When you chain 5+ tool calls in sequence, even a small per-call reliability difference compounds fast. Would love to see Granite 4.1 benchmarked specifically on multi-step function calling rather than just general benchmarks.

agunapal10d ago

If you really think about why MoE came into existence, its to save significant cost during training, I don't think there was any concrete evidence of performance gains for comparable MoE vs dense models. Over the years, I believe all the new techniques being employed in post training have made the models better.

vessenes10d ago

I think you mean inference compute? I believe all expert weights are updated in each backward pass during MoE training. The first benefit was getting a sort of structured pruning of weights through the mechanism of expert selection so that the model didn’t need to go through ‘unnecessary’ parts of the model for a given token. This then let inference use memory more efficiently in memory constrained environments, where non-hot or less common experts could be put into slow RAM, or sometimes even streamed off storage.

But I don’t think it necessarily saved training cost; if it did, I’d be interested to learn how!

bjourne9d ago

Each token is only routed through a few chosen (topk) experts during training. So not all expert weights are updated in the backward pass. Otoh, you may need more training to ensure all experts see enough tokens!

I doubt MoE is actually worth it, given how complicated high-performance expert routing and training is. But who knows, I don't.

zozbot23410d ago

MoE models will have far more world knowledge than dense models with the same amount of active parameters. MoE is a no-brainer if your inference setup is ultimately limited by compute or memory throughput - not total memory footprint - or alternately if it has fast, high-bandwidth access to lower-tier storage to fetch cold model weights from on demand.

regularfry9d ago

Yes, this. I can run the 122B Qwen3.5 MoE usably on one 4090 + 64GB RAM. That's a monster of a model, comparatively speaking.

aitchnyu9d ago

Tangential. I'm a newb, can you name the concept of partitioning weights so we dont need to load whole thing?

woadwarrior019d ago

The most salient thing about these models is that they're non-reasoning models. This makes then very token efficient and particularly well suited for local inference where decoding is usually slower than with datacenter GPUs.

Link to HF collection: https://huggingface.co/collections/ibm-granite/granite-41-la...

lostmsu8d ago

Probably worse than Gemma 4 or Qwen 3.6 with thinking off.

mdp20219d ago

I read that IBM pioneered the concept of "shifting through "mid-training" from "guessing the next token" to "guessing the next logical step"". I am wondering how far is the research from "enhancing apparent reasoning" to "achieving solid, reliable reasoning".

If techniques existed to shift from "guess the next highly probable" token to "guess the best next logical step", as some interpreted said research, should not that be the foremost objective?

dissahc9d ago

qwen3.5 9b outperforms granite 4.1 30b by a huge amount (32 vs 15 on artificialanalysis benchmark)... i have no idea what made the writer of this article say so many demonstrably incorrect things

RandyOrion9d ago

Although the performance claim of 8b dense matching 32b moe is somewhat questionable, thank you granite team for releasing small dense LLMs.

latentframe9d ago

The limit is changing from scaling parameters to scaling datas quality however compute is still the big constraint

mdp202110d ago

Wish they also released an embedding model, in the line of their previous: compact (while good)...

ibgeek9d ago

They did:

https://huggingface.co/collections/ibm-granite/granite-embed...

311M and 97M versions.

steveharing1OP9d ago

Thanks for letting me know

RugnirViking10d ago

sounds interesting. Here's hoping they release a 32B model, thats a pretty good sweet spot for feasibility of home setups.

edit: I just realised they do actually have a 30b release alongside this. Haven't tried it yet.

2ndorderthought10d ago

Try qwen 3.6. it will knock your socks off

1 more reply

SwellJoe9d ago

I wish AI slop articles were somehow automatically flagged and deaded. They're all flowery verbose piles of crap. Yeah, the model is interesting, but the article is trash. I can't believe real humans are willing to sign their name to this stuff.

theblazehen9d ago

> models are judged by GPT-4

An interesting choice

sexylinux8d ago

Is this a model that will create reliable output or will it also produce errors?

cubefox9d ago

It's strange that they don't include reasoning training (RLVR). Their justification doesn't sound convincing:

> While reasoning models have grown in popularity in recent years, their abilities aren’t always the most efficient way to get a result. In enterprise settings, token costs and speed are often as important as performance. That is why turning to less expensive, non-reasoning models with similar benchmark performance for select tasks like instruction following and tool calling makes sense for enterprise users.

I guess they currently don't have the ability to do proper RLVR.

mdp20219d ago

I may have misunderstood: is not reasoning training (RLVR) independent from the use of the "<think>" tags - is it not a method that improves results in reasoning? How do we know that it was not carried out?

Incidentally: I am trying to spend some time researching in the progresses in the area (the jump from parroting, to inconsistent apparent reasoning, to reliable reasoning).

peter_d_sherman8d ago

>"Stage two was RLHF training on general chat prompts using a reward model to improve helpfulness. This worked. AlpacaEval scores jumped around 18.9 points on average compared to the fine-tuned checkpoints.

Then something broke. The RLHF stage, while improving chat quality, caused math benchmark scores to drop. GSM8K and DeepMind-Math both regressed."

Observation: Math (which when fully decomposed, results in Logic) is at the core of how computers (traditional/older, non-LLM, programming languages work. If an LLM gets Math training wrong at any stage for any reason, then, in my opinion, that should be viewed as something that needs to be fixed at a lower level, not a higher one; not a later training level...

I think it would be interesting exercise to train an LLM that only deals in simple Math, simple English, and only the ability to compute simple equations (+,-,x,/)... like, what's the absolute minimum in terms of text and layers necessary to train a model like that?

I think some interesting understandings could be potentially be had by experimentation like that...

I myself would love a pure (simplest, smallest possible)

Text-to-Math only LLM (TTMLLM, TTMSLM?)

, along with all of the necessary corpuses (which would ideally be as small as possible) and instructions necessary to train such an LLM...

nickpsecurity9d ago

On changing the training mix, H20 did that with Danube in 2024:

https://arxiv.org/pdf/2401.16818

With those results, I would've already done that in any models I got to train. There's also the principle that the LLM's are often better at what they saw last in their training set. That also justifies putting more logic, code, and math in at the end for an analytical or coding model. So, a few precedents for that technique already.

rasz9d ago

me: "Bits 5-4 indicate the video mode", how do you derive VIDEO_MASK_00b = 0x03C0 from this?

Granite 4.1 8B: "I apologize for the confusion earlier. Let's clarify the derivation of the VIDEO__MASK_00b based on the description "Bits 5-6 indicate the video mode" from the equipment_list section.

Given that bits 5 and 6 are used for video mode, we need to create a mask that isolates these two bits:

Bit 5 is the least significant bit of the group (2^5 = 32 decimal, 0b00100000 in 8-bit form). Bit 6 is the next bit up (2^6 = 64 decimal, 0b01000000 in 8-bit form). Combining these bits: ...

Therefore, the correct mask for VIDEO_MASK_00b should be 0x00C0"

Errors on top of errors when converting description into binary numbers. Its hopeless for basic task like parsing/generating headers :(

robotmaxtron9d ago

"open source"

show me.

jasonlotito9d ago

Apache 2.0 License. Did you not click the link to the project? They even list it in the article.

> Apache 2.0 across the board, so commercial use is clean.

Did you just stop when you saw open source and come post this here because you couldn't be bothered to... look at the project and see it's cleanly and clearly listed.

Edit: Like. I get it. It's fine to question open source. But this isn't hidden. It's repeated and made clear multiple times. They even link to the license: https://www.apache.org/licenses/LICENSE-2.0

It wasn't hidden, it wasn't in some weird, out-of-the-way place. In fact, I found it so easily that I genuinely questioned whether it was real because of your comment. Like, why would anyone post what you posted if it was this easy to find?

NOPE! It was right there.

speedgoose9d ago

If I give you an amd64 elf binary under Apache2 license, is it open source?

j / k navigate · click thread line to collapse

202 comments

2ndorderthought10d ago

I test drove it yesterday. It's pretty impressive at 8b. Runs on commodity hardware quickly.

The 4b they released was not good for my needs but could probably handle tool calls or something

vessenes10d ago

Have you tried the Gemma 4 series, out of curiosity? I haven’t run a local model in a while, but the benchmarks look good. I’d take a free local tool-use model if it was relatively consistent.

v3ss0n9d ago

Qwen 3.6 burns it to the ground. it was not even a challenge. Gemma4 seriously fails at toolcalls and agentic works. It got all messed up after 2-3 turns of Vibecoding.

xrd9d ago

How do you run it? vllm? llama.cpp?

Can you share some parameters you enable tool calling and agentic usage?

Or, higher level, some philosophies on what approaches you are using for tuning to get better tool calling and/or agentic usage?

I'm having surprisingly good success with unsloth/Qwen3.6-27B-GGUF:Q4_K_M (love unsloth guys) on my RTX3090/24GB using opencode as the orchestrator.

It concocts some misleading paths, but the code often compiles, and I consider that a victory.

You have to watch it like you would watch a 14 year old boy who says he is doing his homework but you hear the sound effects of explosions.

59nadir9d ago

2ndorderthought9d ago

Gemma4 is definitely not used for vibe/agentic coding. Not even worth trying. But its a different weight class.

lambda9d ago

The Qwen models are quite solid though.

zkmon9d ago

2ndorderthought10d ago

I tried the Gemma 4 I think 2 and 4b. The 2b was not useful for me at all. A little too weak for my use cases

throwaw129d ago

can you share your use cases for 2b and 4b models?

curious how people are leveraging these models

1 more reply

steveharing1OP10d ago

Yea, No doubt Qwen 3.6 open weights are far more strong

rnadomvirlabe10d ago

Why no doubt?

captainbland10d ago

2ndorderthought10d ago

Qwen 36 is effectively a pocket sized frontier model. It's really surprising for me anyway

steveharing1OP10d ago

Because Qwen 3.6 pushes way above its weight. Granite 8B is impressive, but Qwen still wins on raw capability, especially for coding.

3 more replies

cyanydeez9d ago

Qwen3-Coder-Next seems to be perfect sized for coding. I tried the new and just found the verbosity not really useful for coding. But probably for more analytical tasks or writing docs.

m3at9d ago

https://research.ibm.com/blog/granite-4-1-ai-foundation-mode...

Original article on IBM research

Hugging face weights: https://huggingface.co/collections/ibm-granite/granite-41-la...

cbg010d ago

The real "sleeper" might be https://huggingface.co/ibm-granite/granite-vision-4.1-4b if the benchmarks hold up for such a small model against frontier models for table & semantic k:v extraction.

uf00lme9d ago

Woah, is this part of the future of models? Basically little models you can use as tools.

tonyarkles9d ago

https://www.docling.ai/

I don’t know how many difference little models this uses under the hood, but I was shocked at how good it was at the couple document extraction tasks I threw it at.

2ndorderthought9d ago

It's looking like running your own mini ecosystem is the way of the future to me. No data centers, just a decent GPU 16-24gb of VRAM, CPU, and 32gb of RAM.

Lalabadie9d ago

This is Apple's bet, among others.

Training purpose-specific miniature models lets you have a lot of tasks you can run with high confidence on consumer hardware.

1 more reply

SecretDreams9d ago

Eventually we'll have models small enough to do a single thing really well and we'll call them functions.

hathym9d ago

True if you can write a function that summerize an article for example

cyanydeez9d ago

smj-edison9d ago

0xbadcafebee9d ago

Yes but not exactly.

- There are some pure-browser chat interfaces that are like llama-server but you can use remote LLMs. This is closer to what you want, but everything is stored in the browser, so backup is harder.

- There's LocalAI, which is like the llama-server option, but more stuff is built in and it persists data to disk. It's flashy and very easy if all you want to do is local AI.

- There's LM Studio, which is another thing like LocalAI, but a desktop app.

- There's OpenCode's web UI, which is like OpenHands, but less crappy.

- There's Jan, which is probably what you want. It's a desktop app rather than a web UI.

lostmsu9d ago

Unfortunately it is pretty buggy, so I am maintaining a fork matching my personal needs with bugfixes and a few extra features.

SwellJoe9d ago

Svoka9d ago

With Ollama* you can use Claude Code with `ollama launch claude`

* https://docs.ollama.com/integrations/claude-code

rglullis9d ago

Open WebUI or Jan (https://www.jan.ai/). Work well with Ollama.

camdv9d ago

Ollama does this, as does llama-server from llama.cpp

steveharing1OP9d ago

You can try Open WebUI. Its genuinely useful when it comes to running open models locally with a clean interface

RationPhantoms9d ago

Yep, couple Open WebUI for general chats and OpenCode for software-specific tasks and it feels close to Claude Desktop and Claude Code.

mudkipdev9d ago

I re-created Claude's interface closely here, feel free to fork https://github.com/mudkipdev/chat

simonw9d ago

I've been mostly using LM Studio for this recently. Ollama has an OK chat UI now too. 'brew install llama.cpp' gets you 'llama-server' which provides quite a good web UI.

blurbleblurble9d ago

Codex cli is open source

lostmsu9d ago

v0.125.0 finally broke open models including their own gpt-oss over llama.cpp or vllm. I don't think they will fix it.

rangerelf9d ago

llama-server from the llama.cpp package has a local web interface.

steveharing1OP9d ago

yes. I've used it a lot. its very simple and good

Havoc10d ago

Interesting to see a pivot away from MoE by both IBM and mistral while the larger classes of SOTA of models all seem to be sticking to it.

embedding-shape10d ago

Makes sense, dense for small models, dense or MoE for larger ones, end up fitting various hardware setups pretty neatly, no need for MoE at smaller scale and dense too heavy at large scale.

npodbielski9d ago

I never want LLM to span me with emojis. What is the use case for that? I find it highly annoying.

Havoc9d ago

Think it can be a plus in moderation. eg in openclaw it can add some character

But yea dislike that style where each heading and bullet point gets an emoji

2ndorderthought9d ago

Shh people are paying for each token. Don't get them asking too many questions

0xbadcafebee9d ago

sureMan69d ago

Commenting to complain that the article is LLM written is helpful too since some people aren't able to distinguish

halJordan9d ago

No, you're being weird (why are you calling people weird anyway, not helpful).

The reality is, you're only complaining because you hate ai. Cool, but dont dress it up and resort to name calling to browbeat the other guy

kevin429d ago

If they can't distinguish LLM text, then why should they care?

Anti-AI people like to bring up hallucination as if everything AI generates is false.

Forgeties799d ago

If you can’t distinguish between fake images and real ones why should you care?

1 more reply

phkahler9d ago

>> The only benchmark it does well at compared to other models is non-hallucination and instruction following.

encrux9d ago

Anything that beats alexa-level intelligence on an edge-device is what I'd call useful as well, which shouldn't be too hard.

It's mind-boggling how bad current voice assistants sometimes are when you prompt them some fairly easy questions.

haolez9d ago

Maybe my point is something on the lines of "Just send me the prompt"[0]

[0] https://blog.gpkb.org/posts/just-send-me-the-prompt/

danielbln9d ago

simonw9d ago

"The article makes some good points about model design"

But how can I tell if those are good points or not?

I don't want to invest time in reading something if the presence of those "good points" depends on a roll of the dice.

geraneum9d ago

> the human comments here on HN are far worse

I already assume some comments here are LLM written.

mkovach9d ago

I just wait until I'm hallucinating, then I comment. Keeps the classifiers honest.

elxr9d ago

I mean, obviously.

I assume some people here have never programmed a single useful thing even once in their lives.

drob5189d ago

> But what the article doesn't say is, compared to other model families, Granite 4.1 8B sucks.

Right. This just says that Granite 4.1 8B is better than a previous version, Granite 4.0-H-Small, which has 32B, 9B active.

So, they made a less bad model than before. But that doesn't tell you anything about how it compares with other models.

whalesalad9d ago

100ms10d ago

> Full stop.

Why people don't edit out obvious sloppification and expect to still have readers left

wewewedxfgdf10d ago

Third line in to the article: "But there’s one result in the benchmarks I keep coming back to."

I hear this sort of thing all the time now on YouTube from media/news personalities:

“And that’s the part nobody seems to be talking about.”

"And here's what keeps me up at night."

“This is where the story gets complicated.”

“Here’s the piece that doesn’t quite fit.”

“And this is where the usual explanation starts to break down.”

“Here’s what I can’t stop thinking about.”

“The part that should worry us is not the obvious one.”

“And that’s where the real problem begins.”

“But the more interesting question is the one no one is asking.”

“And this is where things stop being simple.”

It doesn't really worry me but I think its interesting that LLM speak sounds so distinctive, and how willing these media personalities are to be so obvious in reading out on TV what the LLM spat out.

I've never studied what LLMs say in depth is it is interesting that my brain recognises the speech pattern so easily.

frereubu10d ago

cwillu10d ago

There's even a word for it: “cliché”

1 more reply

helsinkiandrew9d ago

Isn't this the format of "hook-driven media" a constant stream of "second-act pivots" - where some new twist is added to a story to re-engage the reader and keep them reading.

BuzzFeed and Upworthy etc pioneered this for web 'news stories', then it got used in linkedin, twitter, and everywhere where views are more important than the content.

jmbwell10d ago

A writing teacher once excoriated me for saying that something was important. “Don’t tell me it’s important, show me, and let me decide, and if you do your job I’ll agree”

I don’t know how a completion can tell when it needs to do this. Mostly so far it doesn’t seem capable

MarsIronPI9d ago

Maybe the solution is to cull the bad, cliché writing from the training data.

1 more reply

MarsIronPI9d ago

Ugh, you're making me remember the last time I listened to NPR. It's so bad.

stuff4ben9d ago

I listen to NPR daily and I don't think I've ever heard any of them use that phrasing.

bambax10d ago

I notice this very often in LinkedIn posts, and it's annoying, but I had not realized it was LLM-speak? Isn't it possible that people write like this naturally?

wewewedxfgdf10d ago

I think LLM's have that sort of "summarise, wrap it in a bow tie, give a little dramatic punch as a preview to the next few points".

1 more reply

spicyusername10d ago

Arguably it's exactly because it was used naturally so often that the LLMs parrot it so frequently.

trvz10d ago

Yes. Some people are very trigger happy in attributing human slop to LLMs.

nwatson9d ago

Nate B Jones videos ... YouTube channel "AI News and Strategy Daily" channel uses all of these. Every video.

bityard9d ago

I listened to a lot of NPR podcasts before LLM were around, and most of them are full of these kinds of filler phrases.

Lerc10d ago

Apparently John Oliver was an LLM before they were even invented.

cbg010d ago

So are we saying it's fine that the article is written by an LLM as long as it doesn't have the tell-tale signs of LLMs?

ramon15610d ago

It's more about curating the things you're publishing. Why would I bother reading what you couldn't bother to read?

alienbaby9d ago

They could easily have read it, and thought , that communicates the information that it needs to.

No point creating busywork for yourself just shuffling words around when the information is there, no?

I guess it depends on what you want out of the article. Substance, or style?

100ms10d ago

HighGoldstein10d ago

An article without telltale signs of an LLM is indistinguishable from an article written by a human, so yes.

spicyusername10d ago

Corporate announcements were never the places that literature and art were pushing the envelope. They were slop before, and they're slop now.

crunis10d ago

Are you referring to the literal use of the expression "full stop"? I don't see it anymore in the article, maybe they edited it out?

nielsbot9d ago

simonw9d ago

The Granite 4.1 3B model is only 2GB from Unsloth: https://huggingface.co/unsloth/granite-4.1-3b-GGUF

tredre39d ago

simonw9d ago

This model is small enough that it might be sensible to try the same prompts against all of the quant sizes to try and spot any differences.

simonw9d ago

This inspired me to give that a go: https://simonw.github.io/granite-4.1-3b-gguf-pelicans/

1 more reply

pjmalandrino9d ago

Very impressive series of SLM by IBM here.

I have been using it with their Chunkless RAG concept and it is fitting very well! (for curious https://github.com/scub-france/Docling-Studio)

I convinced that SLM are a real parto of solution for true integrated AI in process...

dash29d ago

Nah, I ain't reading that. If they can't be bothered to get a human to write it, it can't be that important. I'm glad for them though. Or sorry that happened.

osener9d ago

This is the official announcement: https://research.ibm.com/blog/granite-4-1-ai-foundation-mode...

It is not the researchers' fault that some slop got posted here instead.

tosh10d ago

IBM announcement: https://research.ibm.com/blog/granite-4-1-ai-foundation-mode...

dimitrismrtzs9d ago

The 8B class closing the gap with 32B is the real story of 2026 for anyone running models locally. I've been using smaller models for agent tool-use and the progress this year is real.

agunapal10d ago

vessenes10d ago

But I don’t think it necessarily saved training cost; if it did, I’d be interested to learn how!

bjourne9d ago

I doubt MoE is actually worth it, given how complicated high-performance expert routing and training is. But who knows, I don't.

zozbot23410d ago

regularfry9d ago

Yes, this. I can run the 122B Qwen3.5 MoE usably on one 4090 + 64GB RAM. That's a monster of a model, comparatively speaking.

aitchnyu9d ago

Tangential. I'm a newb, can you name the concept of partitioning weights so we dont need to load whole thing?

woadwarrior019d ago

Link to HF collection: https://huggingface.co/collections/ibm-granite/granite-41-la...

lostmsu8d ago

Probably worse than Gemma 4 or Qwen 3.6 with thinking off.

mdp20219d ago

If techniques existed to shift from "guess the next highly probable" token to "guess the best next logical step", as some interpreted said research, should not that be the foremost objective?

dissahc9d ago

qwen3.5 9b outperforms granite 4.1 30b by a huge amount (32 vs 15 on artificialanalysis benchmark)... i have no idea what made the writer of this article say so many demonstrably incorrect things

RandyOrion9d ago

Although the performance claim of 8b dense matching 32b moe is somewhat questionable, thank you granite team for releasing small dense LLMs.

latentframe9d ago

The limit is changing from scaling parameters to scaling datas quality however compute is still the big constraint

mdp202110d ago

Wish they also released an embedding model, in the line of their previous: compact (while good)...

ibgeek9d ago

They did:

https://huggingface.co/collections/ibm-granite/granite-embed...

311M and 97M versions.

steveharing1OP9d ago

Thanks for letting me know

RugnirViking10d ago

sounds interesting. Here's hoping they release a 32B model, thats a pretty good sweet spot for feasibility of home setups.

edit: I just realised they do actually have a 30b release alongside this. Haven't tried it yet.

2ndorderthought10d ago

Try qwen 3.6. it will knock your socks off

1 more reply

SwellJoe9d ago

theblazehen9d ago

> models are judged by GPT-4

An interesting choice

sexylinux8d ago

Is this a model that will create reliable output or will it also produce errors?

cubefox9d ago

It's strange that they don't include reasoning training (RLVR). Their justification doesn't sound convincing:

I guess they currently don't have the ability to do proper RLVR.

mdp20219d ago

Incidentally: I am trying to spend some time researching in the progresses in the area (the jump from parroting, to inconsistent apparent reasoning, to reliable reasoning).

peter_d_sherman8d ago

Then something broke. The RLHF stage, while improving chat quality, caused math benchmark scores to drop. GSM8K and DeepMind-Math both regressed."

I think some interesting understandings could be potentially be had by experimentation like that...

I myself would love a pure (simplest, smallest possible)

Text-to-Math only LLM (TTMLLM, TTMSLM?)

, along with all of the necessary corpuses (which would ideally be as small as possible) and instructions necessary to train such an LLM...

nickpsecurity9d ago

On changing the training mix, H20 did that with Danube in 2024:

https://arxiv.org/pdf/2401.16818

rasz9d ago

me: "Bits 5-4 indicate the video mode", how do you derive VIDEO_MASK_00b = 0x03C0 from this?

Given that bits 5 and 6 are used for video mode, we need to create a mask that isolates these two bits:

Bit 5 is the least significant bit of the group (2^5 = 32 decimal, 0b00100000 in 8-bit form). Bit 6 is the next bit up (2^6 = 64 decimal, 0b01000000 in 8-bit form). Combining these bits: ...

Therefore, the correct mask for VIDEO_MASK_00b should be 0x00C0"

Errors on top of errors when converting description into binary numbers. Its hopeless for basic task like parsing/generating headers :(

robotmaxtron9d ago

"open source"

show me.

jasonlotito9d ago

Apache 2.0 License. Did you not click the link to the project? They even list it in the article.

> Apache 2.0 across the board, so commercial use is clean.

Did you just stop when you saw open source and come post this here because you couldn't be bothered to... look at the project and see it's cleanly and clearly listed.

Edit: Like. I get it. It's fine to question open source. But this isn't hidden. It's repeated and made clear multiple times. They even link to the license: https://www.apache.org/licenses/LICENSE-2.0

NOPE! It was right there.

speedgoose9d ago

If I give you an amd64 elf binary under Apache2 license, is it open source?

j / k navigate · click thread line to collapse