I've been trying it on an M2 64GB via both Ollama and MLX. It's very, very good, and it only uses ~22GB (via Ollama) or ~15GB (MLX), leaving plenty of memory for running other apps.
Some notes here: https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/
Last night I had it write me a complete plugin for my LLM tool like this:
llm install llm-mlx
llm mlx download-model mlx-community/gemma-3-27b-it-qat-4bit
llm -m mlx-community/gemma-3-27b-it-qat-4bit \
-f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
-f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
-s 'Write a new fragments plugin in Python that registers
issue:org/repo/123 which fetches that issue
number from the specified github repo and uses the same
markdown logic as the HTML page to turn that into a
fragment'
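To make the `issue:org/repo/123` argument in that prompt concrete, here is a minimal sketch of the parsing step only. It is not the code the model produced: `IssueRef`, `parse_issue_argument` and `issue_api_url` are hypothetical names, and the fetching and Markdown conversion are left out entirely.

```python
from typing import NamedTuple


class IssueRef(NamedTuple):
    """Hypothetical container for the three parts of 'org/repo/123'."""
    owner: str
    repo: str
    number: int


def parse_issue_argument(argument: str) -> IssueRef:
    """Split 'org/repo/123' into owner, repo and issue number."""
    parts = argument.strip().split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected org/repo/123, got {argument!r}")
    owner, repo, number = parts
    if not number.isdigit():
        raise ValueError(f"issue number must be numeric, got {number!r}")
    return IssueRef(owner, repo, int(number))


def issue_api_url(ref: IssueRef) -> str:
    """Build the GitHub REST API URL for a single issue."""
    return f"https://api.github.com/repos/{ref.owner}/{ref.repo}/issues/{ref.number}"


if __name__ == "__main__":
    ref = parse_issue_argument("simonw/llm/123")  # example argument, not from the post
    print(issue_api_url(ref))
```

A real plugin would still need to register a loader with LLM and turn the fetched JSON into Markdown; this only shows how the argument could be validated before any network call.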
It gave a solid response! https://gist.github.com/simonw/feccff6ce3254556b848c27333f52... - more notes here: https://simonwillison.net/2025/Apr/20/llm-fragments-github/
More and more I realize that cost saving is a minor consideration for local LLMs. If a model is too slow it becomes unusable, to the point that you might as well use public LLM endpoints, unless you really care about getting things done locally without sending information to another server.
With the OpenAI API/ChatGPT, I get responses much faster than I can read, so for a simple question I just need to glance at the response, copy and paste, and get things done. Whereas with a local LLM, I watch it painstakingly print preambles that I don't care about, and get what I actually need after 20 seconds (on a fast GPU).
And that's before getting into context window limits, etc.
I have been researching how people integrate local LLMs into their workflows. My finding is that most people play with them for a short time and that's about it, and most people are much better off spending money on OpenAI credits (which can last a very long time with typical usage) than getting a beefed-up Mac Studio or building a machine with a 4090.
I agree that hosted models are usually a better option for most people - much faster, higher quality, handle longer inputs, really cheap.
I enjoy local models for research and for the occasional offline scenario.
I'm also interested in their applications for journalism, specifically for dealing with extremely sensitive data like leaked information from confidential sources.
You may need to "right-size" the models you use to match your hardware, model, and TPS expectations, which may involve using a smaller version of the model with faster TPS, upgrading your hardware, or paying for hosted models.
Alternatively, if you can use agentic workflows or tools like Aider, you don't have to watch the model work slowly with large models locally. Instead you queue work for it, go to sleep, or eat, or do other work, and then much later look over the Pull Requests whenever it completes them.
`ollama run gemma3:27b --verbose` gives me 42.5 TPS ±0.3 TPS
`ollama run gemma3:27b-it-qat --verbose` gives me 41.5 TPS ±0.3 TPS
Strange results; the full model gives me slightly more TPS.
query: "make me a snake game in python with pygame"
(mlx 4 bit quant) mlx-community/gemma-3-27b-it-qat@4bit: 26.39 tok/sec • 1681 tokens 0.63s to first token
(gguf 4 bit quant) lmstudio-community/gemma-3-27b-it-qat: 22.72 tok/sec • 1866 tokens 0.49s to first token
using Unsloth's settings: https://docs.unsloth.ai/basics/tutorial-how-to-run-and-fine-...
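To put those two runs on the same footing as the "how long until I have my answer" complaint earlier in the thread, here is a rough sketch that turns the reported figures into total wall-clock time. It assumes the reported tok/sec excludes the time to first token, which the numbers above don't confirm; `wall_time_seconds` is a made-up helper name.

```python
def wall_time_seconds(tokens: int, tok_per_sec: float, ttft_s: float) -> float:
    """Time to first token plus steady-state generation time."""
    return ttft_s + tokens / tok_per_sec


# Figures copied from the two runs reported above.
runs = {
    "mlx 4bit": {"tokens": 1681, "tok_per_sec": 26.39, "ttft_s": 0.63},
    "gguf 4bit": {"tokens": 1866, "tok_per_sec": 22.72, "ttft_s": 0.49},
}

if __name__ == "__main__":
    for name, run in runs.items():
        print(f"{name}: about {wall_time_seconds(**run):.1f} s end to end")
```

Under that assumption both runs take over a minute for the full answer, so the gap in tok/sec matters more than the gap in time to first token.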
LM Studio running on a Mac Studio M4 Max with 128GB, gemma-3-27B-it-QAT-Q4_0.gguf with a 4096 token context: I get 8.89 tps.
(-t 32 -ngl 100 -c 8192 -fa -ctk q8_0 -ctv q8_0 -m models/gemma-3-27b-it-qat-q4_0.gguf)
There is another aspect to consider, aside from privacy.
These models are trained by downloading every scrap of information from the internet, including the works of many, many authors who have never consented to that. And they for sure are not going to get a share of the profits, if there are ever going to be any. If you use a cloud provider, you are basically saying that is all fine. You are happy to pay them, and make yourself dependent on their service, based on work that wasn't theirs to use.
However, if you use a local model, the authors still did not give consent, but one could argue that the company that made the model is at least giving back to the community. They don't get any money out of it, and you are not becoming dependent on their hyper capitalist service. No rent-seeking. The benefits of the work are free to use for everyone. This makes using AI a little more acceptable from a moral standpoint.
That's not the stuff you want to send to a public API, this is something you want as a 24/7 locally running batch job.
("AI assistant" is an evolutionary dead end, and Star Trek be damned.)
You'll find the id a418f5838eaf which also corresponds to 27b-it-q4_K_M
Is that small enough to run well (without thrashing) on a system with only 16GiB RAM?
Any tips on generating long output? Like multiple pages of a document, a story, a play or even a book?
By comparison, Gemma3's output (both 12b and 27b) seems to typically be longer and more verbose, but not problematically so.
Why is the memory use different? Are you using different context size in both set-ups?
https://ollama.com/library/gemma3:27b-it-qat says it's Q4_0. https://huggingface.co/mlx-community/gemma-3-27b-it-qat-4bit says it's 4bit. I think those are the same quantization?
The community getting obsessed with Ollama has done huge damage to the field, as it's inefficient compared to vLLM. Many people can get far more tok/s than they think they could if only they knew the right tools.
It is important to know about both to decide between the two for your use case though.
> Be kind. Don't be snarky.
> Please don't post shallow dismissals, especially of other people's work.
In my opinion, your comment is not in line with the guidelines. Especially the part about sillytavern being the only LLM frontend that matters. Telling the devs of any LLM frontend except sillytavern that their app doesn't matter seems exactly like a shallow dismissal of other people's work to me.
For some reason, it only uses around 7GB of VRAM, probably due to how the layers are scheduled, maybe I could tweak something there, but didn't bother just for testing.
Obviously, perf depends on CPU, GPU and RAM, but on my machine (3060 + i5-13500) it's around 2 t/s.
To actually use a model, you need a context window. Realistically, you'll want a 20GB GPU or larger, depending on how many tokens you need.
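A back-of-the-envelope way to see where a figure like 20GB could come from is to add the weights to the KV cache. This is only a sketch: the layer count, KV-head count and head dimension below are illustrative assumptions rather than numbers from this thread, 4.5 bits per weight assumes 4-bit values plus per-block scales, and it ignores runtime overhead and any sliding-window savings.

```python
GIB = 1024 ** 3


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Memory for the quantized weights alone."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: float = 2.0) -> float:
    """K and V tensors for every layer, one entry per token in the context."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


if __name__ == "__main__":
    # ASSUMED architecture values for illustration; check the model config.
    layers, kv_heads, head_dim = 62, 16, 128
    weights = weight_bytes(27e9, 4.5)
    for context in (4096, 8192, 32768):
        cache = kv_cache_bytes(context, layers, kv_heads, head_dim)
        total = (weights + cache) / GIB
        print(f"{context:>6} tokens: weights {weights / GIB:.1f} GiB "
              f"+ KV cache {cache / GIB:.1f} GiB = {total:.1f} GiB")
```

The point is the shape of the calculation: the weights are a fixed cost, while the cache grows linearly with the number of tokens you want in context, and quantizing the cache lowers `bytes_per_value`.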
Prompt Tokens: 10
Time: 229.089 ms
Speed: 43.7 t/s
Generation Tokens: 41
Time: 959.412 ms
Speed: 42.7 t/s
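The speeds in that log are just tokens divided by elapsed time, which is easy to sanity-check. A small sketch (`tokens_per_second` is a helper name made up here):

```python
def tokens_per_second(tokens: int, elapsed_ms: float) -> float:
    """Throughput from a token count and an elapsed time in milliseconds."""
    return tokens / (elapsed_ms / 1000.0)


if __name__ == "__main__":
    # Values copied from the log above.
    print(f"prompt:     {tokens_per_second(10, 229.089):.1f} t/s")
    print(f"generation: {tokens_per_second(41, 959.412):.1f} t/s")
```

Both results round to the logged 43.7 and 42.7 t/s. With only 10 prompt tokens the prompt-processing figure says little about how a long prompt would behave.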
If you want a bit more context, try -ctv q8 -ctk q8 (from memory so look it up) to quant the kv cache.
Also an imatrix gguf like iq4xs might be smaller with better quality
That said, if you really care, it generates faster than reading speed (on an A18 based model at least).
For basic conversation and RAG, you can use tinyllama or qwen-2.5-0.5b, both of which run on a raspberry pi at around 5-20 tokens per second
The high end consumer card from Nvidia is the RTX 5090, and the professional version of the card is the RTX PRO 6000.
The 5070 or 5070ti are mid range. They cost 650/900€.
I was both impressed and disappointed. Setup was piss easy, and the models are great conversationalists. I have a 12 gig card available and the 12B model ran very nice and swift.
However, they're seemingly terrible at actually assisting with stuff. Tried something very basic: asked for a powershell one liner to get the native blocksize of my disks. Ended up hallucinating fields, then telling me to go off into the deep end, first elevating to admin, then using WMI, then bringing up IOCTL. Pretty unfortunate. Not sure I'll be able to put it to actual meaningful use as a result.
My best guess is that there's not enough discussion/development related to Powershell in training data.
If this is your main use case you can always try to fine tune a model. I maintain a small llm bench of different programming languages and the performance difference between say Python and Rust on some smaller models is up to 70%
gemma3:27b-it-qat with open-codex, running locally, is just amazingly useful, not only for Python dev, but for Haskell and Common Lisp also.
I still like Gemini 2.5 Pro and o3 for brainstorming or working on difficult problems, but for routine work it (simply) makes me feel good to have everything open source/weights running on my own system.
When I bought my 32G Mac a year ago, I didn't expect to be this happy running gemma3:27b-it-qat with open-codex locally.
To be clear, I sometimes toggle open-codex to use the Gemini 2.5 Pro API also, but I enjoy running locally for simpler routine work.
./build/bin/llama-gemma3-cli -m /discs/fast/ai/gemma-3-27b-it-q4_0.gguf --mmproj /discs/fast/ai/mmproj-model-f16-27B.gguf -p "Describe this image." --image ~/Downloads/surprise.png
Note the 2nd gguf in there - I'm not sure, but I think that's for encoding the image.
Shouldn’t it fit a 5060 Ti 16GB, for instance?
...so yeah 3090
Am I missing something?
These have been out for a while; if you follow the HF link you can see, for example, the 27b quant has been downloaded from HF 64,000 times over the last 10 days.
Is there something more to this, or is just a follow up blog post?
(is it just that ollama finally has partial (no images right?) support? Or something else?)
How is this more significant now than when they were uploaded 2 weeks ago?
Are we expecting new models? I don’t understand the timing. This post feels like it’s two weeks late.
[1] - https://huggingface.co/collections/google/gemma-3-qat-67ee61...
I think this means I either have to train the -pt model with my own instruction tuning or use another provider :(
Wish they showed benchmarks / added quantized versions to the arena! :>
That said, the first graph is misleading about the number of H100s required to run DeepSeek r1 at FP16. The model is FP8.
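The FP16 versus FP8 point roughly halves the GPU count, which a weights-only calculation shows. This sketch assumes the published 671B parameter count for DeepSeek R1 and 80GB per H100, neither of which is stated in this thread, and it ignores KV cache and activation memory, so real deployments need more headroom.

```python
import math


def gpus_for_weights(n_params: float, bytes_per_param: float,
                     gpu_mem_gb: float = 80.0) -> int:
    """Smallest number of GPUs whose combined memory holds the weights."""
    weights_gb = n_params * bytes_per_param / 1e9
    return math.ceil(weights_gb / gpu_mem_gb)


if __name__ == "__main__":
    n_params = 671e9  # ASSUMED: published DeepSeek R1 parameter count
    print("FP16:", gpus_for_weights(n_params, 2.0), "GPUs")
    print("FP8: ", gpus_for_weights(n_params, 1.0), "GPUs")
```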
I can imagine this will serve to drive prices for hosted llms lower.
At this level any company that produces even a nominal amount of code should be running LLMs on-prem (AWS if you're on the cloud).
I think these consumer GPUs are way too expensive for the amount of memory they pack - and that's intentional price discrimination. Also the builds are gimmicky. It's just not set up for AI models, and the versions that are cost 20k.
AMD has that 128GB RAM strix halo chip but even with soldered ram the bandwidth there is very limited, half of M4 Max, which is half of 4090.
I think this generation of hardware and local models is not there yet - would wait for M5/M6 release.
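Why bandwidth matters so much: for single-user decoding of a dense model, every generated token has to read roughly all of the weights once, so memory bandwidth divided by model size gives a rough ceiling on tokens per second. The sketch below uses the "half, then half again" ratios from the comment above with a placeholder baseline of 1000 GB/s and a 15 GB model; both numbers are assumptions for illustration, not measurements.

```python
def decode_ceiling_tok_per_sec(bandwidth_gb_s: float, model_gb: float) -> float:
    """Rough upper bound: one full pass over the weights per generated token."""
    return bandwidth_gb_s / model_gb


if __name__ == "__main__":
    baseline = 1000.0  # ASSUMED round number for the fastest device
    model_gb = 15.0    # ASSUMED size of a 4-bit 27B model in memory
    devices = {
        "baseline GPU": baseline,
        "half the bandwidth": baseline / 2,
        "a quarter of the bandwidth": baseline / 4,
    }
    for name, bandwidth in devices.items():
        ceiling = decode_ceiling_tok_per_sec(bandwidth, model_gb)
        print(f"{name}: at most ~{ceiling:.0f} tok/s")
```

Real throughput lands below this ceiling, but the ratios carry over: a quarter of the bandwidth means roughly a quarter of the tokens per second, however much RAM is attached.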
For example, if I ask Mistral Small who I am by name, it will say there is no known notable figure by that name before the knowledge cutoff. Gemma 3 will say I am a well known <random profession> and make up facts. On the other hand, I have asked both about a local organization in my area that I am involved with, and Gemma 3 could produce useful and factual information, where Mistral Small said it did not know.
(That's a massive simplification of how any of this works, but it's how I think about it at a high level.)
It runs at around 26 tokens/sec at FP16; FP8 is not supported by the Radeon 7900 GRE.
I just love it.
For coding QwQ 32b is still king. But with a 16GB VRAM card it gives me ~3 tokens/sec, which is unusable.
I tried to make Gemma 3 write a powershell script with Terminal gui interface and it ran into dead-ends and finally gave up. QwQ 32B performed a lot better.
But for most general purposes it is great. My kid's been using it to feed his school textbooks and ask it questions. It is better than anything else currently.
Somehow it is more "uptight" than llama or the Chinese models like Qwen. Can't put my finger on it, the Chinese models seem nicer and more talkative.
Which method are you employing to feed a textbook into the model?
Here is the JSON schema: https://pastebin.com/SiEJ6LEz System prompt: https://pastebin.com/R68QkfQu
We're working with a GPU-poor organization with very strict data residency requirements, and these models might be exactly what we need.
I would normally say vLLM, but the blog post notably does not mention vLLM support.
The privacy concerns are honestly mostly imaginary at this point, too. Plenty of hosted LLM vendors will promise not to train on your data. The bigger threat is if they themselves log data and then have a security incident, but honestly the risk that your own personal machine gets stolen or hacked is a lot higher than that.
There is really no technological path towards supercomputers that fast on a human timescale, not even in 100 years.
The thing that makes LLMs useful is their ability to translate concepts from one domain to another. Overfitting on choice benchmarks, even a spread of them, will lower their usefulness in every general task by destroying information that is encoded in the weights.
Ask Gemma to write a 5-paragraph essay on any niche topic and you will get plenty of statements that have an extremely small likelihood of existing in relation to the topic, but a high likelihood of existing in related, more popular topics. ChatGPT less so, but still at least one per paragraph. I'm not talking about factual errors or common oversimplifications. I'm talking about completely unrelated statements. What you're asking about is largely outside its training data, of which a 27GB model gives you what? A few hundred gigs? Seems like a lot, but you have to remember that there is a lot of stuff that you probably don't care about that many people do. Stainless steel and Kubernetes are going to be well represented. Your favorite media? Probably not. Relatively current? Definitely not. Which sounds fine, until you realize that people who care about stainless steel and Kubernetes likely care about some much more specific aspect which isn't going to be represented, and you are back to the same problem of low usability.
This is why I believe that scale is king and that both data and compute are the big walls. Google has YouTube data but they are only using it in Gemini.
Since this article publishes the optimized Q4 quantized version, it would be great if it included more comparisons between the new version and my currently used unoptimized Q4 version (such as benchmark scores).
(I deliberately wrote this reply in Chinese and had gemma-3-27b-it Q4_K_M translate it into English.)
I tried mapping one JSON document with a lot of plain text into a new structure, and it fails every time.
Ask it to generate SVG, and it’s very simple and almost too dumb.
Nice that it doesn’t need that huge amount of RAM, and it performs OK on smaller languages from my initial tests.