Gemma 3 QAT Models: Bringing AI to Consumer GPUs (opens in new tab)

(developers.googleblog.com)

602 pointsemrah1y ago276 comments

276 comments

162 comments · 39 top-level

simonw1y ago· 33 in thread

I think gemma-3-27b-it-qat-4bit is my new favorite local model - or at least it's right up there with Mistral Small 3.1 24B.

I've been trying it on an M2 64GB via both Ollama and MLX. It's very, very good, and it only uses ~22Gb (via Ollama) or ~15GB (MLX) leaving plenty of memory for running other apps.

Some notes here: https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/

Last night I had it write me a complete plugin for my LLM tool like this:

  llm install llm-mlx
  llm mlx download-model mlx-community/gemma-3-27b-it-qat-4bit

  llm -m mlx-community/gemma-3-27b-it-qat-4bit \
    -f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
    -f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
    -s 'Write a new fragments plugin in Python that registers
    issue:org/repo/123 which fetches that issue
        number from the specified github repo and uses the same
        markdown logic as the HTML page to turn that into a
        fragment'

It gave a solid response! https://gist.github.com/simonw/feccff6ce3254556b848c27333f52... - more notes here: https://simonwillison.net/2025/Apr/20/llm-fragments-github/

rs1861y ago

Can you quote tps?

More and more I start to realize that cost saving is a small problem for local LLMs. If it is too slow, it becomes unusable, so much that you might as well use public LLM endpoints. Unless you really care about getting things done locally without sending information to another server.

With OpenAI API/ChatGPT, I get response much faster than I can read, and for simple question, it means I just need a glimpse of the response, copy & paste and get things done. Whereas on local LLM, I watch it painstakingly prints preambles that I don't care about, and get what I actually need after 20 seconds (on a fast GPU).

And I am not yet talking about context window etc.

I have been researching about how people integrate local LLMs in their workflows. My finding is that most people play with it for a short time and that's about it, and most people are much better off spending money on OpenAI credits (which can last a very long time with typical usage) than getting a beefed up Mac Studio or building a machine with 4090.

simonw1y ago

My tooling doesn't measure TPS yet. It feels snappy to me on MLX.

I agree that hosted models are usually a better option for most people - much faster, higher quality, handle longer inputs, really cheap.

I enjoy local models for research and for the occasional offline scenario.

I'm also interested in their applications for journalism, specifically for dealing with extremely sensitive data like leaked information from confidential sources.

2 more replies

overfeed1y ago

> Whereas on local LLM, I watch it painstakingly prints preambles that I don't care about, and get what I actually need after 20 seconds.

You may need to "right-size" the models you use to match your hardware, model, and TPS expectations, which may involve using a smaller version of the model with faster TPS, upgrading your jardware, or paying for hosted models.

Alternatively, if you can use agentic workflows or tools like Aider, you don't have to watch the model work slowly with large modles locally. Instead you queue work for it, go to sleep, or eat, or do other work, and then much later look over the Pull Requests whenever it completes them.

2 more replies

ein0p1y ago

Sometimes TPS doesn't matter. I've generated textual descriptions for 100K or so images in my photo archive, some of which I have absolutely no interest in uploading to someone else's computer. This works pretty well with Gemma. I use local LLMs all the time for things where privacy is even remotely important. I estimate this constitutes easily a quarter of my LLM usage.

2 more replies

trees1011y ago

Not sure how accurate my stats are. I used ollama with the --verbose flag. Using a 4090 and all default settings, I get 40TPS for Gemma 29B model

`ollama run gemma3:27b --verbose` gives me 42.5 TPS +-0.3TPS

`ollama run gemma3:27b-it-qat --verbose` gives me 41.5 TPS +-0.3TPS

Strange results; the full model gives me slightly more TPS.

1 more reply

k__1y ago

The local LLM is your project manager, the big remote ones are the engineers and designers :D

jonaustin1y ago

On a M4 Max 128GB via LM Studio:

query: "make me a snake game in python with pygame"

(mlx 4 bit quant) mlx-community/gemma-3-27b-it-qat@4bit: 26.39 tok/sec • 1681 tokens 0.63s to first token

(gguf 4 bit quant) lmstudio-community/gemma-3-27b-it-qat: 22.72 tok/sec • 1866 tokens 0.49s to first token

using Unsloth's settings: https://docs.unsloth.ai/basics/tutorial-how-to-run-and-fine-...

1 more reply

starik361y ago

On an A5000 with 24GB, this model typically gets between 20 to 25 tps.

pantulis1y ago

> Can you quote tps?

LLM Studio running on a Mac Studio M4 Max with 128GB, gemma-3-27B-it-QAT-Q4_0.gguf with a 4096 token context I get 8.89 tps.

3 more replies

a_e_k1y ago

I'm seeing ~38--42 tps on a 4090 in a fresh build of llama.cpp under Fedora 42 on my personal machine.

(-t 32 -ngl 100 -c 8192 -fa -ctk q8_0 -ctv q8_0 -m models/gemma-3-27b-it-qat-q4_0.gguf)

DJHenk1y ago

> More and more I start to realize that cost saving is a small problem for local LLMs. If it is too slow, it becomes unusable, so much that you might as well use public LLM endpoints. Unless you really care about getting things done locally without sending information to another server.

There is another aspect to consider, aside from privacy.

These models are trained by downloading every scrap of information from the internet, including the works of many, many authors who have never consented to that. And they for sure are not going to get a share of the profits, if there is every going to be any. If you use a cloud provider, you are basically saying that is all fine. You are happy to pay them, and make yourself dependent on their service, based on work that wasn't theirs to use.

However, if you use a local model, the authors still did not give consent, but one could argue that the company that made the model is at least giving back to the community. They don't get any money out of it, and you are not becoming dependent on their hyper capitalist service. No rent-seeking. The benefits of the work are free to use for everyone. This makes using AI a little more acceptable from a moral standpoint.

1 more reply

otabdeveloper41y ago

The only actually useful application of LLM's is processing large amounts of data for classification and/or summarizing purposes.

That's not the stuff you want to send to a public API, this is something you want as a 24/7 locally running batch job.

("AI assistant" is an evolutionary dead end, and Star Trek be damned.)

bobjordan1y ago

Thanks for the call out on this model! I have 42gb usable VRAM on my ancient (~10yrs old) quad-sli titan-x workstation and have been looking for a model to balance large context window with output quality. I'm able to run this model with a 56K context window and it just fits into my 42gb VRAM to run 100% GPU. The output quality is really good and 56K context window is very usable. Nice find!

paprots1y ago

The original gemma3:27b also took only 22GB using Ollama on my 64GB MacBook. I'm quite confused that the QAT took the same. Do you know why? Which model is better? `gemma3:27b`, or `gemma3:27b-qat`?

zorgmonkey1y ago

Both versions are quantized and should use the same amount of RAM, the difference with QAT is the quantization happens during training time and it should result in slightly better (closer to the bf16 weights) output

kgwgk1y ago

Look up 27b in https://ollama.com/library/gemma3/tags

You'll find the id a418f5838eaf which also corresponds to 27b-it-q4_K_M

1 more reply

superkuh1y ago

Quantization aware training just means having the model deal with quantized values a bit during training so it handles the quantization better when it is quantized after training/etc. It doesn't change the model size itself.

nolist_policy1y ago

I suspect your "original gemma3:27b" was a quantized model since the non-quantized (16bit) version needs around 54gb.

prvc1y ago

> ~15GB (MLX) leaving plenty of memory for running other apps.

Is that small enough to run well (without thrashing) on a system with only 16GiB RAM?

simonw1y ago

I expect not. On my Mac at least I've found I need a bunch of GB free to have anything else running at all.

1 more reply

tomrod1y ago

Simon, what is your local GPU setup? (No doubt you've covered this, but I'm not sure where to dig up).

simonw1y ago

MacBook Pro M2 with 64GB of RAM. That's why I tend to be limited to Ollama and MLX - stuff that requires NVIDIA doesn't work for me locally.

2 more replies

nico1y ago

Been super impressed with local models on mac. Love that the gemma models have 128k token context input size. However, outputs are usually pretty short

Any tips on generating long output? Like multiple pages of a document, a story, a play or even a book?

simonw1y ago

The tool you are using may set a default max output size without you realizing. Ollama has a num_ctx that defaults to 2048 for example: https://github.com/ollama/ollama/blob/main/docs/faq.md#how-c...

1 more reply

tootie1y ago

I'm using 12b and getting seriously verbose answers. It's squeezed into 8GB and takes its sweet time but answers are really solid.

Casteil1y ago

This is basically the opposite of what I've experienced - at least compared to another recent entry like IBM's Granite 3.3.

By comparison, Gemma3's output (both 12b and 27b) seems to typically be more long/verbose, but not problematically so.

1 more reply

littlestymaar1y ago

> and it only uses ~22Gb (via Ollama) or ~15GB (MLX)

Why is the memory use different? Are you using different context size in both set-ups?

simonw1y ago

No idea. MLX is its own thing, optimized for Apple Silicon. Ollama uses GGUFs.

https://ollama.com/library/gemma3:27b-it-qat says it's Q4_0. https://huggingface.co/mlx-community/gemma-3-27b-it-qat-4bit says it's 4bit. I think those are the same quantization?

1 more reply

Patrick_Devine1y ago

The vision tower is 7GB, so I was wondering if you were loading it without vision?

codybontecou1y ago

Can you run the mlx-variation of this model through Ollama so that I can interact with it in Open WebUI?

simonw1y ago

I haven't tried it yet but there's an MLX project that exposes an OpenAI-compatible serving endpoint that should work with Open WebUI: https://github.com/madroidmaq/mlx-omni-server

1 more reply

ygreif1y ago

Do many consumer GPUs have >20 gigabytes RAM? That sounds like a lot to me

mcintyre19941y ago

I don't think so, but Apple's unified memory architecture makes it a possibility for people with Macbook Pros.

emrahOP1y ago· 12 in thread

Available on ollama: https://ollama.com/library/gemma3

jinay1y ago

Make sure you're using the "-it-qat" suffixed models like "gemma3:27b-it-qat"

Zambyte1y ago

Here are the direct links:

https://ollama.com/library/gemma3:27b-it-qat

https://ollama.com/library/gemma3:12b-it-qat

https://ollama.com/library/gemma3:4b-it-qat

https://ollama.com/library/gemma3:1b-it-qat

ein0p1y ago

Thanks. I was wondering why my open-webui said that I already had the model. I bet a lot of people are making the same mistake I did and downloading just the old, post-quantized 27B.

Der_Einzige1y ago

How many times do I have to say this? Ollama, llamacpp, and many other projects are slower than vLLM/sglang. vLLM is a much superior inference engine and is fully supported by the only LLM frontends that matter (sillytavern).

The community getting obsessed with Ollama has done huge damage to the field, as it's ineffecient compared to vLLM. Many people can get far more tok/s than they think they could if only they knew the right tools.

Zambyte1y ago

The significant convenience benefits outweigh the higher TPS that vLLM offers in the context of my single machine homelab GPU server. If I was hosting it for something more critical than just myself and a few friends chatting with it, sure. Being able to just paste a model name into Open WebUI and run it is important to me though.

It is important to know about both to decide between the two for your use case though.

1 more reply

ach9l1y ago

instead of ranting, maybe explain how to make a qat q4 work with images in vllm, afaik it is not yet possible

oezi1y ago

Why is sillytavern the only LLM frontend which matters?

2 more replies

simonw1y ago

Last I looked vLLM didn't work on a Mac.

1 more reply

prometheon11y ago

From the HN guidelines: https://news.ycombinator.com/newsguidelines.html

> Be kind. Don't be snarky.

> Please don't post shallow dismissals, especially of other people's work.

In my opinion, your comment is not in line with the guidelines. Especially the part about sillytavern being the only LLM frontend that matters. Telling the devs of any LLM frontend except sillytavern that their app doesn't matter seems exactly like a shallow dismissal of other people's work to me.

janderson2151y ago

I did not know this, so thank you. I read a blogpost a while back that encouraged using Ollama and never mention vLLM. Do you recommend reading any particular resource?

oezi1y ago

Somebody in this thread mentioned 20.x tok/s on ollama. What are you seeing in vLLM?

1 more reply

m00dy1y ago

Ollama is definitely not for production loads but vLLm is.

holografix1y ago· 12 in thread

Could 16gb vram be enough for the 27b QAT version?

jffry1y ago

With `ollama run gemma3:27b-it-qat "What is blue"`, GPU memory usage is just a hair over 20GB, so no, probably not without a nerfed context window

woadwarrior011y ago

Indeed, the default context length in ollama is a mere 2048 tokens.

hskalin1y ago

With ollama you could offload a few layers to cpu if they don't fit in the VRAM. This will cost some performance ofcourse but it's much better than the alternative (everything on cpu)

senko1y ago

I'm doing that with a 12GB card, ollama supports it out of the box.

For some reason, it only uses around 7GB of VRAM, probably due to how the layers are scheduled, maybe I could tweak something there, but didn't bother just for testing.

Obviously, perf depends on CPU, GPU and RAM, but on my machine (3060 + i5-13500) it's around 2 t/s.

dockerd1y ago

Does it work on LM Studio? Loading 27b-it-qat taking up more than 22GB on 24GB mac.

abawany1y ago

I tried the 27b-iat model on a 4090m with 16gb vram with mostly default args via llama.cpp and it didn't fit - used up the vram and tried to use about 2gb of system ram: performance in this setup was < 5 tps.

halflings1y ago

That's what the chart says yes. 14.1GB VRAM usage for the 27B model.

erichocean1y ago

That's the VRAM required just to load the model weights.

To actually use a model, you need a context window. Realistically, you'll want a 20GB GPU or larger, depending on how many tokens you need.

1 more reply

parched991y ago

I am only able to get the Gemma-3-27b-it-qat-Q4_0.gguf (15.6GB) to run with a 100 token context size on a 5070 ti (16GB) using llamacpp.

Prompt Tokens: 10

Time: 229.089 ms

Speed: 43.7 t/s

Generation Tokens: 41

Time: 959.412 ms

Speed: 42.7 t/s

tbocek1y ago

This is probably due to this: https://github.com/ggml-org/llama.cpp/issues/12637. This GitHub issue is about interleaved sliding window attention (iSWA) not available in llama.cpp for Gemma 3. This could reduce the memory requirements a lot. They mentioned for a certain scenario, going from 62GB to 10GB.

2 more replies

idonotknowwhy1y ago

I didn't realise the 5070 is slower than the 3090. Thanks.

If you want a bit more context, try -ctv q8 -ctk q8 (from memory so look it up) to quant the kv cache.

Also an imatrix gguf like iq4xs might be smaller with better quality

1 more reply

floridianfisher1y ago

Try one of the smaller versions. 27b is too big for your gpu

1 more reply

justanotheratom1y ago· 8 in thread

Anyone packaged one of these in an iPhone App? I am sure it is doable, but I am curious what tokens/sec is possible these days. I would love to ship "private" AI Apps if we can get reasonable tokens/sec.

zamadatix1y ago

There are many such apps, e.g. Mollama, Enclave AI or PrivateLLM or dozens of others, but you could tell me it runs at 1,000,000 tokens/second on an iPhone and I wouldn't care because the largest model version you're going to be able to load is Gemma 3 4B q4 (12 B won't fit in 8 GB with the OS + you still need context) and it's just not worth the time to use.

That said, if you really care, it generates faster than reading speed (on an A18 based model at least).

woodson1y ago

Some of these small models still have their uses, e.g. for summarization. Don’t expect them to fully replace ChatGPT.

1 more reply

nolist_policy1y ago

FWIW, I can run Gemma-3-12b-it-qat on my Galaxy Fold 4 with 12Gb ram at around 1.5 tokens / s. I use plain llama.cpp with Termux.

Casteil1y ago

Does this turn your phone into a personal space heater too?

Alifatisk1y ago

If you ever ship a private AI app, don't forget to implement the export functionality, please!

idonotknowwhy1y ago

You mean conversations? Just the jsonl of the standard hf dataset format to import into other systems?

1 more reply

nico1y ago

What kind of functionality do you need from the model?

For basic conversation and RAG, you can use tinyllama or qwen-2.5-0.5b, both of which run on a raspberry pi at around 5-20 tokens per second

justanotheratom1y ago

I am looking for structured output at about 100-200 tokens/second on iPhone 14+. Any pointers?

1 more reply

porphyra1y ago· 6 in thread

It is funny that Microsoft had been peddling "AI PCs" and Apple had been peddling "made for Apple Intelligence" for a while now, when in fact usable models for consumer GPUs are only barely starting to be a thing on extremely high end GPUs like the 3090.

ivape1y ago

This is why the "AI hardware cycle is hype" crowd is so wrong. We're not even close, we're basically at ColecoVision/Atari stage of hardware here. It's going be quite a thing when everyone gets a SNES/Genesis.

icedrift1y ago

Capable local models have been usable on Macs for a while now thanks to their unified memory.

dragonwriter1y ago

AI PCs aren't about running the kind of models that take a 3090-class GPU, or even running on GPU at all, but systems where the local end is running something like Phi-3.5-vision-instruct, on system RAM using a CPU with an integrated NPU, which is why the AI PC requirements specify an NPU, a certain amount of processing capacity, and a minimum amount of DDR5/LPDDR5 system RAM.

NorwegianDude1y ago

A 3090 is not a extremely high end GPU. Is a consumer GPU launched in 2020, and even in price and compute it's around a mid-range consumer GPU these days.

The high end consumer card from Nvidia is the RTX 5090, and the professional version of the card is the RTX PRO 6000.

dragonwriter1y ago

For model usability as a binary yes/no, pretty much the only dimension that matters is VRAM, and at 24GB the 3090 is still high end for a consumer NVidia GPUs, yes, the 5090 (and only the 5090) is above it, at 32GB, but 24GB is way ahead of the mid-range.

1 more reply

zapnuk1y ago

A 3090 still costs 1800€. Thats not mid-range by a long shot

The 5070 or 5070ti are mid range. They cost 650/900€.

2 more replies

perching_aix1y ago· 6 in thread

This is my first time trying to locally host a model - gave both the 12B and 27B QAT models a shot.

I was both impressed and disappointed. Setup was piss easy, and the models are great conversationalists. I have a 12 gig card available and the 12B model ran very nice and swift.

However, they're seemingly terrible at actually assisting with stuff. Tried something very basic: asked for a powershell one liner to get the native blocksize of my disks. Ended up hallucinating fields, then telling me to go off into the deep end, first elevating to admin, then using WMI, then bringing up IOCTL. Pretty unfortunate. Not sure I'll be able to put it to actual meaningful use as a result.

parched991y ago

I think Powershell is a bad test. I've noticed all local models have trouble providing accurate responses to Powershell-related prompts. Strangely, even Microsoft's model, Phi 4, is bad at answering these questions without careful prompting. Though, MS can't even provide accurate PS docs.

My best guess is that there's not enough discussion/development related to Powershell in training data.

fragmede1y ago

Which, like, you'd think Microsoft has an entire team there who's purpose would be to generate good PowerShell for it to train on.

HachiWari81y ago

I tried the 27B QAT model and it hallucinates like crazy. When I ask it for information about some made up person, restaurant, place name, etc., it never says "I don't know about that" and instead seems eager to just make up details. The larger local models like the older Llama 3.3 70B seem better at this, but are also too big to fit on a 24GB GPU.

1 more reply

terhechte1y ago

Local models, due to their size more than big cloud models, favor popular languages rather than more niche ones. They work fantastic for JavaScript, Python, Bash but much worse at less popular things like Clojure, Nim or Haskell. Powershell is probably on the less popular side compared to Js or Bash.

If this is your main use case you can always try to fine tune a model. I maintain a small llm bench of different programming languages and the performance difference between say Python and Rust on some smaller models is up to 70%

perching_aix1y ago

How accessible and viable is model fine-tuning? I'm not in the loop at all unfortunately.

1 more reply

jayavanth1y ago

you should set a lower temperature

mark_l_watson1y ago· 5 in thread

Indeed!! I have swapped out qwen2.5 for gemma3:27b-it-qat using Ollama for routine work on my 32G memory Mac.

gemma3:27b-it-qat with open-codex, running locally, is just amazingly useful, not only for Python dev, but for Haskell and Common Lisp also.

I still like Gemini 2.5 Pro and o3 for brainstorming or working on difficult problems, but for routine work it (simply) makes me feel good to have everything open source/weights running on my own system.

Wen I bought my 32G Mac a year ago, I didn't expect to be so happy as running gemma3:27b-it-qat with open-codex locally.

nxobject1y ago

Fellow owner of a 32GB MBP here: how much memory does it use while resident - or, if swapping happens, do you see the effects in your day to day work? I’m in the awkward position of using on a daily basis a lot of virtualized bloated Windows software (mostly SAS).

mark_l_watson1y ago

I have the usual programs running on my Mac, along with open-codex: Emacs, web browser, terminals, VSCode, etc. Even with large contexts, open-codex with Ollama and Gemma 3 27B QAT does not seem to overload my system.

To be clear, I sometimes toggle open-codex to use the Gemini 3.5 Pro API also, but I enjoy running locally for simpler routine work.

pantulis1y ago

How did you manage to run open-codex against a local ollama? I keep getting 400 Errors no matter what I try with the --provider and --model options.

pantulis1y ago

Never mind, found your Leanpub book and followed the instructions and at least I have it running with qwen-2.5. I'll investigate what happens with Gemma.

Tsarp1y ago

What tps are you hitting? And did you have to change KV size?

trebligdivad1y ago· 5 in thread

It seems pretty impressive - I'm running it on my CPU (16 core AMD 3950x) and it's very very impressive at translation, and the image description is very impressive as well. I'm getting about 2.3token/s on it (compared to under 1/s on the Calme-3.2 I was previously using). It does tend to be a bit chatty unless you tell it not to be; pretty much everything it'll give you a 'breakdown' unless you tell it not to - so for traslation my prompt is 'Translate the input to English, only output the translation' to stop it giving a breakdown of the input language.

simonw1y ago

What are you using to run it? I haven't got image input working yet myself.

trebligdivad1y ago

I'm using llama.cpp - built last night from head; to do image stuff you have to run a separate client they provide, with something like:

./build/bin/llama-gemma3-cli -m /discs/fast/ai/gemma-3-27b-it-q4_0.gguf --mmproj /discs/fast/ai/mmproj-model-f16-27B.gguf -p "Describe this image." --image ~/Downloads/surprise.png

Note the 2nd gguf in there - I'm not sure, but I think that's for encoding the image.

terhechte1y ago

Image input has been working with LM Studio for quite some time

1 more reply

Havoc1y ago

The upcoming qwen3 series is supposed to be MoE...likely to give better tk/s on CPU

slekker1y ago

What's MoE?

2 more replies

wtcactus1y ago· 4 in thread

They keep mentioning the RTX 3090 (with 24 GB VRAM), but the model is only 14.1 GB.

Shouldn’t it fit a 5060 Ti 16GB, for instance?

oktoberpaard1y ago

With a 128K context length and 8 bit KV cache, the 27b model occupies 22 GiB on my system. With a smaller context length you should be able to fit it on a 16 GiB GPU.

jsnell1y ago

Memory is needed for more than just the parameters, e.g. the KV cache.

cubefox1y ago

KV = key-value

Havoc1y ago

Just checked - 19 gigs with 8k context @ q8 kv.Plus another 2.5-ish or so for OS etc.

...so yeah 3090

noodletheworld1y ago· 4 in thread

Am I missing something?

These have been out for a while; if you follow the HF link you can see, for example, the 27b quant has been downloaded from HF 64,000 times over the last 10 days.

Is there something more to this, or is just a follow up blog post?

(is it just that ollama finally has partial (no images right?) support? Or something else?)

deepsquirrelnet1y ago

QAT “quantization aware training” means they had it quantized to 4 bits during training rather than after training in full or half precision. It’s supposedly a higher quality, but unfortunately they don’t show any comparisons between QAT and post-training quantization.

noodletheworld1y ago

I understand that, but the qat models (1) are not new uploads.

How is this more significant now than when they were uploaded 2 weeks ago?

Are we expecting new models? I don’t understand the timing. This post feels like it’s two weeks late.

[1] - https://huggingface.co/collections/google/gemma-3-qat-67ee61...

2 more replies

Patrick_Devine1y ago

Ollama has had vision support for Gemma3 since it came out. The implementation is not based on llama.cpp's version.

xnx1y ago

The linked blog post was 2 days ago

Samin1001y ago· 3 in thread

I have a few private “vibe check” questions and the 4 bit QAT 27B model got them all correctly. I’m kind of shocked at the information density locked in just 13 GB of weights. If anyone at Deepmind is reading this — Gemma 3 27B is the single most impressive open source model I have ever used. Well done!

itake1y ago

I tried to use the -it models for translation, but it completely failed at translating adult content.

I think this means I either have to train the -pt model with my own instruction tuning or use another provider :(

jychang1y ago

Try mradermacher/amoral-gemma3-27B-v2-qat-GGUF

1 more reply

andhuman1y ago

Have you tried Mistral Small 24b?

1 more reply

diggan1y ago· 3 in thread

First graph is a comparison of the "Elo Score" while using "native" BF16 precision in various models, second graph is comparing VRAM usage between native BF16 precision and their QAT models, but since this method is about doing quantization while also maintaining quality, isn't the obvious graph of comparing the quality between BF16 and QAT missing? The text doesn't seem to talk about it either, yet it's basically the topic of the blog post.

croemer1y ago

Indeed, the one thing I was looking for was Elo/performance of the quantized models, not how good the base model is. Showing how much memory is saved by quantization in a figure is a bit of an insult to the intelligence of the reader.

nithril1y ago

In addition the graph "Massive VRAM Savings" graph states what looks like a tautology, reducing from 16 bits to 4 bits leads unsurprisingly to a x4 reduction in memory usage

claiir1y ago

Yea they mention a “perplexity drop” relative to naive quantization, but that’s meaningless to me. > We reduce the perplexity drop by 54% (using llama.cpp perplexity evaluation) when quantizing down to Q4_0.

Wish they showed benchmarks / added quantized versions to the arena! :>

behnamoh1y ago· 3 in thread

This is what local LLMs need—being treated like first-class citizens by the companies that make them.

That said, the first graph is misleading about the number of H100s required to run DeepSeek r1 at FP16. The model is FP8.

mmoskal1y ago

Also ~noone runs h100 at home, ie at batch size 1. What matters is throughput. With 37b active parameters and a massive deployment throughout (per gpu) should be similar to Gemma.

freeamz1y ago

so what is the real comparison against DeepSeek r1 ? Would be good to know which is actually more cost efficient and open (reproducible build) to run locally.

behnamoh1y ago

half the amount of those dots is what it takes. but also, why compare a 27B model with a +600B? that doesn't make sense.

1 more reply

9999000009991y ago· 3 in thread

Assuming this can match Claude's latest, and full time usage ( as in you have a system that's constantly running code without any user input,) you'd probably save 600 to 700 a month. A 4090 is only 2K and you'll see an ROI within 90 days.

I can imagine this will serve to drive prices for hosted llms lower.

At this level any company that produces even a nominal amount of code should be running LMS on prem( AWS if your on the cloud).

rafaelmn1y ago

I'd say using a Mac studio with M4 Max and 128 GB RAM will get you way further than 4090 in context size and model size. Cheaper than 2x4090 and less power while being a great overall machine.

I think these consumer GPUs are way too expensive for the amount of memory they pack - and that's intentional price discrimination. Also the builds are gimmicky. It's just not setup for AI models, and the versions that are cost 20k.

AMD has that 128GB RAM strix halo chip but even with soldered ram the bandwidth there is very limited, half of M4 Max, which is half of 4090.

I think this generation of hardware and local models is not there yet - would wait for M5/M6 release.

tootie1y ago

There's certainly room to grow but I'm running Gemma 12b on a 4060 (8GB VRAM) which I bought for gaming and it's a tad slow but still gives excellent results. And it certainly seems software is outpacing hardware right now. The target is making a good enough model that can run on a phone.

retinaros1y ago

two 3090 are the way to go

Alifatisk1y ago· 3 in thread

Except this being lighter than the other models, is there anything else the Gemma model is specifically good at or better than the other models at doing?

Zambyte1y ago

I have found Gemma models are able to produce useful information about more niche subjects that other models like Mistral Small cannot, at the expense of never really saying "I don't know", where other models will, and will instead produce false information.

For example, if I ask mistral small who I am by name, it will say there is no known notable figure by that name before the knowledge cutoff. Gemma 3 will say I am a well known <random profession> and make up facts. On the other hand, I have asked both about local organization in my area that I am involved with, and Gemma 3 could produce useful and factual information, where Mistral Small said it did not know.

nico1y ago

They are multimodal. Havent tried the QAT one yet. But the gemma3s released a few weeks ago are pretty good at processing images and telling you details about what’s in them

itake1y ago

Google claims to have better multi language support, due tokenizer improvements.

jarbus1y ago· 2 in thread

Very excited to see these kinds of techniques, I think getting a 30B level reasoning model usable on consumer hardware is going to be a game changer, especially if it uses less power.

apples_oranges1y ago

Deepseek does reasoning on my home Linux pc but not sure how power hungry it is

gcr1y ago

what variant? I’d considered DeepSeek far too large for any consumer GPUs

1 more reply

api1y ago· 2 in thread

When I see 32B or 70B models performing similarly to 200+B models, I don’t know what to make of this. Either the latter contains more breadth of information but we have managed to distill latent capabilities to be similar, the larger models are just less efficient, or the tests are not very good.

simonw1y ago

It makes intuitive sense to me that this would be possible, because LLMs are still mostly opaque black boxes. I expect you could drop a whole hunch of the weights without having a huge impact on quality - maybe you end up mostly ditching the parts that are derived from shitposts on Reddit but keep the bits from Arxiv for example.

(That's a massive simplification of how any of this works, but it's how I think about it at a high level.)

retinaros1y ago

its just bs benchmarks. they are all cheating at this point feeding the data in the training set. doesnt mean the llm arent becoming better but when they all lie...

btbuildem1y ago· 2 in thread

Is 27B the largest QAT Gemma 3? Given these size reductions, it would be amazing to have the 70B!

arnaudsm1y ago

The original Gemma 3 does not have a 70B version.

btbuildem1y ago

Ah thank you

rob_c1y ago· 2 in thread

Given how long between this being released and this community picking up on it... Lol

GaunterODimm1y ago

2days :/...

rob_c1y ago

Given I know people running gemma3 on local devices for over almost a month now this is either a very slow news day or evidence of finger missing the pulse... https://blog.google/technology/developers/gemma-3/

1 more reply

manjunaths1y ago· 1 in thread

I am running this on 16 GB AMD Radeon 7900 GRE with 64 GB machine with ROCm and llama.cpp on Windows 11. I can use Open-webui or the native gui for the interface. It is made available via an internal IP to all members of my home.

It runs at around 26 tokens/sec and FP16, FP8 is not supported by the Radeon 7900 GRE.

I just love it.

For coding QwQ 32b is still king. But with a 16GB VRAM card it gives me ~3 tokens/sec, which is unusable.

I tried to make Gemma 3 write a powershell script with Terminal gui interface and it ran into dead-ends and finally gave up. QwQ 32B performed a lot better.

But for most general purposes it is great. My kid's been using it to feed his school textbooks and ask it questions. It is better than anything else currently.

Somehow it is more "uptight" than llama or the chinese models like Qwen. Can't put my finger on it, the Chinese models seem nicer and more talkative.

mdp20211y ago

> My kid's been using it to feed his school textbooks and ask it questions

Which method are you employing to feed a textbook into the model?

technologesus1y ago· 1 in thread

Just for fun I created a new personal benchmark for vision-enabled LLMs: playing minecraft. I used JSON structured output in LM Studio to create basic controls for the game. Unfortunately no matter how hard I proompted, gemma-3-27b QAT is not really able to understand simple minecraft scenarios. It would say things like "I'm now looking at a stone block. I need to break it" when it is looking out at the horizon in the desert.

Here is the JSON schema: https://pastebin.com/SiEJ6LEz System prompt: https://pastebin.com/R68QkfQu

jvictor1181y ago

i've found the vision capabilities are very bad with spatial awareness/reasoning. They seem to know that certain things are in the image, but not where they are relative to each other, their relative sizes, etc.

miki1232111y ago· 1 in thread

What would be the best way to deploy this if you're maximizing for GPU utilization in a multi-user (API) scenario? Structured output support would be a big plus.

We're working with a GPU-poor organization with very strict data residency requirements, and these models might be exactly what we need.

I would normally say VLLM, but the blog post notably does not mention VLLM support.

PhilippGille1y ago

vLLM lists Gemma 3 as supported, if I'm not mistaken: https://docs.vllm.ai/en/latest/models/supported_models.html#...

briandear1y ago· 1 in thread

The normal Gemma models seem to work fine on Apple silicon with Metal. Am I missing something?

simonw1y ago

These new special editions of those models claim to work better with less memory.

gitroom1y ago· 1 in thread

nice, loving the push with local models lately - always makes me wonder though, you think privacy wins out over speed and convenience in the long run or people just stick with what's quickest?

simonw1y ago

Speed and convenience will definitely win for most people. Hosted LLMs are so cheap these days, and are massively more capable than anything you can fit on even a very beefy ($4,000+) consumer machine.

The privacy concerns are honestly mostly imaginary at this point, too. Plenty of hosted LLM vendors will promise not to train on your data. The bigger threat is if they themselves log data and then have a security incident, but honestly the risk that your own personal machine gets stolen or hacked is a lot higher than that.

mekpro1y ago

Gemma 3 is way way better than Llama 4. I think Meta will start to lose its position in LLM mindshare. Another weakness of Llama 4 is its model size that is too large (even though it can run fast with MoE), which greatly limits the applicable users to a small percentage of enthusiasts who have enough GPU VRAM. Meanwhile, Gemma 3 is widely usable across all hardware sizes.

mythz1y ago

The speed gains are real, after downloading latest QAT gemma3:27b eval perf is now 1.47x faster on ollama, up from 13.72 to 20.11 tok/s (on A4000's).

casey21y ago

I don't get the appeal. For LLMs to be useful at all you at least need to bin the the dozen exabit range per token, zettabit/s if you want something usable.

There is really no technological path towards supercomputers that fast in a human timescale and in 100 years.

The thing that makes LLMs usefull is their ability to translate concepts from one domain to the other. Overfitting on choice benchmarks, even a spread, will lower their usefullness in every general task by destorying infomation that is encoded in the weights.

Ask gemma to write a 5 paragraph essay on any niche topic and you will get plenty of statements that have an extremely small likely of existing in relation to the topic, but have a high likely of existing in related more popular topics. ChatGPT less so, but still at least one a paragraph. I'm not talking about factual errors or common oversimplifications. I'm talking about completely unrelated statements. What your asking about is largely outside it's training data of which a 27GB models gives you what? a few hundred Gigs? Seems like alot, but you have to remember that there is a lot of stuff that you probably don't care about that many people do. Stainless steel and Kubernetes are going to be well represented, your favorite media? probably not, relatively current? definitely not. Which sounds fine, until you realize that people who care about Stainless steel and Kubernetes, likely care about some much more specific aspect which isn't going to be represented and you are back to the same problem of low usability.

This is why I believe that scale is king and that both data and compute are the big walls. Google has Youtube data but they are only using it in Gemini.

umajho1y ago

I am currently using the Q4_K_M quantized version of gemma-3-27b-it locally. I previously assumed that a 27B model with image input support wouldn't be very high quality, but after actually using it, the generated responses feel better than those from my previously used DeepSeek-R1-Distill-Qwen-32B (Q4_K_M), and its recognition of images is also stronger than I expected. (I thought the model could only roughly understand the concepts in the image, but I didn't expect it to be able to recognize text within the image.)

Since this article publishes the optimized Q4 quantized version, it would be great if it included more comparisons between the new version and my currently used unoptimized Q4 version (such as benchmark scores).

(I deliberately wrote this reply in Chinese and had gemma-3-27b-it Q4_K_M translate it into English.)

Havoc1y ago

Definitely my current fav. Also interesting that for many questions the response is very similar to the gemini series. Must be sharing training datasets pretty directly.

piyh1y ago

Meta Maverick is crying in the shower getting so handily beat by a model with 15x fewer params

yuweiloopy21y ago

Been using the 27B QAT model for batch processing 50K+ internal documents. The 128K context is game-changing for our legal review pipeline. Though I wish the token generation was faster - at 20tps it's still too slow for interactive use compared to Claude Opus.

ece1y ago

On Hugging Face: https://huggingface.co/collections/google/gemma-3-qat-67ee61...

punnerud1y ago

Just tested the 27B, and it’s not very good at following instructions and is very limited on more complex code problems.

Mapping from one JSON with a lot of plain text, into a new structure and it fails every time.

Ask it to generate SVG, and it’s very simple and almost too dumb.

Nice that it doesn’t need that huge amount of RAM, and perform ok on smaller languages from my initial tests.

CyberShadow1y ago

How does it compare to CodeGemma for programming tasks?

gigel821y ago

FWIW, the 27b Q4_K_M takes about 23Gb of VRAM with 4k context and 29Gb with 16k context and runs at ~61t/s on my 5090.

XCSme1y ago

So how does 27b-it-qat (18GB) compare to 27b-it-q4_K_M (17GB)?

mattfrommars1y ago

anyone had success using Gemma 3 QAT models on Ollama with cline? They just don't work as good compared Gemini 2.0 flash provided by API

anshumankmr1y ago

my trusty RTX 3060 is gonna have its day in the sun... though I have run a bunch of 7B models fairly easily on Ollama.

cheriot1y ago

Is there already a Helium for GPUs?

j / k navigate · click thread line to collapse

276 comments

162 comments · 39 top-level

simonw1y ago· 33 in thread

I think gemma-3-27b-it-qat-4bit is my new favorite local model - or at least it's right up there with Mistral Small 3.1 24B.

I've been trying it on an M2 64GB via both Ollama and MLX. It's very, very good, and it only uses ~22Gb (via Ollama) or ~15GB (MLX) leaving plenty of memory for running other apps.

Some notes here: https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/

Last night I had it write me a complete plugin for my LLM tool like this:

  llm install llm-mlx
  llm mlx download-model mlx-community/gemma-3-27b-it-qat-4bit

  llm -m mlx-community/gemma-3-27b-it-qat-4bit \
    -f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
    -f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
    -s 'Write a new fragments plugin in Python that registers
    issue:org/repo/123 which fetches that issue
        number from the specified github repo and uses the same
        markdown logic as the HTML page to turn that into a
        fragment'

It gave a solid response! https://gist.github.com/simonw/feccff6ce3254556b848c27333f52... - more notes here: https://simonwillison.net/2025/Apr/20/llm-fragments-github/

rs1861y ago

Can you quote tps?

And I am not yet talking about context window etc.

simonw1y ago

My tooling doesn't measure TPS yet. It feels snappy to me on MLX.

I agree that hosted models are usually a better option for most people - much faster, higher quality, handle longer inputs, really cheap.

I enjoy local models for research and for the occasional offline scenario.

I'm also interested in their applications for journalism, specifically for dealing with extremely sensitive data like leaked information from confidential sources.

2 more replies

overfeed1y ago

> Whereas on local LLM, I watch it painstakingly prints preambles that I don't care about, and get what I actually need after 20 seconds.

2 more replies

ein0p1y ago

2 more replies

trees1011y ago

Not sure how accurate my stats are. I used ollama with the --verbose flag. Using a 4090 and all default settings, I get 40TPS for Gemma 29B model

`ollama run gemma3:27b --verbose` gives me 42.5 TPS +-0.3TPS

`ollama run gemma3:27b-it-qat --verbose` gives me 41.5 TPS +-0.3TPS

Strange results; the full model gives me slightly more TPS.

1 more reply

k__1y ago

The local LLM is your project manager, the big remote ones are the engineers and designers :D

jonaustin1y ago

On a M4 Max 128GB via LM Studio:

query: "make me a snake game in python with pygame"

(mlx 4 bit quant) mlx-community/gemma-3-27b-it-qat@4bit: 26.39 tok/sec • 1681 tokens 0.63s to first token

(gguf 4 bit quant) lmstudio-community/gemma-3-27b-it-qat: 22.72 tok/sec • 1866 tokens 0.49s to first token

using Unsloth's settings: https://docs.unsloth.ai/basics/tutorial-how-to-run-and-fine-...

1 more reply

starik361y ago

On an A5000 with 24GB, this model typically gets between 20 to 25 tps.

pantulis1y ago

> Can you quote tps?

LLM Studio running on a Mac Studio M4 Max with 128GB, gemma-3-27B-it-QAT-Q4_0.gguf with a 4096 token context I get 8.89 tps.

3 more replies

a_e_k1y ago

I'm seeing ~38--42 tps on a 4090 in a fresh build of llama.cpp under Fedora 42 on my personal machine.

(-t 32 -ngl 100 -c 8192 -fa -ctk q8_0 -ctv q8_0 -m models/gemma-3-27b-it-qat-q4_0.gguf)

DJHenk1y ago

There is another aspect to consider, aside from privacy.

1 more reply

otabdeveloper41y ago

The only actually useful application of LLM's is processing large amounts of data for classification and/or summarizing purposes.

That's not the stuff you want to send to a public API, this is something you want as a 24/7 locally running batch job.

("AI assistant" is an evolutionary dead end, and Star Trek be damned.)

bobjordan1y ago

paprots1y ago

The original gemma3:27b also took only 22GB using Ollama on my 64GB MacBook. I'm quite confused that the QAT took the same. Do you know why? Which model is better? `gemma3:27b`, or `gemma3:27b-qat`?

zorgmonkey1y ago

kgwgk1y ago

Look up 27b in https://ollama.com/library/gemma3/tags

You'll find the id a418f5838eaf which also corresponds to 27b-it-q4_K_M

1 more reply

superkuh1y ago

nolist_policy1y ago

I suspect your "original gemma3:27b" was a quantized model since the non-quantized (16bit) version needs around 54gb.

prvc1y ago

> ~15GB (MLX) leaving plenty of memory for running other apps.

Is that small enough to run well (without thrashing) on a system with only 16GiB RAM?

simonw1y ago

I expect not. On my Mac at least I've found I need a bunch of GB free to have anything else running at all.

1 more reply

tomrod1y ago

Simon, what is your local GPU setup? (No doubt you've covered this, but I'm not sure where to dig up).

simonw1y ago

MacBook Pro M2 with 64GB of RAM. That's why I tend to be limited to Ollama and MLX - stuff that requires NVIDIA doesn't work for me locally.

2 more replies

nico1y ago

Been super impressed with local models on mac. Love that the gemma models have 128k token context input size. However, outputs are usually pretty short

Any tips on generating long output? Like multiple pages of a document, a story, a play or even a book?

simonw1y ago

The tool you are using may set a default max output size without you realizing. Ollama has a num_ctx that defaults to 2048 for example: https://github.com/ollama/ollama/blob/main/docs/faq.md#how-c...

1 more reply

tootie1y ago

I'm using 12b and getting seriously verbose answers. It's squeezed into 8GB and takes its sweet time but answers are really solid.

Casteil1y ago

This is basically the opposite of what I've experienced - at least compared to another recent entry like IBM's Granite 3.3.

By comparison, Gemma3's output (both 12b and 27b) seems to typically be more long/verbose, but not problematically so.

1 more reply

littlestymaar1y ago

> and it only uses ~22Gb (via Ollama) or ~15GB (MLX)

Why is the memory use different? Are you using different context size in both set-ups?

simonw1y ago

No idea. MLX is its own thing, optimized for Apple Silicon. Ollama uses GGUFs.

https://ollama.com/library/gemma3:27b-it-qat says it's Q4_0. https://huggingface.co/mlx-community/gemma-3-27b-it-qat-4bit says it's 4bit. I think those are the same quantization?

1 more reply

Patrick_Devine1y ago

The vision tower is 7GB, so I was wondering if you were loading it without vision?

codybontecou1y ago

Can you run the mlx-variation of this model through Ollama so that I can interact with it in Open WebUI?

simonw1y ago

I haven't tried it yet but there's an MLX project that exposes an OpenAI-compatible serving endpoint that should work with Open WebUI: https://github.com/madroidmaq/mlx-omni-server

1 more reply

ygreif1y ago

Do many consumer GPUs have >20 gigabytes RAM? That sounds like a lot to me

mcintyre19941y ago

I don't think so, but Apple's unified memory architecture makes it a possibility for people with Macbook Pros.

emrahOP1y ago· 12 in thread

Available on ollama: https://ollama.com/library/gemma3

jinay1y ago

Make sure you're using the "-it-qat" suffixed models like "gemma3:27b-it-qat"

Zambyte1y ago

Here are the direct links:

https://ollama.com/library/gemma3:27b-it-qat

https://ollama.com/library/gemma3:12b-it-qat

https://ollama.com/library/gemma3:4b-it-qat

https://ollama.com/library/gemma3:1b-it-qat

ein0p1y ago

Thanks. I was wondering why my open-webui said that I already had the model. I bet a lot of people are making the same mistake I did and downloading just the old, post-quantized 27B.

Der_Einzige1y ago

Zambyte1y ago

It is important to know about both to decide between the two for your use case though.

1 more reply

ach9l1y ago

instead of ranting, maybe explain how to make a qat q4 work with images in vllm, afaik it is not yet possible

oezi1y ago

Why is sillytavern the only LLM frontend which matters?

2 more replies

simonw1y ago

Last I looked vLLM didn't work on a Mac.

1 more reply

prometheon11y ago

From the HN guidelines: https://news.ycombinator.com/newsguidelines.html

> Be kind. Don't be snarky.

> Please don't post shallow dismissals, especially of other people's work.

janderson2151y ago

I did not know this, so thank you. I read a blogpost a while back that encouraged using Ollama and never mention vLLM. Do you recommend reading any particular resource?

oezi1y ago

Somebody in this thread mentioned 20.x tok/s on ollama. What are you seeing in vLLM?

1 more reply

m00dy1y ago

Ollama is definitely not for production loads but vLLm is.

holografix1y ago· 12 in thread

Could 16gb vram be enough for the 27b QAT version?

jffry1y ago

With `ollama run gemma3:27b-it-qat "What is blue"`, GPU memory usage is just a hair over 20GB, so no, probably not without a nerfed context window

woadwarrior011y ago

Indeed, the default context length in ollama is a mere 2048 tokens.

hskalin1y ago

With ollama you could offload a few layers to cpu if they don't fit in the VRAM. This will cost some performance ofcourse but it's much better than the alternative (everything on cpu)

senko1y ago

I'm doing that with a 12GB card, ollama supports it out of the box.

For some reason, it only uses around 7GB of VRAM, probably due to how the layers are scheduled, maybe I could tweak something there, but didn't bother just for testing.

Obviously, perf depends on CPU, GPU and RAM, but on my machine (3060 + i5-13500) it's around 2 t/s.

dockerd1y ago

Does it work on LM Studio? Loading 27b-it-qat taking up more than 22GB on 24GB mac.

abawany1y ago

halflings1y ago

That's what the chart says yes. 14.1GB VRAM usage for the 27B model.

erichocean1y ago

That's the VRAM required just to load the model weights.

To actually use a model, you need a context window. Realistically, you'll want a 20GB GPU or larger, depending on how many tokens you need.

1 more reply

parched991y ago

I am only able to get the Gemma-3-27b-it-qat-Q4_0.gguf (15.6GB) to run with a 100 token context size on a 5070 ti (16GB) using llamacpp.

Prompt Tokens: 10

Time: 229.089 ms

Speed: 43.7 t/s

Generation Tokens: 41

Time: 959.412 ms

Speed: 42.7 t/s

tbocek1y ago

2 more replies

idonotknowwhy1y ago

I didn't realise the 5070 is slower than the 3090. Thanks.

If you want a bit more context, try -ctv q8 -ctk q8 (from memory so look it up) to quant the kv cache.

Also an imatrix gguf like iq4xs might be smaller with better quality

1 more reply

floridianfisher1y ago

Try one of the smaller versions. 27b is too big for your gpu

1 more reply

justanotheratom1y ago· 8 in thread

zamadatix1y ago

That said, if you really care, it generates faster than reading speed (on an A18 based model at least).

woodson1y ago

Some of these small models still have their uses, e.g. for summarization. Don’t expect them to fully replace ChatGPT.

1 more reply

nolist_policy1y ago

FWIW, I can run Gemma-3-12b-it-qat on my Galaxy Fold 4 with 12Gb ram at around 1.5 tokens / s. I use plain llama.cpp with Termux.

Casteil1y ago

Does this turn your phone into a personal space heater too?

Alifatisk1y ago

If you ever ship a private AI app, don't forget to implement the export functionality, please!

idonotknowwhy1y ago

You mean conversations? Just the jsonl of the standard hf dataset format to import into other systems?

1 more reply

nico1y ago

What kind of functionality do you need from the model?

For basic conversation and RAG, you can use tinyllama or qwen-2.5-0.5b, both of which run on a raspberry pi at around 5-20 tokens per second

justanotheratom1y ago

I am looking for structured output at about 100-200 tokens/second on iPhone 14+. Any pointers?

1 more reply

porphyra1y ago· 6 in thread

ivape1y ago

icedrift1y ago

Capable local models have been usable on Macs for a while now thanks to their unified memory.

dragonwriter1y ago

NorwegianDude1y ago

A 3090 is not a extremely high end GPU. Is a consumer GPU launched in 2020, and even in price and compute it's around a mid-range consumer GPU these days.

The high end consumer card from Nvidia is the RTX 5090, and the professional version of the card is the RTX PRO 6000.

dragonwriter1y ago

1 more reply

zapnuk1y ago

A 3090 still costs 1800€. Thats not mid-range by a long shot

The 5070 or 5070ti are mid range. They cost 650/900€.

2 more replies

perching_aix1y ago· 6 in thread

This is my first time trying to locally host a model - gave both the 12B and 27B QAT models a shot.

I was both impressed and disappointed. Setup was piss easy, and the models are great conversationalists. I have a 12 gig card available and the 12B model ran very nice and swift.

parched991y ago

My best guess is that there's not enough discussion/development related to Powershell in training data.

fragmede1y ago

Which, like, you'd think Microsoft has an entire team there who's purpose would be to generate good PowerShell for it to train on.

HachiWari81y ago

1 more reply

terhechte1y ago

perching_aix1y ago

How accessible and viable is model fine-tuning? I'm not in the loop at all unfortunately.

1 more reply

jayavanth1y ago

you should set a lower temperature

mark_l_watson1y ago· 5 in thread

Indeed!! I have swapped out qwen2.5 for gemma3:27b-it-qat using Ollama for routine work on my 32G memory Mac.

gemma3:27b-it-qat with open-codex, running locally, is just amazingly useful, not only for Python dev, but for Haskell and Common Lisp also.

Wen I bought my 32G Mac a year ago, I didn't expect to be so happy as running gemma3:27b-it-qat with open-codex locally.

nxobject1y ago

mark_l_watson1y ago

To be clear, I sometimes toggle open-codex to use the Gemini 3.5 Pro API also, but I enjoy running locally for simpler routine work.

pantulis1y ago

How did you manage to run open-codex against a local ollama? I keep getting 400 Errors no matter what I try with the --provider and --model options.

pantulis1y ago

Never mind, found your Leanpub book and followed the instructions and at least I have it running with qwen-2.5. I'll investigate what happens with Gemma.

Tsarp1y ago

What tps are you hitting? And did you have to change KV size?

trebligdivad1y ago· 5 in thread

simonw1y ago

What are you using to run it? I haven't got image input working yet myself.

trebligdivad1y ago

I'm using llama.cpp - built last night from head; to do image stuff you have to run a separate client they provide, with something like:

./build/bin/llama-gemma3-cli -m /discs/fast/ai/gemma-3-27b-it-q4_0.gguf --mmproj /discs/fast/ai/mmproj-model-f16-27B.gguf -p "Describe this image." --image ~/Downloads/surprise.png

Note the 2nd gguf in there - I'm not sure, but I think that's for encoding the image.

terhechte1y ago

Image input has been working with LM Studio for quite some time

1 more reply

Havoc1y ago

The upcoming qwen3 series is supposed to be MoE...likely to give better tk/s on CPU

slekker1y ago

What's MoE?

2 more replies

wtcactus1y ago· 4 in thread

They keep mentioning the RTX 3090 (with 24 GB VRAM), but the model is only 14.1 GB.

Shouldn’t it fit a 5060 Ti 16GB, for instance?

oktoberpaard1y ago

With a 128K context length and 8 bit KV cache, the 27b model occupies 22 GiB on my system. With a smaller context length you should be able to fit it on a 16 GiB GPU.

jsnell1y ago

Memory is needed for more than just the parameters, e.g. the KV cache.

cubefox1y ago

KV = key-value

Havoc1y ago

Just checked - 19 gigs with 8k context @ q8 kv.Plus another 2.5-ish or so for OS etc.

...so yeah 3090

noodletheworld1y ago· 4 in thread

Am I missing something?

These have been out for a while; if you follow the HF link you can see, for example, the 27b quant has been downloaded from HF 64,000 times over the last 10 days.

Is there something more to this, or is just a follow up blog post?

(is it just that ollama finally has partial (no images right?) support? Or something else?)

deepsquirrelnet1y ago

noodletheworld1y ago

I understand that, but the qat models (1) are not new uploads.

How is this more significant now than when they were uploaded 2 weeks ago?

Are we expecting new models? I don’t understand the timing. This post feels like it’s two weeks late.

[1] - https://huggingface.co/collections/google/gemma-3-qat-67ee61...

2 more replies

Patrick_Devine1y ago

Ollama has had vision support for Gemma3 since it came out. The implementation is not based on llama.cpp's version.

xnx1y ago

The linked blog post was 2 days ago

Samin1001y ago· 3 in thread

itake1y ago

I tried to use the -it models for translation, but it completely failed at translating adult content.

I think this means I either have to train the -pt model with my own instruction tuning or use another provider :(

jychang1y ago

Try mradermacher/amoral-gemma3-27B-v2-qat-GGUF

1 more reply

andhuman1y ago

Have you tried Mistral Small 24b?

1 more reply

diggan1y ago· 3 in thread

croemer1y ago

nithril1y ago

In addition the graph "Massive VRAM Savings" graph states what looks like a tautology, reducing from 16 bits to 4 bits leads unsurprisingly to a x4 reduction in memory usage

claiir1y ago

Wish they showed benchmarks / added quantized versions to the arena! :>

behnamoh1y ago· 3 in thread

This is what local LLMs need—being treated like first-class citizens by the companies that make them.

That said, the first graph is misleading about the number of H100s required to run DeepSeek r1 at FP16. The model is FP8.

mmoskal1y ago

Also ~noone runs h100 at home, ie at batch size 1. What matters is throughput. With 37b active parameters and a massive deployment throughout (per gpu) should be similar to Gemma.

freeamz1y ago

so what is the real comparison against DeepSeek r1 ? Would be good to know which is actually more cost efficient and open (reproducible build) to run locally.

behnamoh1y ago

half the amount of those dots is what it takes. but also, why compare a 27B model with a +600B? that doesn't make sense.

1 more reply

9999000009991y ago· 3 in thread

I can imagine this will serve to drive prices for hosted llms lower.

At this level any company that produces even a nominal amount of code should be running LMS on prem( AWS if your on the cloud).

rafaelmn1y ago

I'd say using a Mac studio with M4 Max and 128 GB RAM will get you way further than 4090 in context size and model size. Cheaper than 2x4090 and less power while being a great overall machine.

AMD has that 128GB RAM strix halo chip but even with soldered ram the bandwidth there is very limited, half of M4 Max, which is half of 4090.

I think this generation of hardware and local models is not there yet - would wait for M5/M6 release.

tootie1y ago

retinaros1y ago

two 3090 are the way to go

Alifatisk1y ago· 3 in thread

Except this being lighter than the other models, is there anything else the Gemma model is specifically good at or better than the other models at doing?

Zambyte1y ago

nico1y ago

They are multimodal. Havent tried the QAT one yet. But the gemma3s released a few weeks ago are pretty good at processing images and telling you details about what’s in them

itake1y ago

Google claims to have better multi language support, due tokenizer improvements.

jarbus1y ago· 2 in thread

Very excited to see these kinds of techniques, I think getting a 30B level reasoning model usable on consumer hardware is going to be a game changer, especially if it uses less power.

apples_oranges1y ago

Deepseek does reasoning on my home Linux pc but not sure how power hungry it is

gcr1y ago

what variant? I’d considered DeepSeek far too large for any consumer GPUs

1 more reply

api1y ago· 2 in thread

simonw1y ago

(That's a massive simplification of how any of this works, but it's how I think about it at a high level.)

retinaros1y ago

its just bs benchmarks. they are all cheating at this point feeding the data in the training set. doesnt mean the llm arent becoming better but when they all lie...

btbuildem1y ago· 2 in thread

Is 27B the largest QAT Gemma 3? Given these size reductions, it would be amazing to have the 70B!

arnaudsm1y ago

The original Gemma 3 does not have a 70B version.

btbuildem1y ago

Ah thank you

rob_c1y ago· 2 in thread

Given how long between this being released and this community picking up on it... Lol

GaunterODimm1y ago

2days :/...

rob_c1y ago

1 more reply

manjunaths1y ago· 1 in thread

It runs at around 26 tokens/sec and FP16, FP8 is not supported by the Radeon 7900 GRE.

I just love it.

For coding QwQ 32b is still king. But with a 16GB VRAM card it gives me ~3 tokens/sec, which is unusable.

I tried to make Gemma 3 write a powershell script with Terminal gui interface and it ran into dead-ends and finally gave up. QwQ 32B performed a lot better.

But for most general purposes it is great. My kid's been using it to feed his school textbooks and ask it questions. It is better than anything else currently.

Somehow it is more "uptight" than llama or the chinese models like Qwen. Can't put my finger on it, the Chinese models seem nicer and more talkative.

mdp20211y ago

> My kid's been using it to feed his school textbooks and ask it questions

Which method are you employing to feed a textbook into the model?

technologesus1y ago· 1 in thread

Here is the JSON schema: https://pastebin.com/SiEJ6LEz System prompt: https://pastebin.com/R68QkfQu

jvictor1181y ago

miki1232111y ago· 1 in thread

What would be the best way to deploy this if you're maximizing for GPU utilization in a multi-user (API) scenario? Structured output support would be a big plus.

We're working with a GPU-poor organization with very strict data residency requirements, and these models might be exactly what we need.

I would normally say VLLM, but the blog post notably does not mention VLLM support.

PhilippGille1y ago

vLLM lists Gemma 3 as supported, if I'm not mistaken: https://docs.vllm.ai/en/latest/models/supported_models.html#...

briandear1y ago· 1 in thread

The normal Gemma models seem to work fine on Apple silicon with Metal. Am I missing something?

simonw1y ago

These new special editions of those models claim to work better with less memory.

gitroom1y ago· 1 in thread

nice, loving the push with local models lately - always makes me wonder though, you think privacy wins out over speed and convenience in the long run or people just stick with what's quickest?

simonw1y ago

mekpro1y ago

mythz1y ago

The speed gains are real, after downloading latest QAT gemma3:27b eval perf is now 1.47x faster on ollama, up from 13.72 to 20.11 tok/s (on A4000's).

casey21y ago

I don't get the appeal. For LLMs to be useful at all you at least need to bin the the dozen exabit range per token, zettabit/s if you want something usable.

There is really no technological path towards supercomputers that fast in a human timescale and in 100 years.

This is why I believe that scale is king and that both data and compute are the big walls. Google has Youtube data but they are only using it in Gemini.

umajho1y ago

(I deliberately wrote this reply in Chinese and had gemma-3-27b-it Q4_K_M translate it into English.)

Havoc1y ago

Definitely my current fav. Also interesting that for many questions the response is very similar to the gemini series. Must be sharing training datasets pretty directly.

piyh1y ago

Meta Maverick is crying in the shower getting so handily beat by a model with 15x fewer params

yuweiloopy21y ago

ece1y ago

On Hugging Face: https://huggingface.co/collections/google/gemma-3-qat-67ee61...

punnerud1y ago

Just tested the 27B, and it’s not very good at following instructions and is very limited on more complex code problems.

Mapping from one JSON with a lot of plain text, into a new structure and it fails every time.

Ask it to generate SVG, and it’s very simple and almost too dumb.

Nice that it doesn’t need that huge amount of RAM, and perform ok on smaller languages from my initial tests.

CyberShadow1y ago

How does it compare to CodeGemma for programming tasks?

gigel821y ago

FWIW, the 27b Q4_K_M takes about 23Gb of VRAM with 4k context and 29Gb with 16k context and runs at ~61t/s on my 5090.

XCSme1y ago

So how does 27b-it-qat (18GB) compare to 27b-it-q4_K_M (17GB)?

mattfrommars1y ago

anyone had success using Gemma 3 QAT models on Ollama with cline? They just don't work as good compared Gemini 2.0 flash provided by API

anshumankmr1y ago

my trusty RTX 3060 is gonna have its day in the sun... though I have run a bunch of 7B models fairly easily on Ollama.

cheriot1y ago

Is there already a Helium for GPUs?

j / k navigate · click thread line to collapse