Running Gemma 4 locally with LM Studio's new headless CLI and Claude Code (opens in new tab)

(ai.georgeliu.com)

407 pointsvbtechguy2mo ago103 comments

103 comments

63 comments · 23 top-level

Someone12342mo ago· 8 in thread

Using Claude Code seems like a popular frontend currently, I wonder how long until Anthropic releases an update to make it a little to a lot less turn-key? They've been very clear that they aren't exactly champions of this stuff being used outside of very specific ways.

nerdix2mo ago

I don't think there is any incentive to do so right now because the open models aren't as good. The vast majority of businesses are going to just pay the extra cost for access to a frontier model. The model is what gives them a competitive advantage, not the harness. The harness is a lot easier to replicate than Opus.

There are benefits too. Some developers might learn to use Claude Code outside of work with cheaper models and then advocate for using Claude Code at work (where their companies will just buy access from Anthropic, Bedrock, etc). Similar to how free ESXi licenses for personal use helped infrastructure folks gain skills with that product which created a healthy supply of labor and VMware evangelists that were eager to spread the gospel. Anthropic can't just give away access to Claude models because of cost so there is use in allowing alternative ways for developers to learn how to use Claude Code and develop a workflow with it.

deskamess2mo ago

Are the Claude Code (desktop) models very different from what Bedrock has? I thought you could hook up VSCode (not Claude Desktop) to Bedrock Anthropic models. Are there features in Claude Desktop that are not in VSCode/cli?

chvid2mo ago

Is it not about the same as using OpenCode?

And is running a local model with Claude Code actually usable for any practical work compared to the hosted Anthropic models?

falcor842mo ago

Well, if they did, it would probably be shooting themselves in the foot, seeing that the Claude Code source is out there now, and people are waiting for an excuse to "clean-room" reimplement and fork it

alfiedotwtf2mo ago

Yet Codex specifically aims out to be compatible with all backends! Up until Gemma 4 though it’s been pretty solid, but totally fails with unknown tool (I’m guessing a template issue)

wyre2mo ago

I think CC is popular because they are catering to the common denominator programmer and are going to continue to do that, not because CC is particularly turn-key.

moomin2mo ago

Right now it suits them down to the ground. You pay for the product and you don’t cost their servers anything.

phainopepla22mo ago

You don't pay anything to use Claude Code as a front end to non-Anthropic models

1 more reply

trvz2mo ago· 5 in thread

  ollama launch claude --model gemma4:26b

datadrivenangel2mo ago

It's amazing how simple this is, and it just works if you have ollama and claude installed!

gcampos2mo ago

You need to increase the context window size or the tool calling feature wont work

mil222mo ago

For those wondering how to do this:

  OLLAMA_CONTEXT_LENGTH=64000 ollama serve

or if you're using the app, open the Ollama app's Settings dialog and adjust there.

Codex also works:

  ollama launch codex --model gemma4:26b

pshirshov2mo ago

For some reason, that doesn't work for me, claude never returns from some ill loop. Nemotron, glm and qwen 3.5 work just fine, gemma - doesn't.

trvz2mo ago

Since that defaults to the q4 variant, try the q8 one:

  ollama launch claude --model gemma4:26b-a4b-it-q8_0

1 more reply

martinald2mo ago· 5 in thread

Just FYI, MoE doesn't really save (V)RAM. You still need all weights loaded in memory, it just means you consult less per forward pass. So it improves tok/s but not vram usage.

IceWreck2mo ago

It does if you use an inference engine where you can offload some of the experts from VRAM to CPU RAM. That means I can fit a 35 billion param MoE in let's say 12 GB VRAM GPU + 16 gigs of memory.

Yukonv2mo ago

With that you are taking a significant performance penalty and become severely I/O bottlenecked. I've been able to stream Qwen3.5-397B-A17B from my M5 Max (12 GB/s SSD Read) using the Flash MoE technique at the brisk pace of 10 tokens per second. As tokens are generated different experts need to be consulted resulting in a lot of I/O churn. So while feasible it's only great for batch jobs not interactive usage.

2 more replies

functional_dev2mo ago

This confused me at first as well.. inactive experts skip compute, but weights are sill loaded. So memory does not shrink at all.

I found this visualisation helpful - https://vectree.io/c/sparse-activation-patterns-and-memory-e...

charcircuit2mo ago

You never need to have all weights in memory. You can swap them in from RAM, disk, the network, etc. MOE reduces the amount of data that will need to be swapped in for the next forward pass.

martinald2mo ago

Yes you're right technically, but in reality you'd be swapping them the (vast?) majority in and out per inference request so would create an enormous bottleneck for the use case the author is using for.

2 more replies

NamlchakKhandro2mo ago· 5 in thread

I don't know why people bother with Claude code.

It's so jank, there are far superior cli coding harness out there

loveparade2mo ago

What do you recommend? I've tried both pi and opencode and both are better than claude imo, but I wonder if there are others.

tarruda2mo ago

Codex is the best out-of-box experience, especially due to its builtin sandboxing. Only drawback is that its edit tool requires the LLM to output a diff which only GPTs are trained to do correctly.

2 more replies

dimgl2mo ago

Vagueposting in Hacker News?

z0mghii2mo ago

Can you elaborate what is jank about it?

threethirtytwo2mo ago

it has visual artifacts when inferencing.

d4rkp4ttern2mo ago· 3 in thread

You can use llama.cpp server directly to serve local LLMs and use them in Claude Code or other CLI agents. I’ve collected full setup instructions for Gemma4 and other recent open-weight LLMs here, tested on my M1 Max 64 GB MacBook:

https://pchalasani.github.io/claude-code-tools/integrations/...

The 26BA4B is the most interesting to run on such hardware, and I get nearly double the token-gen speed (40 tok/s) compared to Qwen3.5 35BA3B. However the tau2 bench results[1] for this Gemma4 variant lag far behind the Qwen variant (68% vs 81%), so I don’t expect the former to do well on heavy agentic tool-heavy tasks:

[1] https://news.ycombinator.com/item?id=47616761

peder2mo ago

Did you have any Anthropic vs OpenAI specification issues with Claude Code? I have been using mlx_vlm and vMLX and I get 400 Bad Request errors from Claude Code. Presumably you're not seeing those issues with llama-server ?

d4rkp4ttern2mo ago

Correct, no issues because since at least a few months, llama.cpp/server exposes an Anthropic messages API at v1/messages, in addition to the OpenAI-compatible API at v1/chat/completions. Claude Code uses the former.

selectodude2mo ago

I’ve jumped over to oMLX. A ton of rough edges but I think it’s the future.

4 more replies

jonplackett2mo ago· 3 in thread

So wait what is the interaction between Gemma and Claude?

unsnap_biceps2mo ago

lm studio offers an Anthropic compatible local endpoint, so you can point Claude code at it and it'll use your local model for it's requests, however, I've had a lot of problems with LM Studio and Claude code losing it's place. It'll think for awhile, come up with a plan, start to do it and then just halt in the middle. I'll ask it to continue and it'll do a small change and get stuck again.

Using ollama's api doesn't have the same issue, so I've stuck to using ollama for local development work.

keerthiko2mo ago

Claude Code is fairly notoriously token inefficient as far as coding agent/harnesses go (i come from aider pre-CC). It's only viable because the Max subscriptions give you approximately unlimited token budget, which resets in a few hours even if you hit the limit. But this also only works because cloud models have massive token windows (1M tokens on opus right now) which is a bit difficult to make happen locally with the VRAM needed.

And if you somehow managed to open up a big enough VRAM playground, the open weights models are not quite as good at wrangling such large context windows (even opus is hardly capable) without basically getting confused about what they were doing before they finish parsing it.

4 more replies

mbesto2mo ago

I don't get why I would use Claude Code when OpenCode, Cursor, Zed, etc. all exist, are "free" and work with virtually any llm. Seems like a weird use case unless I'm missing something.

4 more replies

asymmetric2mo ago· 3 in thread

Is a framework desktop with >48GB of RAM a good machine to try this out?

pshirshov2mo ago

Only for chat sessions, not for agentic coding. It's just too slow to be practical (10 minutes to answer a simple question about a 2k LoC project - and that's with a 5070 addon card).

ac292mo ago

This article is about a MoE model with only 4B active parameters, it shouldn't take 10 minutes to answer a question about a small project.

I measured a 4bit quant of this model at 1300t/s prefill and ~60t/s decode on Ryzen 395+.

nl2mo ago

Doesn't the framework desktop have a Ryzen 395 AI? That's a unified memory architecture like the Macs.

2 more replies

seifbenayed19922mo ago· 2 in thread

Local models are finally starting to feel pleasant instead of just "possible." The headless LM Studio flow is especially nice because it makes local inference usable from real tools instead of as a demo.

Related note from someone building in this space: I've been working on cloclo (https://www.npmjs.com/package/cloclo), an open-source coding agent CLI, and this is exactly the direction I'm excited about. It natively supports LM Studio, Ollama, vLLM, Jan, and llama.cpp as providers alongside cloud models, so you can swap between local and hosted backends without changing how you work.

Feels like we're getting closer to a good default setup where local models are private/cheap enough to use daily, and cloud models are still there when you need the extra capability.

SeriousM2mo ago

How does cloclo differ from pi-mono?

seifbenayed19922mo ago

pi-mono is a great toolkit — coding agent CLI, unified LLM API, web UI, Slack bot, vLLM pods.

cloclo is a runtime for agent toolkits. You plug it into your own agents and it gives them multi-agent orchestration (AICL protocol), 13 providers, skill registry, native browser/docs/phone tools, memory, and an NDJSON bridge. Zero native deps.

drob5182mo ago· 2 in thread

Seems like this might be a great way to do web software testing. We’ve had Selenium and Puppeteer for a long time but they are a bit brittle with respect to the web design. Change something about the design and there’s a high likelihood that a test will break. Seems like this might be able to be smarter about adapting to changes. That’s also a great use for a smaller model like this.

robot_jesus2mo ago

Yeah. I think that's an interesting use case. Especially if I can kick it off or schedule it when I'm not actively working. Inference speed (especially with tool calling involved) won't be great on my machines, but if I schedule nightly usability tests of dev sites while I sleep, that could be really cool.

drob5182mo ago

You’re right about inference speed being a concern. I was assuming it’s a small model but even then, one of the browser automation frameworks is going to be faster.

ttul2mo ago· 2 in thread

I could see a future in which the major AI labs run a local LLM to offload much of the computational effort currently undertaken in the cloud, leaving the heavy lifting to cloud-hosted models and the easier stuff for local inference.

dominotw2mo ago

wouldnt that be counter to their whole business model?

ttul2mo ago

I don't think so. Acquiring hardware for inference is a chokepoint on growth. If they can offload some inference to the customer's machine, that allows them to use more of their online capacity to generate money.

vbtechguyOP2mo ago· 1 in thread

Here is how I set up Gemma 4 26B for local inference on macOS that can be used with Claude Code.

canyon2892mo ago

This is a nice writeup!

pseudosavant2mo ago· 1 in thread

I want local models to succeed, but today the gap vs cloud models still seems continually too large. Even with a $2k GPU or a $4k MBP, the quality and speed tradeoff usually isn’t sensible.

Credit to Google for releasing Gemma 4, though. I’d love to see local models reach the point where a 32 GB machine can handle high quality agentic coding at a practical speed.

ffsm82mo ago

Fwiw, the real reason we don't have 100+ GB GPUs is because Nvidia likes to segment their markets. They could sell the consumer cards with 200gb gddr RAM on it, they just know that'd eat into their enterprise offering which is quiet literally all their profit margin (which I may add is gargantuan as of 2025)

ashwanth_megas2mo ago

The interesting bottleneck I keep running into isn’t just model quality — it’s lifecycle management of models in constrained environments (load → run → unload patterns, plus routing between different models depending on task type).

Curious if anyone else is exploring per-request model execution rather than keeping models resident all the time.

jedisct12mo ago

Running Gemma 4 with llama.cpp and Swival:

$ llama-server --reasoning auto --fit on -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL --temp 1.0 --top-p 0.95 --top-k 64

$ uvx swival --provider llamacpp

Done.

alfiedotwtf2mo ago

PSA: For those getting stuck in a repetitive loop or just stopping without completing a task, try the interactive template. I just tried it now and it's blowing my already impressive results out of the water (llama.cpp):

    --jinja --chat-template-file models/templates/google-gemma-4-31B-it-interleaved.jinja

bicepjai2mo ago

Totally agree lmstudio headless server on a remote machine but control models from your laptop is an amazing workflow. But Gemma 4 was not a good model atleast in my trials “find me the largest text file in all of the current sub folders” it went on a loopy tool call for ever even with Q8

janalsncm2mo ago

Qwen3-coder has been better for coding in my experience and has similar sizes. Either way, after a bunch of frustration with the quality and price of CC lately I’m happy there are local options.

AbuAssar2mo ago

omlx gives better performance than ollama on apple silicon

Imanari2mo ago

How well do the Gemma 4 models perform on agentic coding? What are your impressions?

aetherspawn2mo ago

Can you use the smaller Gemma 4B model as speculative decoding for the larger 31B model?

Why/why not?

1 more reply

tiku2mo ago

I hate that my M5 with 24 gb has so much trouble with these models. Not getting any good speeds, even with simple models.

inzlab2mo ago

awesome, the lighter the hardware running big softwares the more novelty.

smcleod2mo ago

Did you try the MLX model instead? In general MLX tends provide much better performance than GGUF/Llama.cpp on macOS.

j / k navigate · click thread line to collapse

103 comments

63 comments · 23 top-level

Someone12342mo ago· 8 in thread

nerdix2mo ago

deskamess2mo ago

chvid2mo ago

Is it not about the same as using OpenCode?

And is running a local model with Claude Code actually usable for any practical work compared to the hosted Anthropic models?

falcor842mo ago

alfiedotwtf2mo ago

Yet Codex specifically aims out to be compatible with all backends! Up until Gemma 4 though it’s been pretty solid, but totally fails with unknown tool (I’m guessing a template issue)

wyre2mo ago

I think CC is popular because they are catering to the common denominator programmer and are going to continue to do that, not because CC is particularly turn-key.

moomin2mo ago

Right now it suits them down to the ground. You pay for the product and you don’t cost their servers anything.

phainopepla22mo ago

You don't pay anything to use Claude Code as a front end to non-Anthropic models

1 more reply

trvz2mo ago· 5 in thread

  ollama launch claude --model gemma4:26b

datadrivenangel2mo ago

It's amazing how simple this is, and it just works if you have ollama and claude installed!

gcampos2mo ago

You need to increase the context window size or the tool calling feature wont work

mil222mo ago

For those wondering how to do this:

  OLLAMA_CONTEXT_LENGTH=64000 ollama serve

or if you're using the app, open the Ollama app's Settings dialog and adjust there.

Codex also works:

  ollama launch codex --model gemma4:26b

pshirshov2mo ago

For some reason, that doesn't work for me, claude never returns from some ill loop. Nemotron, glm and qwen 3.5 work just fine, gemma - doesn't.

trvz2mo ago

Since that defaults to the q4 variant, try the q8 one:

  ollama launch claude --model gemma4:26b-a4b-it-q8_0

1 more reply

martinald2mo ago· 5 in thread

Just FYI, MoE doesn't really save (V)RAM. You still need all weights loaded in memory, it just means you consult less per forward pass. So it improves tok/s but not vram usage.

IceWreck2mo ago

It does if you use an inference engine where you can offload some of the experts from VRAM to CPU RAM. That means I can fit a 35 billion param MoE in let's say 12 GB VRAM GPU + 16 gigs of memory.

Yukonv2mo ago

2 more replies

functional_dev2mo ago

This confused me at first as well.. inactive experts skip compute, but weights are sill loaded. So memory does not shrink at all.

I found this visualisation helpful - https://vectree.io/c/sparse-activation-patterns-and-memory-e...

charcircuit2mo ago

You never need to have all weights in memory. You can swap them in from RAM, disk, the network, etc. MOE reduces the amount of data that will need to be swapped in for the next forward pass.

martinald2mo ago

2 more replies

NamlchakKhandro2mo ago· 5 in thread

I don't know why people bother with Claude code.

It's so jank, there are far superior cli coding harness out there

loveparade2mo ago

What do you recommend? I've tried both pi and opencode and both are better than claude imo, but I wonder if there are others.

tarruda2mo ago

Codex is the best out-of-box experience, especially due to its builtin sandboxing. Only drawback is that its edit tool requires the LLM to output a diff which only GPTs are trained to do correctly.

2 more replies

dimgl2mo ago

Vagueposting in Hacker News?

z0mghii2mo ago

Can you elaborate what is jank about it?

threethirtytwo2mo ago

it has visual artifacts when inferencing.

d4rkp4ttern2mo ago· 3 in thread

https://pchalasani.github.io/claude-code-tools/integrations/...

[1] https://news.ycombinator.com/item?id=47616761

peder2mo ago

d4rkp4ttern2mo ago

selectodude2mo ago

I’ve jumped over to oMLX. A ton of rough edges but I think it’s the future.

4 more replies

jonplackett2mo ago· 3 in thread

So wait what is the interaction between Gemma and Claude?

unsnap_biceps2mo ago

Using ollama's api doesn't have the same issue, so I've stuck to using ollama for local development work.

keerthiko2mo ago

4 more replies

mbesto2mo ago

I don't get why I would use Claude Code when OpenCode, Cursor, Zed, etc. all exist, are "free" and work with virtually any llm. Seems like a weird use case unless I'm missing something.

4 more replies

asymmetric2mo ago· 3 in thread

Is a framework desktop with >48GB of RAM a good machine to try this out?

pshirshov2mo ago

Only for chat sessions, not for agentic coding. It's just too slow to be practical (10 minutes to answer a simple question about a 2k LoC project - and that's with a 5070 addon card).

ac292mo ago

This article is about a MoE model with only 4B active parameters, it shouldn't take 10 minutes to answer a question about a small project.

I measured a 4bit quant of this model at 1300t/s prefill and ~60t/s decode on Ryzen 395+.

nl2mo ago

Doesn't the framework desktop have a Ryzen 395 AI? That's a unified memory architecture like the Macs.

2 more replies

seifbenayed19922mo ago· 2 in thread

Feels like we're getting closer to a good default setup where local models are private/cheap enough to use daily, and cloud models are still there when you need the extra capability.

SeriousM2mo ago

How does cloclo differ from pi-mono?

seifbenayed19922mo ago

pi-mono is a great toolkit — coding agent CLI, unified LLM API, web UI, Slack bot, vLLM pods.

drob5182mo ago· 2 in thread

robot_jesus2mo ago

drob5182mo ago

You’re right about inference speed being a concern. I was assuming it’s a small model but even then, one of the browser automation frameworks is going to be faster.

ttul2mo ago· 2 in thread

dominotw2mo ago

wouldnt that be counter to their whole business model?

ttul2mo ago

vbtechguyOP2mo ago· 1 in thread

Here is how I set up Gemma 4 26B for local inference on macOS that can be used with Claude Code.

canyon2892mo ago

This is a nice writeup!

pseudosavant2mo ago· 1 in thread

I want local models to succeed, but today the gap vs cloud models still seems continually too large. Even with a $2k GPU or a $4k MBP, the quality and speed tradeoff usually isn’t sensible.

Credit to Google for releasing Gemma 4, though. I’d love to see local models reach the point where a 32 GB machine can handle high quality agentic coding at a practical speed.

ffsm82mo ago

ashwanth_megas2mo ago

Curious if anyone else is exploring per-request model execution rather than keeping models resident all the time.

jedisct12mo ago

Running Gemma 4 with llama.cpp and Swival:

$ llama-server --reasoning auto --fit on -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL --temp 1.0 --top-p 0.95 --top-k 64

$ uvx swival --provider llamacpp

Done.

alfiedotwtf2mo ago

    --jinja --chat-template-file models/templates/google-gemma-4-31B-it-interleaved.jinja

bicepjai2mo ago

janalsncm2mo ago

Qwen3-coder has been better for coding in my experience and has similar sizes. Either way, after a bunch of frustration with the quality and price of CC lately I’m happy there are local options.

AbuAssar2mo ago

omlx gives better performance than ollama on apple silicon

Imanari2mo ago

How well do the Gemma 4 models perform on agentic coding? What are your impressions?

aetherspawn2mo ago

Can you use the smaller Gemma 4B model as speculative decoding for the larger 31B model?

Why/why not?

1 more reply

tiku2mo ago

I hate that my M5 with 24 gb has so much trouble with these models. Not getting any good speeds, even with simple models.

inzlab2mo ago

awesome, the lighter the hardware running big softwares the more novelty.

smcleod2mo ago

Did you try the MLX model instead? In general MLX tends provide much better performance than GGUF/Llama.cpp on macOS.

j / k navigate · click thread line to collapse