There are benefits too. Some developers might learn to use Claude Code outside of work with cheaper models and then advocate for using Claude Code at work (where their companies will just buy access from Anthropic, Bedrock, etc). Similar to how free ESXi licenses for personal use helped infrastructure folks gain skills with that product which created a healthy supply of labor and VMware evangelists that were eager to spread the gospel. Anthropic can't just give away access to Claude models because of cost so there is use in allowing alternative ways for developers to learn how to use Claude Code and develop a workflow with it.
And is running a local model with Claude Code actually usable for any practical work compared to the hosted Anthropic models?
ollama launch claude --model gemma4:26b OLLAMA_CONTEXT_LENGTH=64000 ollama serve
or if you're using the app, open the Ollama app's Settings dialog and adjust there.Codex also works:
ollama launch codex --model gemma4:26bI found this visualisation helpful - https://vectree.io/c/sparse-activation-patterns-and-memory-e...
It's so jank, there are far superior cli coding harness out there
https://pchalasani.github.io/claude-code-tools/integrations/...
The 26BA4B is the most interesting to run on such hardware, and I get nearly double the token-gen speed (40 tok/s) compared to Qwen3.5 35BA3B. However the tau2 bench results[1] for this Gemma4 variant lag far behind the Qwen variant (68% vs 81%), so I don’t expect the former to do well on heavy agentic tool-heavy tasks:
Using ollama's api doesn't have the same issue, so I've stuck to using ollama for local development work.
And if you somehow managed to open up a big enough VRAM playground, the open weights models are not quite as good at wrangling such large context windows (even opus is hardly capable) without basically getting confused about what they were doing before they finish parsing it.
I measured a 4bit quant of this model at 1300t/s prefill and ~60t/s decode on Ryzen 395+.
Related note from someone building in this space: I've been working on cloclo (https://www.npmjs.com/package/cloclo), an open-source coding agent CLI, and this is exactly the direction I'm excited about. It natively supports LM Studio, Ollama, vLLM, Jan, and llama.cpp as providers alongside cloud models, so you can swap between local and hosted backends without changing how you work.
Feels like we're getting closer to a good default setup where local models are private/cheap enough to use daily, and cloud models are still there when you need the extra capability.
cloclo is a runtime for agent toolkits. You plug it into your own agents and it gives them multi-agent orchestration (AICL protocol), 13 providers, skill registry, native browser/docs/phone tools, memory, and an NDJSON bridge. Zero native deps.
Credit to Google for releasing Gemma 4, though. I’d love to see local models reach the point where a 32 GB machine can handle high quality agentic coding at a practical speed.
Curious if anyone else is exploring per-request model execution rather than keeping models resident all the time.
$ llama-server --reasoning auto --fit on -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL --temp 1.0 --top-p 0.95 --top-k 64
$ uvx swival --provider llamacpp
Done.
--jinja --chat-template-file models/templates/google-gemma-4-31B-it-interleaved.jinjaWhy/why not?