Every project races to have support on launch day so they don’t lose users, but the output you get may not be correct. There are already several problems being discovered in tokenizer implementations and quantizations may have problems too if they use imatrix.
So you’re going to see a lot of “I tried it but it sucks because it can’t even do tool calls” and other reports about how the models don’t work at all in the coming weeks from people who don’t realize they were using broken implementations.
If you want to try cutting edge open models you need to be ready to constantly update your inference engine and check your quantization for updates and re-download when it’s changed. The mad rush to support it on launch day means everything gets shipped as soon as it looks like it can produce output tokens, not when it’s tested to be correct.
I keep having "I tried it but it sucks" issues mostly around tool calling and it's not clear if it's the model or ollama. And not one model in particular, any of them really.
Same thing happened when GPT-OSS launched, bunch of projects had "day-1" support, but in reality it just meant you could load the model basically, a bunch of them had broken tool calling, some chat prompt templates were broken and so on. Even llama.cpp which usually has the most recent support (in my experience) had this issue, and it wasn't until a week or two after llama.cpp that GPT-OSS could be fairly evaluated with it. Then Ollama/LM Studio updates their llama.cpp some days after that.
So it's a process thing, not "this software is better than that", and it heavily depends on the model.
It seems to me that most model providers are not running/testing via the most used backends i.e Llama, Ollama etc because if they were, they would see how broken their release is.
Tool calling is like the Achilles Heel where most will fail unless you either modify the system prompts or run via proxies so you can inject/munge the request/reply.
Like seriously… how many billions and billions (actually we saw one >800 billion evaluation last week, so almost a whole trillion) goes into AI development and yet 99.999% of all models from the big names do not work straight out of the box with the most common backends. Blows my mind!
For new LLMs I get in the habit of building llama.cpp from upstream head and checking for updated quantizations right before I start using it. You can also download llama.cpp CI builds from their release page but on Linux it’s easy to set up a local build.
If you don’t want to be a guinea pig for untested work then the safe option would be to wait 2-3 weeks
https://www.youtube.com/live/G5OVcKO70ns
The ~10 GB model is super speedy, loading in a few seconds and giving responses almost instantly. If you just want to see its performance, it says hello around the 2 minute mark in the video (and fast!) and the ~20 GB model says hello around 5 minutes 45 seconds in the video. You can see the difference in their loading times and speed, which is a substantial difference. I also had each of them complete a difficult coding task, they both got it correct but the 20 GB model was much slower. It's a bit too slow to use on this setup day to day, plus it would take almost all the memory. The 10 GB model could fit comfortably on a Mac Mini 24 GB with plenty of RAM left for everything else, and it seems like you can use it for small-size useful coding tasks.
There are many services online which offer hosted services for these models, my advice for anyone who is thinking about buying hardware to self host this is to try those first, that way you can get an impression of the capabilities and limitations of those models before you commit to buying hardware
From my experience doing this, they're nowhere close, but it's entertaining to check in once in a while.
However you should manage your expectations. Whatever the benchmarks say, you'll quickly realise they're not at all competing with Sonnet let alone Opus. Even the largest open weights models aren't really doing that.
(I haven’t tried the 120B, which I’ve read is significantly better than 20B)
Ran gemma-4-26B-A4B-it-GGUF:Q4_K_M just fine with llama.cpp though. First time in a long time that I have been impressed by a local model. Both speed (~38t/s) and quality are very nice.
This is how all open weight model launches go.
EDIT: The issue is addressed in LM Studio 0.4.9 (build 1), which auto-update wasn't picking up for me for some reason.
https://github.com/ggml-org/llama.cpp/issues/21347#issuecomm...
I'm hoping to replace coding with Claude Sonnet 4.5 with a model with an open source or open weights model. Are any of the models on Ollama.com cloud offering (https://ollama.com/search?c=cloud) or any of the models on OpenRouter.ai a close replacement? I know that no model right now matches the full performance and capabilities of Claude Sonnet 4.5, but I want to know how close I can get and with which model(s).
If there is a model you say can replace it, talk about how long you have been using it for, and using what harness (Claude code, opencode, etc), and some strengths and weakness you have noticed. I'm not interested in what benchmarks say, I want to hear about real world use from programmers using these models.
Nothing comes close, in my opinion. Sonnet and Opus are still the best models. The Codex variants of the GPT models are also great. I've tried MiniMax, GLM, Qwen and Kimi and for anything even remotely complex these models seriously struggle.
Yes, this is the conclusion I've come to as well. I don't want to continue supporting OpenAI nor Anthropic, but the other models don't seem to be anywhere close yet, despite the hype.
There's also a step to verify that it doesn't fit on the GPU with ollama ps showing "14%/86% CPU/GPU". Doesn't this mean you'll have really bad performance?
Local is great for experimentation but production workloads that need to run reliably at specific times still favor API imo. That said for privacy sensitive use cases where data cant leave the machine, setups like this are invaluable.
Lately I’ve been playing with Unsloth Studio and think that’s probably a much better “give it to a beginner” default.
So you start there and eventually you want to get off the happy path, then you need to learn more about the server and it's all so much more complicated than just using ollama. You just want to try models, not learn the intricacies of hosting LLMs.
One thing I haven't figured out: Subjectively, it feels like ollama's model loading was nearly instant, while I feel like I'm always waiting for llama.cpp to load models, but that doesn't make sense because it's ultimately the same software. Maybe I should try ollama again to convince myself that I'm not crazy and that ollama's model loading wasn't actually instant.
My distro (NixOS) has binary packages though...
And there's packages in the AUR (Arch), GURU (Gentoo), and even Debian Unstable. Now, these might be a little behind, but if you care that much you can download binaries from GitHub directly.
What does unsloth-studio bring on top?
Unsloth Studio is more featureful (well integrated tool calling, web search, and code execution being headline features), and comes from the people consistently making some of the best GGUF quants of all popular models. It also is well documented, easy to setup, and also has good fine-tuning support.
Ollama's org had people flood various LLM/programming related Reddits and Discords and elsewhere, claiming it was an 'easy frontend for llama.cpp', and tricked people.
Only way to win is to uninstall it and switch to llama.cpp.
And as someone running at 16gb card, I'm especially curious as to if I'm missing out on better performance?
Used to be an Ollama user. Everything that you cite as benefits for Ollama is what I was drawn to in the first place as well, then moved on to using llama.cpp directly. Apart from being extremely unethical, The issue is that they try to abstract away a bit too much, especially when LLM model quality is highly affected by a bunch of parameters. Hell you can't tell what quant you're downloading. Can you tell at a glance what size of model's downloaded? Can you tell if it's optimized for your arch? Or what Quant?
`ollama pull gemma4`
(Yes, I know you can add parameters etc. but the point stands because this is sold as noob-friendly. If you are going to be adding cli params to tweak this, then just do the same with llama.cpp?)
That became a big issue when Deep Seek R1 came out because everyone and their mother was making TikToks saying that you can run the full fat model without explaining that it was a distill, which Ollama had abstracted away. Running `ollama run deepseek-r1` means nothing when the quality ranges from useless to super good.
> And as someone running at 16gb card, I'm especially curious as to if I'm missing out on better performance?
I'd go so far as to say, I can *GUARANTEE* you're missing out on performance if you are using Ollama, no matter the size of your GPU VRAM. You can get significant improvement if you just run underlying llama.cpp.
Secondly, it's chock full of dark patterns (like the ones above) and anti-open source behavior. For some examples:
1. It mangles GGUF files so other apps can't use them, and you can't access them either without a bunch of work on your end (had to script a way to unmangle these long sha-hashed file names) 2. Ollama conveniently fails contribute improvements back to the original codebase (they don't have to technically thanks to MIT), but they didn't bother assisting llama.cpp in developing multimodal capabilities and features such as iSWA. 3. Any innovations to the do is just piggybacking off of llama.cpp that they try to pass off as their own without contributing back to upstream. When new models come out they post "WIP" publicly while twiddling their thumbs waiting for llama.cpp to do the actual work.
It operates in this weird "middle layer" where it is kind of user friendly but it’s not as user friendly as LM Studio.
After all this, I just couldn't continue using it. If the benefits it provides you are good, then by all means continue.
IMO just finding the most optimal parameters for a models and aliasing them in your cli would be a much better experience ngl, especially now that we have llama-server, a nice webui and hot reloading built into llama.cpp
This is what pushed me away from Ollama. All I wanted was to scp a model from one machine to another so I didn't have to re-download it and waste bandwidth. But Ollama makes it annoying, so I switched to llama.cpp. I did also find slightly better performance on CPU vs Ollama, likely due to compiling with -march=native.
> (they don't have to technically thanks to MIT)
Minor nit: I'm not aware of any license that requires improvements to be upstreamed. Even GPL just requires that you publish derivative source code under the GPL.
I personally prefer Pi as I like the fact that it's minimalist and extensible. But some people just use Claude Code, some OpenCode, there are a ton of options out there and most of them can be used with local models.
I've got a workaround for that called petsitter where it sits as a proxy between the harness and inference engine and emulates additional capabilities through clever prompt engineering and various algorithms.
They're abstractly called "tricks" and you can stack them as you please.
https://github.com/day50-dev/Petsitter
You can run the quantized model on ollama, put petsitter in front of it, put the agent harness in front of that and you're good to go
If you have trouble, file bugs. Please!
Thank you
edit: just checked, the ollama version supports everything
$ llcat -u http://localhost:11434 -m gemma4:latest --info
["completion", "vision", "audio", "tools", "thinking"]
so you can just use that.Ollama is slower and they started out as a shameless llama.cpp ripoff without giving credit and now they “ported” it to Go which means they’re just vibe code translating llama.cpp, bugs included.
I've benchmarked this on an actual Mac Mini M4 with 24 GB of RAM, and averaged 24.4 t/s on Ollama and 19.45 t/s on LM Studio for the same ~10 GB model (gemma4:e4b), a difference which was repeated across three runs and with both models warmed up beforehand. Unless there is an error in my methodology, which is easy to repeat[1], it means Ollama is a full 25% faster. That's an enormous difference. Try it for yourself before making such claims.
[1] script at: https://pastebin.com/EwcRqLUm but it warms up both and keeps them in memory, so you'll want to close almost all other applications first. Install both ollama and LM Studio and download the models, change the path to where you installed the model. Interestingly I had to go through 3 different AI's to write this script: ChatGPT (on which I'm a Pro subscriber) thought about doing so then returned nothing (shenanigans since I was benchmarking a competitor?), I had run out of my weekly session limit on Pro Max 20x credits on Claude (wonder why I need a local coding agent!) and then Google rose to the challenge and wrote the benchmark for me. I didn't try writing a benchmark like this locally, I'll try that next and report back.
llama.cpp is about 10% faster than LM studio with the same options.
LM studio is 3x faster than ollama with the same options (~13t/s vs ~38t/s), but messes up tool calls.
Ollama ended up slowest on the 9B, Queen3.5 35B and some random other 8B model.
Note that this isn't some rigorous study or performance benchmarking. I just found ollama unnaceptably slow and wanted to try out the other options.
And didn't Ollama independently ship a vision pipeline for some multimodal models months before llama.cpp supported it?
The project is just a bit underwhelming overall, it would be way better if they just focused on polishing good UX and fine-tuning, starting from a reasonably up-to-date version of what llama.cpp provides already.
Hmm, the fact that Ollama is open-source, can run in Docker, etc.?
In some places in the source code they claim sole ownership of the code, when it is highly derivative of that in llama.cpp (having started its life as a llama.cpp frontend). They keep it the same license, however, MIT.
There is no reason to use Ollama as an alternative to llama.cpp, just use the real thing instead.
There is no reason to ever use ollama.
I just checked their docs and can't see anything like it.
Did you mistake the command to just download and load the model?
brew install llama.cpp
use the inbuilt CLI, Server or Chat interface. + Hook it up to any other app
The Gemma models were literally released yesterday. You can’t ask LLMs for advice on these topics and get accurate information.
Please don’t repeat LLM-sourced answers as canonical information
Everyone hated Qwen3.5 at launch too because so many implementations were broken and couldn’t do tool calling.
You need to ignore social media “I tried this and it sucks” echo chambers for new model releases.
Have you tried using the new Gemma 4 models with agentic coding tools?If you do, you might end up agreeing with me.