The best "free" experience I've found is using OpenCode with Big Pickle. It's not especially smart, so it often won't produce the correct result the first time, but the free tier is generous enough that I don't think I've hit the limit more than twice over around a month with frequent multi-hour sessions. If running locally is truly the goal, it's not going to fit the bill, but if the goal is just "get the best experience without having to pay for a sub or tokens", it's the least bad option I've found so far.
I have absolutely zero interest in free. I honestly don't think I'm even remotely in the same demographic as people using free tiers / models.
I want to pay. I don't want my data used for training. I want it to be open. I want it to be consistently up (more than Claude!). I want it to be fast. I don't want it to be subsidized as that's just an excuse for shitty quality. Deepseek flash knocks it out of the park on all of these except your data is used in training. I'm fine with it being hosted since there's no way I'm using it 24/7, but data MUST be private.
Basically I want Hetzner and OVH to run open model clouds. I'm convinced this is going to happen eventually when everyone realizes this is a commodity.
But I agree with your larger point. AI companies have copied Uber's aggressive posture, pushing the legal envelope with expectations of positive return. Surely they'll continue doing the same in other areas.
More than that, they have various zero data retention options and provide a convenient json list of them.
There'll probably need to be a threat of massive litigation should they fail to comply with such a policy.
Maybe people will trust companies, but those companies will rarely deserve that trust. Anyone who pays attention sees breach announcements almost every day. Security is never a concern for these companies until it embarrasses them. Then, as soon as the negative attention fades, security again becomes the second-to-last priority.
Do not trust companies with any data that is important to you unless the effective management of that data is required by law, and the laws are comprehensive.
I'm interested in this thought. There is significant motivation for providers to create a verifiable way for them not to have access to client interactions with LLMs at all, using whatever standards and protocols have to be devised in order to reassure clients.
Any good standards for privacy when interacting with LLMs could also trickle down to smaller providers, and everyone could offer guarantees. Even if the guarantee was literally just an insurance policy and a private court to decide if it pays out.
Bedrock in fact does not train on your data. It was a big deal when it was announced that they share data with Anthropic for Fable, but even then it was gated away where you’d have to explicitly allow it.
You can run Qwen3 on OVH already:
<https://www.ovhcloud.com/en/public-cloud/ai-endpoints/catalo...>
Does anyone know if OVH is ignoring the law here, or whether it does not apply for some reason?
There are far fewer (almost no) disclosure regulations on the deployer.
https://ethicalogic.com/articles/gpai-guide-roles-public-dat...
For me, paying $200-$500/month is reasonable if I can sustain a disruption-free flow that doesn't require constant yak shaving. What I've found experimenting with DeepSeek on some open source library stuff is that it's actually going to cost me much less if I don't need frontier vibing (which I don't).
I wonder if there are competent models trained purely on permissive open-source code like MIT or Apache 2.0.
The odd jank extends further: Sonatype Nexus and some other software hardcode AWS regions to choose from when configuring storage, even though your self-hosted implementation doesn’t have anything to do with AWS, so you just have to come up with fake regions. If the cloud vendors each have to reimplement it because S3 has nothing of the quality that PostgreSQL offers for DBs, then I’m hardly surprised at the state of things.
Locally I haven’t gone much further than 8k. That is sufficient for small changes on small code bases. And you need condensed tool output.
I haven’t tried any tool that compresses the tokens yet.
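To make "condensed tool output" concrete, here is a minimal sketch of the naive version: keep the head and tail of a long output and drop the middle. It is plain truncation, not a real token compressor, and the `condense` name, the 500-token default, and the 4-characters-per-token ratio are all my own assumptions rather than anything measured.

```python
def condense(output: str, max_tokens: int = 500, chars_per_token: int = 4) -> str:
    """Trim long tool output to roughly max_tokens, keeping the start and end.

    Assumption: ~4 characters per token, which is only a rough rule of thumb.
    """
    budget = max_tokens * chars_per_token
    if len(output) <= budget:
        return output  # already small enough, pass through unchanged

    marker = "\n[... output truncated ...]\n"
    keep = max(budget - len(marker), 0)
    head = keep // 2       # characters kept from the start
    tail = keep - head     # characters kept from the end
    return output[:head] + marker + (output[-tail:] if tail else "")


long_log = "x" * 10_000
short = condense(long_log, max_tokens=100)
print(len(short))
```

With an 8k window, something like this on every tool result is probably the difference between fitting a few more turns and not, though a tokenizer-aware version would be more accurate.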
1. The hardware will eventually catch up.
2. This keeps the delta between frontier models smaller.
3. We can still fine tune and own the weights.
4. The models will be more useful, faster, and reliable.
RTX is hobbyist tier, not professional tier.
Gated cloud models from hyperscalers treat us like hobbyists in their own right.
We need equivalent scale models, but open.
This is what RunPod-type services are for.
For instance, ComfyUI is an abomination that can't do half of what Nano Banana and Seedance 2.0 can do. And you have to sit around and wait 10x longer for single results.
I can rent an H200 for $3.50 an hour. That's INSANELY cheap.
I do not understand this split between hosted APIs and rinky-dink local RTX models. Both suck.
The ideal solution is models we own run on RunPods leveraging H200s.
I can spend $100-200/day on compute making much more value with the model outputs.
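Rough arithmetic on those numbers, as a sketch: it only uses the $3.50/hour rate and the $100-200/day budget quoted above, and ignores storage, egress, and idle time, which a real bill would include. The `gpu_hours` helper is just an illustrative name.

```python
def gpu_hours(budget_usd: float, hourly_rate_usd: float) -> float:
    """How many GPU-hours a budget buys at a flat hourly rental rate."""
    if hourly_rate_usd <= 0:
        raise ValueError("hourly rate must be positive")
    return budget_usd / hourly_rate_usd


H200_RATE_USD = 3.50  # rate quoted above

for budget in (100, 200):
    hours = gpu_hours(budget, H200_RATE_USD)
    # hours / 24 = how many GPUs that budget keeps busy around the clock
    print(f"${budget}/day -> {hours:.1f} GPU-hours ({hours / 24:.2f} GPUs running 24h)")
```

So that daily budget covers somewhere between one and two and a half H200s running nonstop.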
----
edit: I want to respond to comments, but the damned HN rate limits keep me to five comments a day now because I'm a contrarian and say things that rile up the anti-AI folks.
You don't need to buy an H200. It's a depreciating asset. You rent one. It's cheap to rent.
My suggestions if you want to further experiment with local models are to use llama.cpp instead of ollama [1], learn a little about the parameters that tune how much VRAM is used [2], look online for jinja template fixes for the model you're testing [3], and choose a model that was designed to do the task you want to achieve, with as high quantization as you can fit. The maximum model size you can run is VRAM + RAM, although you want as little of the model to be in system RAM as possible.
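As a back-of-the-envelope sketch of the VRAM + RAM point: the helper below guesses how many layers fit on the GPU and whether the remainder fits in system RAM. It assumes all layers are the same size and lumps the KV cache and runtime buffers into a fixed VRAM reserve, both of which are simplifications. `plan_offload` and its parameters are hypothetical names; the resulting GPU layer count is the kind of number you would pass to llama.cpp's `--n-gpu-layers` option.

```python
from dataclasses import dataclass


@dataclass
class Placement:
    fits: bool        # True if the layers left on the CPU fit in system RAM
    gpu_layers: int   # layers that fit in usable VRAM
    cpu_layers: int   # layers left in system RAM


def plan_offload(model_gb: float, n_layers: int, vram_gb: float,
                 ram_gb: float, vram_reserve_gb: float = 1.5) -> Placement:
    """Estimate a GPU/CPU layer split.

    Assumptions: equal-size layers; vram_reserve_gb stands in for KV cache,
    context buffers, and whatever the OS and display need.
    """
    per_layer_gb = model_gb / n_layers
    usable_vram = max(vram_gb - vram_reserve_gb, 0.0)
    gpu_layers = min(n_layers, int(usable_vram // per_layer_gb))
    cpu_layers = n_layers - gpu_layers
    fits = cpu_layers * per_layer_gb <= ram_gb
    return Placement(fits, gpu_layers, cpu_layers)


# Hypothetical example: 12 GB model file, 48 layers, 8 GB VRAM, 16 GB RAM.
print(plan_offload(model_gb=12, n_layers=48, vram_gb=8, ram_gb=16, vram_reserve_gb=2.0))
```

The fewer layers that end up in `cpu_layers`, the better, since anything in system RAM slows generation down.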
I'm running North Mini Code IQ3_XXS with some tuned parameters to fit my current tasks, and while it is not perfect for everything, it has not failed any tool calls I've asked it to make, or that it figured it should make on its own.
[1]: https://sleepingrobots.com/dreams/stop-using-ollama/
[2]: https://github.com/ggml-org/llama.cpp/blob/master/tools/serv...
[3]: https://gist.github.com/jscott3201/e4b155885cc68c038d6ac8909...
Sadly, the only LLM that fits the bill right now is GPT 4.1, and it’s standard in my stack because thinking models have unacceptable latency (>= 1 sec) even though they are good at tool calling. The main issue with 4.1 is that it can still make mistakes and the prompt prose has to be tuned quite a bit.
I wonder if any local models can be tuned to match the response time and tool calling while supporting many languages.
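If anyone wants to compare candidates against that 1 second bar, a minimal sketch of the measurement looks like this. `fake_model_call` is a stand-in I made up that just sleeps; you would swap in a real request to whatever local or hosted model you are testing. The 1.0 s budget mirrors the threshold mentioned above.

```python
import statistics
import time
from typing import Callable


def median_latency(call: Callable[[], object], runs: int = 5) -> float:
    """Median wall-clock seconds per call over a few runs."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def fake_model_call() -> str:
    # Stand-in for a real model request (assumption: replace with your client).
    time.sleep(0.01)
    return "ok"


LATENCY_BUDGET_S = 1.0  # the ">= 1 sec is unacceptable" threshold from above

latency = median_latency(fake_model_call)
within_budget = latency < LATENCY_BUDGET_S
print(f"median latency: {latency:.3f}s, within budget: {within_budget}")
```

Median is used rather than mean so that one slow cold start does not dominate the result.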
Or at least LM Studio if you want to play around with a lot of different models. I'm currently using it with my 7800 XT and Vulkan, as I found it left my OS more stable than ROCm does. I had a few system crashes with ROCm, as well as running out of VRAM for the OS.
Sounds like you were either running at too low a quant, or you were trying to do agents with something like Qwen 3.5 9B? With Qwen 3.6 27B at Q4_K_M, I can leave it running all night after a single one-shot, and when I come back in the morning, it’s done.
They now offer DeepSeek V4 Flash for free and it def feels like a step up.