undefined | Better HN

0 pointssaghm9d ago0 comments

This is basically my experience as well. I have a moderately recent but high spec desktop (Radeon 6900 XT with 16 GB VRAM, Ryzen 9 7900X 12-core, 64 GB system RAM), and I tried out some recommended models with ollama a month or two ago. Anything not geared specifically towards coding seemed to struggled with actually making tool calls instead of just stating the actions they would take without making them (and trying to get help from them to explain what I needed to configure to change that behavior was useless; qwen refused to believe that it was running in ollama and insisted that it was running from the Alibaba cloud without access to my local system), and the models intended for coding were barely thinking faster than I could type (if they had any ability to show thinking at all).

The best "free" experience I've found is using OpenCode with Big Pickle. It's not especially smart, so it often won't produce the correct result the first time, but the free tier is generous enough that I don't think I've hit the limit more than twice over around a month with frequent multi-hour sessions. If running locally is truly the goal, it's not going to fit the bill, but if the goal is just "get the best experience without having to pay for a sub or tokens", it's the least bad option I've found so far.

0 comments

51 comments · 8 top-level

rapind9d ago· 32 in thread

> The best "free" experience I've found is using OpenCode with Big Pickle.

I have absolutely zero interest in free. I honestly don't think I'm even remotely in the same demographic as people using free tiers / models.

I want to pay. I don't want my data used for training. I want it to be open. I want it to be consistently up (more than Claude!). I want it to be fast. I don't want it to be subsidized as that's just an excuse for shitty quality. Deepseek flash knocks it out of the park on all of these except you're data is used in training. I'm fine with it being hosted since there's no way I'm using it 24/7, but data MUST be private.

Basically I want Hetzner and OVH to run open model clouds. I'm convinced this is going to happen eventually when everyone realizes this is a commodity.

milesvp8d ago

If you think your data isn’t being hoovered up I’d like to point out that every model is possible due to federal crimes committed to obtain the information they were trained on. Regardless of how much you are paying, your data is worth another petty civil infraction.

horacemorace8d ago

A million times this. There is “private” as a corporate-legality licensing perspective. There is “private” as a human concept. The two are seemingly opposite, yet as all the money is focused on the former there’s no airtime left for the latter.

suncemoje8d ago

Then I'm interested if there are any facts as to what ZDR actually means?

1 more reply

wahern7d ago

Copyright violation is not per se a crime. I think a colorable defense of fair use, even if it would fail in a civil trial, would negate the mens rea element. I can't easily find caselaw or articles regarding this, though, as most criminal copyright cases involve straightforward reproduction and distribution schemes. Maybe that's because prosecutors won't press cases that might raise a question of fair use?

But I agree with your larger point. AI companies have copied Uber's aggressive posture, pushing the legal envelope with expectations of positive return. Surely they'll continue doing the same in other areas.

larodi8d ago

The curiosity is that these companies somehow got around crimes and are above law (1) and these crimes mean something in a limited jurisdiction, like copyright laws of USA/Canada are not world’s (2). So it’s all cyberpunk at this point.

aamoscodes9d ago

You can pay, and also use deepseek-v4-flash. OpenRouter even lets you "block" or limit your usage to providers that don't train on data. Since the weights are open, other companies are already serving the model on non-DeepSeek owned hardware: https://openrouter.ai/deepseek/deepseek-v4-flash

fc417fc8028d ago

> OpenRouter even lets you "block" or limit your usage to providers that don't train on data.

More than that, they have various zero data retention options and provide a convenient json list of them.

larodi8d ago

The fact OpenRouter strips https to reroute screams danger already.

1 more reply

rapind8d ago

Good to know. I hadn't checks since early is DS4's launch when they were the only provide (I think maybe there was one other, but they also trained on your data). I see several private options now.

darkmarmot9d ago

Hard to guarantee it's private if you don't keep it local... I don't have a lot of trust for companies in this space.

rapind9d ago

Yes, but I think that'll change eventually. If you trust hosting your code with a specific cloud provider then you'll probably also trust them for code assist. At least that's my theory.

There'll probably need to be a threat of massive litigation should they fail to comply with such a policy.

naikrovek8d ago

> Yes, but I think that'll change eventually.

Maybe people will trust companies, but those companies will rarely deserve that trust. Anyone that pays attention sees breach announcements almost every day. Security is never a concern for these companies until it embarrasses them. Then, as soon as the negative attention fades, security again becomes the second to last priority.

Do not trust companies with any data that is important to you unless the effective management of that data is required by law, and the laws are comprehensive.

1 more reply

rob748d ago

My company has all the code in a private GitLab instance (almost everything else is on AWS, but not GitLab), but they still use Cursor, so our internal code gets sent to whatever AI company the model I select in the dropdown belongs to. Scary if you think about it: if you use Cursor, you don't have to trust only one specific AI company, you have to trust all of them...

pessimizer8d ago

> If you trust hosting your code with a specific cloud provider then you'll probably also trust them for code assist.

I'm interested in this thought. There is significant motivation for providers to create a verifiable way for them not to deal with having access to client interactions with LLMs at all. Whatever standards and protocols have to be come up with in order to reassure clients.

Any good standards for privacy when interacting with LLMs could also trickle down to smaller providers, and everyone could offer guarantees. Even if the guarantee was literally just an insurance policy and a private court to decide if it pays out.

jen208d ago

I trust AWS in this space. I'm 100% sure that they will be precisely honoring the terms of service for Bedrock (I've never looked to see whether they claim to train on your data though).

kube-system8d ago

You didn’t look because you subconsciously know you don’t need to. AWS has a solid track record, and the certifications and audits to back it up. and that’s why everyone trusts them including the most extreme of regulated industries.

Bedrock in fact does not train on your data. It was a big deal when it was announced that they share data with Anthropic for Fable, but even then it was gated away where you’d have to explicitly allow it.

rlkf8d ago

> Basically I want Hetzner and OVH to run open model clouds

You can run Qwen3 on OVH already:

<https://www.ovhcloud.com/en/public-cloud/ai-endpoints/catalo...>

johndough8d ago

I see that OVH offers Qwen3.5-397B-A17B, which is a bit surprising to me. I thought that EU providers had to comply with the AI act where you have to provide opt-out and information about the training data once the model is sufficiently large (over 10^23 FLOPs, likely the case here), but providing information is not possible since people who train those models only give vague information at best.

Does anyone know if OVH is ignoring the law here, or whether it does not apply for some reason?

nl8d ago

OVH is acting as a "Deployer", not a "Provider", which have special meaning under the AI Act.

There are much less (almost no) disclosure regulations on the deployer.

https://ethicalogic.com/articles/gpai-guide-roles-public-dat...

1 more reply

dofm8d ago

Which law is that?

Not doubting you — just want to read it!

1 more reply

saghmOP8d ago

I'm probably somewhat adjacent to you. I would be happy to pay, but I just don't want to pay any of the companies that are actually offering things right now. I had the $20/month sub for Claude for a couple months, until one day I kept inexplicably getting errors saying I hit the limit even though their site showed my usage at less than half for the session and 8% for the week, and it seemed silly to pay for something that couldn't even properly respect its own measurements. OpenAI sketches me out too much as a company, Cursor feels lackluster when I use it for work from the account they pay for (and now is getting acquired by maybe the only AI company even sketchier than OpenAI), and I wasn't particularly impressed with Gemini or Mistral Vibe either when I tried them on the free tiers either.

rapind8d ago

I was paying around $500 / month on average between multiple providers for over a year. I cancelled one a while ago because of pretty bad service availability (Bet you guess who that is!), which by all reports hasn't improved much.

For me, paying from $200 - $500 / month is reasonable if I can sustain a disruption free flow that doesn't require constant yak shaving. What I've found experimenting with DeepSeek on some open source library stuff is that it's actually going to cost me much less if I don't need frontier vibing (which I don't).

gaolei88888d ago

who?

gb2d_hn8d ago

For me it's about the value of my time. I think that it's important that we have open models, but for getting real work done, my time is too valuable to waste it on subpar results or additional agent management when a max plan covers all the use I need. It's not worth quibbling over. If the cost / benefit ratio changes, I'll be looking harder at local set ups, but not at the moment.

Bnjoroge9d ago

You can specify which providers you want to serve your model in OpenRouter. Then you can chose US-based ones.

dvngnt_6d ago

Would venace ai work?

bel89d ago

These competent open models you want to use were trained on data from people like you and me.

I wonder if there are competent models trained purely on permissive open-source code like MIT or Apache 2.0.

yencabulator8d ago

MIT and Apache 2.0 both require attribution, so it's not like limiting to those would help in license compliance.

djmips8d ago

Did you try Claude Fable?

superze8d ago

Hetzner workforce can barely run a mature technology called s3 and you think they will be able to deploy openmodels?

KronisLV8d ago

What mature implementations of S3 are there? MinIO that rugpulled the community, Garage that doesn’t even have proper setup scripts in their Docker containers and expect you to do the init manually, or Zenko cloud server that more or less got abandoned? I think there’s also SeaweedFS which might do better but I’m surprised at how shitty everything seems in this space - surely people aren’t being crazy and either storing their files on the FS directly to expose access to them through their app (hello directory traversal attacks) or storing them in relational DBs (hello wasted bandwidth and bloated backups).

The odd jank extends further, like Sonatype Nexus and some other software hardcodes AWS regions to choose from when configuring the storage even though your self-hosted implementation doesn’t have anything to do with AWS so you just have to come up with fake regions. If the cloud vendors each have to reimplement it because there is nothing as quality as PostgreSQL is for DBs, but for S3, then I’m hardly surprised at the state of things.

chrislusf7d ago

I work on SeaweedFS. Let me know if see any bugs or just create a github issue.

spockz9d ago· 7 in thread

For what it is worth, I’m on a similar machine. (9070XT,5900X) and found a lot of performance improvement over ollama by compiling llama.cpp and running with —no-mmap and —perf. The context is still quite small though. With online models I use contexts of at least 200k which is useful for longer running/more complicated commands.

Locally I haven’t gone much further than 8k. That is sufficient for small changes on small code bases. And you need condensed tool output.

I haven’t tried any tool that compresses the tokens yet.

echelon8d ago

I would rather we give up the idea of running open models on RTX cards and instead focus on running much bigger open models on H200s.

1. The hardware will eventually catch up.

2. This keeps the delta between frontier models smaller.

3. We can still fine tune and own the weights.

4. The models will be more useful, faster, and reliable.

RTX is hobbyist tier, not professional tier.

Gated cloud models from hyperscalers treat us like hobbyists in their own right.

We need equivalent scale models, but open.

zozbot2348d ago

H200s and other enterprise datacenter GPUs are completely overkill in any realistic single- or few-users inference scenario. They're hugely unbalanced towards compute capacity which will go almost entirely unused (i.e. wasted) unless you're running huge batches on a continued basis. I've argued many times that local inference engines should support batched inference on a somewhat smaller scale for a variety of reasons (especially given the unexpected effectiveness of SSD streamed inference with larger-than-RAM models), but even I don't think we can realistically go to 300x or so for real-time inference, which is the range that pencils out quite consistently from a simple roofline model of these datacenter cards.

echelon8d ago

If you're doing professional work in coding or video, you can easily saturate a single H200.

This is what RunPod-type services are for.

For instance, ComfyUI is an abomination that can't do half of what Nano Banana and Seedance 2.0 can do. And you have to sit around and wait 10x longer for single results.

I can rent an H200 for $3.50 an hour. That's INSANELY cheap.

I do not understand this split between hosted APIs and rinky-dink local RTX models. Both suck.

The ideal solution is models we own run on RunPods leveraging H200s.

I can spend $100-200/day on compute making much more value with the model outputs.

----

edit: I want to respond to comments, but the damned HN rate limits keep me to five comments a day now because I'm a contrarian and say things that rile up the anti-AI folks.

You don't need to buy an H200. It's a depreciating asset. You rent one. It's cheap to rent.

3 more replies

SR2Z8d ago

That GPU costs 25k which means you really should have a rack to put it in. It's not realistic.

dofm8d ago

Pressure on small model quality and design is absolutely what is needed. There are still gains to be made.

MrLeap8d ago

There's a lot more professionals that have RTX cards than H200s. You're inevitably see more development and experimentation on things actual humans have lmao.

FridgeSeal8d ago

Ah yes, because of all the people at home with computers who have…checks notes…datacentre GPU’s lying around.

markussss8d ago· 2 in thread

My system is quite similar to your, my GPU is a 6950 XT and CPU a Ryzen 5 2600x, same amount of RAM, and I feel your pain. It sounds very similar to my experience from a few months ago. When it comes to tool calling, there are multiple possible issues; some models have borked templates bundled with the model file, some models are not trained on tool calling, some agent harnesses doesn't support the tool call output from the model very well, some quantizations ruin the models' abilities to call tools.

My suggestions if you want to further experiment with local models are to use llama.cpp instead of ollama [1], learn a little about the parameters that tune how much VRAM is used [2], look online for jinja template fixes for the model you're testing [3], and choose a model that was designed to do the task you want to achieve, with as high quantization as you can fit. The maximum model size you can run is VRAM + RAM, although you want as little of the model to be in system RAM as possible.

I'm running North Mini Code IQ3_XXS with some tuned parameters to fit my current tasks, and while it is not perfect for everything, it has not failed any tool calls I've asked it to make, or that it figured it should make on its own.

[1]: https://sleepingrobots.com/dreams/stop-using-ollama/

[2]: https://github.com/ggml-org/llama.cpp/blob/master/tools/serv...

[3]: https://gist.github.com/jscott3201/e4b155885cc68c038d6ac8909...

itissid8d ago

Interesting. Making low latency correct tool calls correctly is pretty important in voice AI cascading models(STT LLM TTS). Realtime Models are still 2x the cost and there are only 2 providers openai and google that are in the race. For cost control it has to be cascading models

For llms Sadly the only model right now that fits the bill for LLM is GPT 4.1 and it’s standard in my stack because thinking models have unacceptable latency(>=1 sec) even though they are good at tool calling. The main issue with 4.1 is that it can make still mistakes and prompt prose has to be tuned quite a bit.

I wonder if any local models can be tuned to match the response time and tool calling while supporting many languages.

calgoo8d ago

"My suggestions if you want to further experiment with local models are to use llama.cpp instead of ollama"

Or at least LM Studio if you want to play around with a lot of different models. Im currently using it with my 7800xt and Vulcan as i found it left my OS more stable ROCm does. I had a few system crashes with ROCm and running out of VRAM for the OS.

ryukoposting9d ago· 1 in thread

I found that, with the heavily quantized Qwen3 models I can cram onto my 3060 Ti, telling the model to use its tools in the system prompt made it a lot more likely to actually do it. YMMV of course, but give it a shot.

saghmOP8d ago

I did try this, and it was pretty hit-or-miss still. I even went as far as configuring context for Zed to inject into all conversations saying stuff like "If you need to read a file, call read_file NOW. Do not say you will read it", and it still didn't really make a huge difference.

boppo18d ago· 1 in thread

I have almost your system specs, how do they work for non-coding stuff like chat/knowledge/discussion? I've been using models to talk through social stuff I'm anxious about but dont want to annoy my friends with and it's been amazing, but I don't want to share that info with google/openai/anthropic anymore. I shouldn't have in the first place, but I couldn't help it, the exercise was too interesting.

fc417fc8028d ago

You can test the open models for yourself using the various router services. Those also make it easy to use providers other than the major players.

redmalang9d ago

Try llama.cpp it seems to be a lot more performant and a lot more hackable. Also I'm surprised how substantial the impact of some of the inference configs (beyond just temp) can have, though this is much more model specific.

alfiedotwtf8d ago

> qwen refused to believe that it was running in ollama and insisted that it was running from the Alibaba cloud without access to my local system

Sounds like you were either running at a too-low quant, or you were trying to do Agents with something like Qwen 3.5 9B? Qwen 3.6 27B at Q4_K_M I can have that at running all night after a single one-shot a, Anne when I come back in the morning, it’s done

alexpotato8d ago

> The best "free" experience I've found is using OpenCode with Big Pickle.

They now offer DeepSeek V4 Flash for free and it def feels like a step up.

j / k navigate · click thread line to collapse

0 comments

51 comments · 8 top-level

rapind9d ago· 32 in thread

> The best "free" experience I've found is using OpenCode with Big Pickle.

I have absolutely zero interest in free. I honestly don't think I'm even remotely in the same demographic as people using free tiers / models.

Basically I want Hetzner and OVH to run open model clouds. I'm convinced this is going to happen eventually when everyone realizes this is a commodity.

milesvp8d ago

horacemorace8d ago

suncemoje8d ago

Then I'm interested if there are any facts as to what ZDR actually means?

1 more reply

wahern7d ago

larodi8d ago

aamoscodes9d ago

fc417fc8028d ago

> OpenRouter even lets you "block" or limit your usage to providers that don't train on data.

More than that, they have various zero data retention options and provide a convenient json list of them.

larodi8d ago

The fact OpenRouter strips https to reroute screams danger already.

1 more reply

rapind8d ago

Good to know. I hadn't checks since early is DS4's launch when they were the only provide (I think maybe there was one other, but they also trained on your data). I see several private options now.

darkmarmot9d ago

Hard to guarantee it's private if you don't keep it local... I don't have a lot of trust for companies in this space.

rapind9d ago

Yes, but I think that'll change eventually. If you trust hosting your code with a specific cloud provider then you'll probably also trust them for code assist. At least that's my theory.

There'll probably need to be a threat of massive litigation should they fail to comply with such a policy.

naikrovek8d ago

> Yes, but I think that'll change eventually.

Do not trust companies with any data that is important to you unless the effective management of that data is required by law, and the laws are comprehensive.

1 more reply

rob748d ago

pessimizer8d ago

> If you trust hosting your code with a specific cloud provider then you'll probably also trust them for code assist.

jen208d ago

I trust AWS in this space. I'm 100% sure that they will be precisely honoring the terms of service for Bedrock (I've never looked to see whether they claim to train on your data though).

kube-system8d ago

rlkf8d ago

> Basically I want Hetzner and OVH to run open model clouds

You can run Qwen3 on OVH already:

<https://www.ovhcloud.com/en/public-cloud/ai-endpoints/catalo...>

johndough8d ago

Does anyone know if OVH is ignoring the law here, or whether it does not apply for some reason?

nl8d ago

OVH is acting as a "Deployer", not a "Provider", which have special meaning under the AI Act.

There are much less (almost no) disclosure regulations on the deployer.

https://ethicalogic.com/articles/gpai-guide-roles-public-dat...

1 more reply

dofm8d ago

Which law is that?

Not doubting you — just want to read it!

1 more reply

saghmOP8d ago

rapind8d ago

gaolei88888d ago

who?

gb2d_hn8d ago

Bnjoroge9d ago

You can specify which providers you want to serve your model in OpenRouter. Then you can chose US-based ones.

dvngnt_6d ago

Would venace ai work?

bel89d ago

These competent open models you want to use were trained on data from people like you and me.

I wonder if there are competent models trained purely on permissive open-source code like MIT or Apache 2.0.

yencabulator8d ago

MIT and Apache 2.0 both require attribution, so it's not like limiting to those would help in license compliance.

djmips8d ago

Did you try Claude Fable?

superze8d ago

Hetzner workforce can barely run a mature technology called s3 and you think they will be able to deploy openmodels?

KronisLV8d ago

chrislusf7d ago

I work on SeaweedFS. Let me know if see any bugs or just create a github issue.

spockz9d ago· 7 in thread

Locally I haven’t gone much further than 8k. That is sufficient for small changes on small code bases. And you need condensed tool output.

I haven’t tried any tool that compresses the tokens yet.

echelon8d ago

I would rather we give up the idea of running open models on RTX cards and instead focus on running much bigger open models on H200s.

1. The hardware will eventually catch up.

2. This keeps the delta between frontier models smaller.

3. We can still fine tune and own the weights.

4. The models will be more useful, faster, and reliable.

RTX is hobbyist tier, not professional tier.

Gated cloud models from hyperscalers treat us like hobbyists in their own right.

We need equivalent scale models, but open.

zozbot2348d ago

echelon8d ago

If you're doing professional work in coding or video, you can easily saturate a single H200.

This is what RunPod-type services are for.

For instance, ComfyUI is an abomination that can't do half of what Nano Banana and Seedance 2.0 can do. And you have to sit around and wait 10x longer for single results.

I can rent an H200 for $3.50 an hour. That's INSANELY cheap.

I do not understand this split between hosted APIs and rinky-dink local RTX models. Both suck.

The ideal solution is models we own run on RunPods leveraging H200s.

I can spend $100-200/day on compute making much more value with the model outputs.

----

edit: I want to respond to comments, but the damned HN rate limits keep me to five comments a day now because I'm a contrarian and say things that rile up the anti-AI folks.

You don't need to buy an H200. It's a depreciating asset. You rent one. It's cheap to rent.

3 more replies

SR2Z8d ago

That GPU costs 25k which means you really should have a rack to put it in. It's not realistic.

dofm8d ago

Pressure on small model quality and design is absolutely what is needed. There are still gains to be made.

MrLeap8d ago

There's a lot more professionals that have RTX cards than H200s. You're inevitably see more development and experimentation on things actual humans have lmao.

FridgeSeal8d ago

Ah yes, because of all the people at home with computers who have…checks notes…datacentre GPU’s lying around.

markussss8d ago· 2 in thread

[1]: https://sleepingrobots.com/dreams/stop-using-ollama/

[2]: https://github.com/ggml-org/llama.cpp/blob/master/tools/serv...

[3]: https://gist.github.com/jscott3201/e4b155885cc68c038d6ac8909...

itissid8d ago

I wonder if any local models can be tuned to match the response time and tool calling while supporting many languages.

calgoo8d ago

"My suggestions if you want to further experiment with local models are to use llama.cpp instead of ollama"

ryukoposting9d ago· 1 in thread

saghmOP8d ago

boppo18d ago· 1 in thread

fc417fc8028d ago

You can test the open models for yourself using the various router services. Those also make it easy to use providers other than the major players.

redmalang9d ago

alfiedotwtf8d ago

> qwen refused to believe that it was running in ollama and insisted that it was running from the Alibaba cloud without access to my local system

alexpotato8d ago

> The best "free" experience I've found is using OpenCode with Big Pickle.

They now offer DeepSeek V4 Flash for free and it def feels like a step up.

j / k navigate · click thread line to collapse