0 points · pheggs · 1d ago · 0 comments

I feel like the gap is closing: we are getting close to being able to run good-enough models locally, even for coding, and I would assume that could make some companies a bit nervous. Am I wrong about that?

35 comments · 9 top-level

UncleOxidant · 1d ago · 9 in thread

If we didn't have a RAM/GPU shortage right now they would be more nervous than they are. But as it is, very few people are going to be able to afford a rig that can run this model effectively. That's probably not going to change for several more years yet. I think if the Z.ai folks decide to come out with a flash version of GLM-5.2 specialized for coding that came in at about 80B params, then the US frontier labs would probably be more worried. Overall, the Chinese AI companies have been showing the way to do the same amount with less (sometimes much less), and as that trend continues it's going to make the frontier labs worried - but even the Chinese AI companies are going to want to protect their moat by not releasing capable models that are significantly smaller than their current flagship models. Alibaba's Qwen seems to be there now - it's gotten mighty quiet from them lately - their latest 395B model is just too large for most folks to run at home and they don't seem to be making any noises about releasing smaller ones this time around.

gpm · 1d ago

The ram/gpu shortage won't last forever though. Moreover, we can be pretty confident that long-term the prices will obey Wright's law and come down significantly (from the pre-shortage prices) as we learn to produce them more efficiently.

LLM companies are valued as if they're going to have some enduring monopoly that they can extract money from... GLM-5.2 and similar models make that valuation very very questionable.
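For reference, Wright's law says unit cost falls by a fixed fraction each time cumulative production doubles. A minimal sketch of that relationship follows; the 20% learning rate and the $100 starting cost are purely illustrative assumptions, not figures from this thread or from the memory market, and the function name is made up for the example.

```python
import math

def wright_unit_cost(first_unit_cost: float, cumulative_units: float, learning_rate: float) -> float:
    """Unit cost once `cumulative_units` have been produced, per Wright's law.

    learning_rate is the fractional cost drop per doubling of cumulative
    output (0.20 means each doubling cuts unit cost by 20%).
    """
    exponent = math.log2(1.0 - learning_rate)  # negative for 0 < rate < 1
    return first_unit_cost * cumulative_units ** exponent

# Hypothetical inputs: $100 first unit, 20% learning rate.
for doublings in range(4):
    units = 2 ** doublings
    print(units, round(wright_unit_cost(100.0, units, 0.20), 2))
```

The point of the sketch is only the shape of the curve: the decline depends on cumulative volume, not on calendar time, so a shortage delays the drop rather than cancelling it.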

UncleOxidant · 1d ago

> The ram/gpu shortage won't last forever though.

No disagreement there, but it could easily last another 3 to 5 years which is a long time in tech terms.

mannanj · 1d ago

> The ram/gpu shortage won't last forever though

Don't underestimate the market's ability to remain irrational

elorant · 1d ago

Very few people, but quite a lot of companies, especially now that per-token pricing has taken effect and companies see their invoices skyrocketing. You pay an upfront cost once and you’re done.

dannyw · 1d ago

When a large open-weight model is released, a lab, a startup, or a rich hobbyist can easily do logit-level distillation and create an XXB-param model or whatever, and in theory you should get a really good distill.
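For anyone unfamiliar: logit-level distillation trains the small model to match the big model's full output distribution rather than just its sampled text. Below is a minimal sketch of a per-token loss in plain Python, assuming the common formulation (KL divergence between temperature-softened softmaxes, scaled by the squared temperature). The function names, the temperature of 2.0, and the toy logits are illustrative assumptions; a real pipeline would do this on tensors inside a training framework.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The temperature**2 factor is a common convention to keep gradient
    magnitudes comparable across temperatures; treat it as an assumption.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2

# Toy vocabulary of four tokens (made-up logits).
teacher = [4.0, 1.0, 0.5, -2.0]
student = [2.5, 1.5, 0.0, -1.0]
print(distillation_loss(teacher, student))
print(distillation_loss(teacher, teacher))
```

The loss is zero only when the student reproduces the teacher's distribution exactly, which is why access to the teacher's logits (open weights) matters more than access to its text output.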

bawana · 22h ago

is it possible that ai companies ordered a bunch of ram just so that models cannot be run locally? they are betting new fabs won't be built before quantum takes hold.

pheggs (OP) · 17h ago

I am quite certain that it is delayed on purpose to maximize the gains, but at some point some company will see the huge demand for local ai and will want to eat the cake (given that it is feasible)

verdverm · 1d ago

I suspect the time horizon is shorter because of software advances. We are getting more capability out of smaller models

Alibaba released Qwen 3.6 "tiny" models not that long ago, they punch way above their weight(s)

UncleOxidant · 1d ago

> Alibaba released Qwen 3.6 "tiny" models not that long ago, they punch way above their weight(s)

True, Qwen3.6-27B is amazing for its size. However, it seems likely that we're not going to see any more of these smaller models from Alibaba/Qwen since several key players exited that organization a few months back.

cogman10 · 1d ago · 6 in thread

I don't think so. I could easily see a company deciding to host and run these models for their own development. If you have a dev team of about 10 people, a one-time $50k investment in an LLM server has to be pretty tempting. Unlimited tokens, decent performance, upgrade options, and potential product integrations.

For companies wanting LLMs in their products in general, I have to think going the local LLM route is even more tempting. Somewhat dumb models are more than good enough for a lot of the things people are integrating LLMs into their products for.

twelvechairs · 1d ago

Surely for most the desire is just an LLM provider that doesn't store or sell their queries (including to national actors). As long as that is allowed to happen, surely it's the answer for the vast majority.

matheusmoreira · 1d ago

> LLM provider that doesn't store or sell their queries

> As long as that is allowed to happen

It won't be. Only we can provide that, and only for ourselves.

eventualcomp · 1d ago

Where is $50k coming from again?

stingraycharles · 1d ago

That’s less than the monthly salary of 10 software engineers, and assuming they pay API prices, it probably pays for itself in about a year.

Having said that, I don’t think it’s all that tempting for companies at all, considering this whole market is developing rapidly and it’s nearly impossible to predict where we’ll be at in a year or two.
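The "about a year" figure can be sanity-checked with a quick break-even calculation. A rough sketch: the $50k and the team of 10 come from the thread, but the monthly API spend below is a hypothetical input, running costs default to zero for simplicity, and the function names are only for illustration.

```python
def payback_months(hardware_cost, monthly_api_spend, monthly_running_cost=0.0):
    """Months until avoided API spend covers the upfront hardware cost."""
    monthly_savings = monthly_api_spend - monthly_running_cost
    if monthly_savings <= 0:
        return float("inf")  # the server never pays for itself
    return hardware_cost / monthly_savings

def implied_spend_per_dev(hardware_cost, months, team_size):
    """Monthly API spend per developer that would yield the given payback."""
    return hardware_cost / months / team_size

# Hypothetical: the team currently spends $5,000/month on API tokens.
print(payback_months(50_000, 5_000))
# What a 12-month payback implies for a 10-person team.
print(round(implied_spend_per_dev(50_000, 12, 10), 2))
```

Read the other way around, a 12-month payback on $50k implies roughly $417 of API spend per developer per month, before power and admin time are counted.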

cogman10 · 1d ago

As in who pays for it or how did I arrive at that number?

For who pays for it, obviously the employer would.

For "how did I arrive at this number": a ballpark estimate from what I know about part costs. Most of that money will go towards AI cards: about $5k for the motherboard, CPU, power supply, etc., and $45k for as much RAM and as big/expensive NVIDIA cards as you can get your hands on. The B300 has 288GB of VRAM in it, which is probably what you'd be after.

Tepix · 1d ago

$50K seems low if you want to run, say, GLM 5.2 at 4-bit fast enough for a team of devs.

You need something like 6x RTX Pro 6000 at $11,800 each plus a nice server (add $10,000) = $80,800, and then quite a bit of electricity.
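The arithmetic above checks out, and the electricity part can be estimated the same way. A small sketch: the GPU count and prices are the ones quoted in the comment, while the average power draw and the price per kWh are hypothetical placeholders to be replaced with real measurements.

```python
def build_cost(gpu_count, gpu_unit_price, server_price):
    """Upfront hardware cost: GPUs plus the host server."""
    return gpu_count * gpu_unit_price + server_price

def annual_electricity_cost(avg_watts, price_per_kwh, hours_per_year=24 * 365):
    """Yearly power bill for a machine drawing avg_watts around the clock."""
    return avg_watts / 1000 * hours_per_year * price_per_kwh

# Figures quoted in the comment above.
print(build_cost(6, 11_800, 10_000))

# Hypothetical: 3 kW average draw at $0.15 per kWh.
print(round(annual_electricity_cost(3_000, 0.15), 2))
```

Under those assumed power numbers, electricity adds a few thousand dollars per year, which is small next to the hardware but not negligible over a multi-year lifetime.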

notatoad · 1d ago · 6 in thread

locally on what hardware? something like the new dgx spark, ryzen halo, or mac studio will cost you ~ $4k plus whatever you pay for power. at the rate AI is currently progressing, i think you'd be optimistic to consider that as having a 2 year depreciation.

for $4k, you can get 20 months of claude max 200. i'd take claude over the hardware.

anthropic will have something to worry about when you can run a local model on your macbook that can code. but i think we're quite a ways off from that.

chatmasta · 1d ago

Just a hunch, but I think the most cost effective “local” deployment method right now is renting GPU clusters by the hour and running all the inference software on them yourself. This will be cheaper than capital expenditure on hardware that will depreciate and become last-gen, and cheaper than OpenRouter pay per token.

fc417fc802 · 1d ago

> at the rate AI is currently progressing, i think you'd be optimistic to consider that as having a 2 year depreciation.

How so? Model capability at a fixed hardware level has been consistently (and rapidly) increasing. You might or might not be able to run state of the art 2 (or 4 or whatever) years from now but you can reasonably expect the hardware to last upwards of a decade with model performance consistently improving over that time frame.

You can get a tolerable (at least by some metrics) experience using 10 year old hardware today.

c7b · 1d ago

You can get a 128GB Strix Halo for under $3k. Used to be under $2k. Even if you believe it'll be completely obsolete for AI in two years, it'll still be good for many other things. Games for at least several more years, a great home server and/or desktop almost indefinitely. Plus, we might actually reach good enough levels for some AI use cases, if we're not already there.

And never underestimate the potential for enshittification. Your local rig will only deliver better performance over time as more and more tweaks come out. With cloud services expect the opposite to happen as subsidies run out. It's entirely possible that they will intersect on a bang per buck basis within two years.

oceanplexian · 1d ago

Yeah, 20 months of Claude Max until they rugpull you. I’m spending 7-10k/month in raw token costs on Claude Max. Having an alternative is a nice insurance policy.

tomr75 · 1d ago

people who can't afford Claude max 200 are using qwen 3.6 27b for local coding assistance already

SXX · 1d ago

You forget that after 2 years you're still gonna have said Mac Studio, which can be sold off for 30-50% of the price.

Of course it's gonna lose value faster if something magical happens with hardware manufacturing, but you'll likely get 25% back at least.

On the other side, you can't really predict how valuable Claude Max is gonna be in a year, because Anthropic can further enshittify it.
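To put numbers on the resale argument, here is a rough sketch that nets the resale value against the ~$4k hardware price and converts the remainder into months of a $200/month subscription, both figures taken from the parent comment. It ignores electricity and assumes the resale fractions mentioned in the thread; the function names are hypothetical.

```python
def net_hardware_cost(purchase_price, resale_fraction):
    """Effective cost of the hardware after selling it on."""
    return purchase_price * (1.0 - resale_fraction)

def equivalent_subscription_months(amount, monthly_fee):
    """How many months of a subscription the same money would buy."""
    return amount / monthly_fee

# Resale fractions from the thread: 25% (pessimistic), 30% and 50%.
for resale in (0.0, 0.25, 0.30, 0.50):
    net = net_hardware_cost(4_000, resale)
    months = equivalent_subscription_months(net, 200)
    print(f"resale {resale:.0%}: net ${net:,.0f} = {months:.0f} months of subscription")
```

With no resale the hardware equals 20 months of the subscription; at the 25% floor it equals 15 months, and at 50% it equals 10.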

fny · 1d ago · 2 in thread

The RAM requirements are still pretty painful.

yieldcrv · 1d ago

equilibrium in one or two more years on the consumer/prosumer side

think Apple M6 or M7 with a currently unforeseen denser memory style, 256gb RAM

a couple inference or cache improvements on the algorithmic side, using less ram for context windows and doubling token speed again

denser open source models, packing more experts for smaller active layers

it'll still be expensive but like $8,000 - $13,000 instead of $450,000 worth of B200s

stingraycharles · 1d ago

Fairly certain that model sizes and computational requirements will grow as the price for LLM compute drops.

scosman · 1d ago · 2 in thread

It's not economical to run them locally. It's amazing for privacy, and a fun hobby. But you're looking at one of three options: super slow CPU builds with $10k in RAM, $90k worth of GPUs, or a really quantized model that doesn't compare in quality.

I might build one for fun, but it's not going to change the economics alone. Still exciting it's possible.

oceanplexian · 1d ago

It depends what you’re using it for. Real-time interactive Claude Code session? No, it’s kind of impractical.

But if you already have agent loops dialed in (for example, I have one that uses a browser testing framework), it wouldn’t really affect me at all if it crunched away at 7 tokens per second all night long.
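For a sense of scale, the overnight throughput is easy to estimate. A tiny sketch: the 7 tokens per second is from the comment, while the 8-hour night and the 50,000-token job size are assumptions chosen for illustration.

```python
def tokens_generated(tokens_per_second, hours):
    """Total tokens produced at a steady rate over the given number of hours."""
    return int(tokens_per_second * hours * 3600)

def hours_needed(total_tokens, tokens_per_second):
    """Wall-clock hours to produce total_tokens at a steady rate."""
    return total_tokens / tokens_per_second / 3600

# 7 tokens/s from the comment; an 8-hour night is an assumption.
print(tokens_generated(7, 8))

# Hypothetical batch job that needs 50,000 output tokens.
print(round(hours_needed(50_000, 7), 2))
```

At that rate an unattended 8-hour run yields about 200k output tokens, which is plenty for a batch agent loop even though it would feel slow interactively.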

leansensei · 23h ago

Not really, you can do great things without them. I've been summarizing hundreds of documents. I've added MCP servers to my internal business tools (Elixir apps) and can chat with the Nous Hermes agent over Telegram about pending orders, inventory level, historical product prices, etc., without having to click/dick around with a web UI.

Sure, it cannot replace SOTA models for agentic coding, except for small, well-scoped refactorings. But even a model like ministral-3:8b or qwen3.5:9b is a boon for so many smaller use cases!

simplyluke · 1d ago · 1 in thread

You don't even need to run them locally for them to be a threat. Plenty of companies are looking at paying third-party companies to host these models, and they come in at a fraction of the price of the frontier labs.

pheggs (OP) · 16h ago

that's true. also, I watched the glm prices and it didn't take long before the prices dropped even lower for some providers. it's like another layer of competition between hosters

CamouflagedKiwi · 1d ago

The hardware requirements to run this locally are still very high. Seems far enough off mainstream for those companies not to be too worried yet.

stymaar · 1d ago

Honestly, Qwen3.6 is already what you need for the large majority of tasks.

(I only ask Opus every 5 to 10 requests, when my local Qwen fails or when I encounter a situation that is too world-knowledge-specific to be worth asking it, but that way you can live easily with Claude's cheapest plan without ever hitting a usage limit).

fsuts · 1d ago

Why do you think they are rushing to IPO!!
