Better HN

0 points · CamperBob2 1d ago · 0 comments

> It cost 20-30k a month to run Kimi 2.6. The tokens are sold for $3 per mm.

Not if you're OK with 4-bit quantization. More like $30K-$50K one time.

Spring for 8 RTX6000s instead of 4, and you can use the full-precision K2.6 weights (https://github.com/local-inference-lab/rtx6kpro/blob/master/...).


reissbaker 1d ago

RTX 6000 Pro retails for $10k so an 8x is $80k before anything else in the computer, and long-context will have... pretty bad performance (20+ seconds of waiting before any tokens come out), but it's true it technically works.

I don't think cloud models are going away; the hardware for good perf is expensive and higher param count models will remain smarter for a looong time. Even if the hardware cost for kind-of-usable perf fell to only $10k, cloud ones will be way faster and you'd need a lot of tokens to break even.
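As a rough sanity check on "a lot of tokens to break even", the sketch below divides the hardware price by the cloud price per token. It uses only figures quoted in this thread ($80k for eight cards, $3 per million tokens) and ignores electricity, the rest of the machine, and resale value, so the real break-even point would be higher. The function name `break_even_tokens` is illustrative, not from any library.

```python
def break_even_tokens(hardware_usd: float, cloud_usd_per_million: float) -> float:
    """Tokens you must generate locally before the hardware cost equals
    what the same number of tokens would have cost from a cloud API."""
    if cloud_usd_per_million <= 0:
        raise ValueError("cloud price must be positive")
    return hardware_usd / cloud_usd_per_million * 1_000_000


# Figures quoted in the thread: 8 cards at about $10k each, tokens at $3 per million.
tokens = break_even_tokens(hardware_usd=80_000, cloud_usd_per_million=3.0)
print(f"{tokens / 1e9:.1f} billion tokens to break even")  # 26.7 billion
```

Under those numbers the GPUs alone pay for themselves only after roughly 26.7 billion tokens, which is the scale behind the "a lot of tokens" remark.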

zozbot234 1d ago

> I don't think cloud models are going away; the hardware for good perf is expensive

I think local AI will win in its niche by repurposing users' existing hardware, especially as cloud hardware itself gets increasingly bottlenecked in all sorts of ways and the price of cloud tokens rises. You don't have to care about "bad" performance when you've got dedicated hardware that runs your workloads 24/7. Time-critical work that also requires the latest and greatest model can stay on the cloud, but a vast amount of AI work just isn't that critical.

reissbaker 1d ago

Users do not have an existing $80k of hardware, are not going to buy $80k of hardware for worse performance than paying $100/month, and models are continuing to grow in size while memory grows in price.

zozbot234 1d ago

You said you need $80k in hardware for "good performance". I'm saying the local AI inference workflow will be a lot more flexible about performance than that, and can get away with something vastly cheaper and in line with what the user owns already.

otabdeveloper4 1d ago

> paying $100/month

There will not ever be a monthly subscription for LLM tokens. The economics isn't there.

Local tokens will always be cheaper.


alfiedotwtf 1d ago

If 8 x RTX 6000 is getting you 20s before initial token, how are cloud vendors doing this?

CamperBob2 (OP) 1d ago

RTX6000s are great but they are several times slower than a real datacenter-grade GPU. They still use GDDR memory rather than HBM, for example.

otabdeveloper4 1d ago

> higher param count models will remain smarter for a looong time

They're not smarter, they just know more stuff.

You probably don't need knowledge about Pokemon or the Diamond Sutra in your enterprise coding LLM.

The "smarts" comes from post-training, especially around tool use.

anon7725 1d ago

If the smarts came from post-training, we could show significant gains by doing that post-training again for previous generations of models. But we know that isn’t happening: effective post-training is necessary but not sufficient for model performance.

otabdeveloper4 23h ago

> we could show significant gains by doing that post-training again for previous generations of models

That's what Chinese models are doing, and beating Opus et al.

CamperBob2 (OP) 1d ago

> You probably don't need knowledge about Pokemon or the Diamond Sutra in your enterprise coding LLM.

That's one of the biggest remaining head-scratchers in this whole business. You do need all that unrelated stuff to make a good coding model.

Nobody knows why you can't build a coding model by training on nothing but code, CS texts, specifications, and case studies, but so far it appears that you can't.

otabdeveloper4 11h ago

This one is kind of obvious - because people prompt coding LLMs with natural language. That's unrelated to stuffing the pre-train set with trivia factoids.

An LLM that knows English very well isn't actually very large and certainly not hundreds of billions of parameters.

zozbot234 1d ago

4-bit quantization is native for Kimi 2.x series.

CamperBob2 (OP) 1d ago

You're right, I was thinking of Qwen. K2.6 will run at UD-Q2_K_XL precision on 4x RTX6000 boards, but I have no idea if it's worthwhile.
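To see why card count and quantization level trade off the way this sub-thread describes, here is a weights-only memory estimate. The 96 GB per card and the 1-trillion-parameter count are assumptions, not figures from the thread. The estimate also treats "2-bit" and "4-bit" as flat averages (real formats such as UD-Q2_K_XL mix bit widths) and ignores KV cache and activations, so read it as a lower bound on what is needed.

```python
def weight_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


ASSUMED_PARAMS = 1.0e12    # assumption: roughly 1T total parameters
ASSUMED_GB_PER_CARD = 96   # assumption: 96 GB of VRAM per card

for cards in (4, 8):
    vram = cards * ASSUMED_GB_PER_CARD
    for bits in (2, 4):
        need = weight_gb(ASSUMED_PARAMS, bits)
        verdict = "fits" if need < vram else "does not fit"
        print(f"{cards} cards ({vram} GB), {bits}-bit weights: {need:.0f} GB -> {verdict}")
```

Under these assumptions, 4-bit weights (about 500 GB) exceed the 384 GB on four cards but fit in the 768 GB on eight, which matches the move to a roughly 2-bit quant for the 4-card setup.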
