
0 points · dakolli · 1mo ago · 0 comments

This is simply delusional. It costs 20-30k a month to run Kimi 2.6. The tokens are sold for $3 per mm.

To sell tokens profitably you'd need to be able to run inference at 150 tokens per second for less than $1,000 USD a month.
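For anyone who wants to check the arithmetic behind that figure, here is a minimal sketch. It assumes a 30-day month, a single stream running at 100% utilization, and every generated token billed at the $3 per million rate; the function and parameter names are made up for illustration.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: 30-day month


def monthly_revenue_usd(tokens_per_second, price_per_million_usd, utilization=1.0):
    """Revenue from one stream if every generated token is sold at the list price."""
    tokens = tokens_per_second * SECONDS_PER_MONTH * utilization
    return tokens / 1_000_000 * price_per_million_usd


def streams_to_cover(monthly_cost_usd, tokens_per_second, price_per_million_usd):
    """How many such streams it takes to pay for a given monthly hosting bill."""
    return monthly_cost_usd / monthly_revenue_usd(tokens_per_second, price_per_million_usd)


if __name__ == "__main__":
    revenue = monthly_revenue_usd(150, 3.0)
    print(f"one 150 tok/s stream at $3/mm: ${revenue:,.2f} per month")
    for cost in (20_000, 30_000):
        n = streams_to_cover(cost, 150, 3.0)
        print(f"${cost:,} per month needs about {n:.1f} such streams")
```

Under those assumptions one stream brings in a bit under $1,200 a month, which is where a break-even of roughly $1,000 per stream comes from. Lower utilization shrinks the number proportionally.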

I don't think people realize how expensive it is to host decently capable models and how much their use of capable models is subsidized.

You can only squeeze so many parameters onto consumer-grade hardware that's actually affordable. Two 4090s is not consumer grade, and neither is a 128GB MacBook; that is incredibly expensive for the average person. And the models you can run on that hardware are still not "good enough"; they are essentially useless.

People are betting their competency on a future where billionaires are forever generous, subsidizing inference at a 10-to-1 or 20-to-1 loss ratio. Guess what: that WILL end, and probably soon. The idea that companies can afford to give you access to $2mm in GPUs for 5 hours a day at a rate of $200.00 a month is simply unsustainable.
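A rough sketch of what those subscription numbers imply, assuming a 30-day month and reading a "10-1 loss ratio" as the provider spending $10 for every $1 of subscription revenue. That reading and the helper names are assumptions, not something stated above.

```python
def implied_hourly_rate_usd(monthly_fee_usd, hours_per_day, days_per_month=30):
    """What the subscriber effectively pays per hour of access."""
    return monthly_fee_usd / (hours_per_day * days_per_month)


def implied_provider_cost_usd(monthly_fee_usd, loss_ratio):
    """Provider spend per subscriber per month if cost is loss_ratio times revenue."""
    return monthly_fee_usd * loss_ratio


if __name__ == "__main__":
    rate = implied_hourly_rate_usd(200.0, 5)
    print(f"$200 a month for 5 hours a day is about ${rate:.2f} per hour")
    for ratio in (10, 20):
        cost = implied_provider_cost_usd(200.0, ratio)
        print(f"at {ratio}-to-1 the provider spends ${cost:,.0f} per subscriber per month")
```

On those assumptions the subscriber pays a little over a dollar an hour, while the claimed loss ratios put the provider's spend at $2,000 to $4,000 per subscriber per month.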

Right now they are trying to get you hooked. DON'T FALL FOR IT. Study, work hard, sweat and you'll reap the benefits. The guy in Switzerland making handmade watches, one a month, makes a whole lot more than the guy in China running a manufacturing line that makes 50k. Just write your own fkin code people.

Don't bet your future on having access to some billionaire's thinking machine. Intelligence, knowledge and competency aren't fungible; the LLM hype is a lie to convince you that they are.

6 comments · 6 top-level

zozbot234 · 1mo ago

No one runs SOTA models 24/7 for individual use or even for a single household or small business, whereas you can run your own hardware basically 24/7 for AI inference.

With the new DeepSeek V4 series and its uniquely memory-light KV cache you can even extend this to parallel inference in order to hide memory bandwidth bottlenecks and increase compute intensity.

This is perhaps not so useful on a 128GB or 96GB RAM Apple Silicon device (I've seen recent reports of DS4 runs with even one agent flow hitting serious thermal and power limits on these devices, so increasing compute intensity will probably not be helpful there) but it will become useful with 64GB devices or lower that have to stream from a slow disk, or with things like the DGX Spark or to a lesser extent Strix Halo, that greatly overprovision compute while being bottlenecked on memory bandwidth.

NitpickLawyer · 1mo ago

API prices are most likely not subsidised. A brief look at openrouter can tell you that. There are plenty of providers that have 0 reason to subsidise that sell models at roughly the same average price. So the model works for them (or they wouldn't do it otherwise).

CamperBob2 · 1mo ago

> It cost 20-30k a month to run Kimi 2.6. The tokens are sold for $3 per mm.

Not if you're OK with 4-bit quantization. More like $30K-$50K one time.

Spring for 8 RTX6000s instead of 4, and you can use the full-precision K2.6 weights ( https://github.com/local-inference-lab/rtx6kpro/blob/master/... ).
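To put the one-time figure next to the quoted monthly one, here is a back-of-the-envelope payback sketch. It deliberately ignores electricity, hosting, and depreciation, which are not given here, and the function name is made up for illustration.

```python
def payback_months(one_time_usd, monthly_usd):
    """Months of the recurring bill that equal the one-time hardware purchase."""
    return one_time_usd / monthly_usd


if __name__ == "__main__":
    fastest = payback_months(30_000, 30_000)  # cheapest build vs. highest monthly bill
    slowest = payback_months(50_000, 20_000)  # priciest build vs. lowest monthly bill
    print(f"payback between {fastest:.1f} and {slowest:.1f} months")
```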

nullc · 1mo ago

> two 4090s is not consumer grade

I think that is a very narrow perspective. Enormous numbers of consumers own $50,000 cars, but a pair of $2000 GPUs is "not consumer"?

I agree with your view that cheap tokens on SOTA are a trap-- people should use local AI or no AI.

vachina · 1mo ago

Training to be artisanal coder now.

hparadiz · 1mo ago

Posts like this are so funny to me. I'm staring at a mountain of old hardware right now that cost about $20k ten years ago. I have to pay someone now to come haul it away. What makes you think the current new hardware won't end up with the same fate?

> Just write your own fkin code people

Bro is nostalgic for googling random stack overflow threads for 10 days to figure out a bug the agent fixes in an hour.
