Better HN

0 points · smusamashah · 2mo ago · 0 comments

But isn't this already happening at https://taalas.com/? They have a demo of Llama running at 17,000 tokens per second: https://chatjimmy.ai/


11 comments · 3 top-level

gjsman-1000 · 2mo ago · 8 in thread

With some research, that chip looks like it would cost about $300-$400 to manufacture, die only.

For an 8B parameter model.

Opus is estimated at 500B-2T parameters. At that scale you’re past reticle limits and need HBM and multi-die packaging, which means you’ve essentially built an inference ASIC (like Groq or Etched) rather than something categorically cheaper than GPUs. The “burned into silicon” advantage mostly evaporates at frontier scale.

mixermachine · 2mo ago

The cutting-edge, max-size models will likely stay in the GPU space for a long time. But these models are not needed for most general requests. With a fine-tuned, quantized 30B model you can serve a large portion of requests with around 32GB of RAM. Free users will likely only get these kinds of models.

At some point we will get these models in hardware and the cost per token will be minimal.
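As a rough sanity check on the 32GB figure, here is a back-of-the-envelope sketch of weight memory alone. The function name and the bit widths tried are illustrative assumptions, not anything stated in the thread, and the estimate deliberately ignores KV cache, activations, and runtime overhead.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights only, in GB (1 GB = 1e9 bytes).

    Assumption: every parameter is stored at the same bit width.
    KV cache, activations, and runtime overhead are NOT included.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # Bit widths chosen for illustration; real quantization schemes vary.
    for bits in (16, 8, 4):
        print(f"30B params at {bits}-bit: {weight_memory_gb(30, bits):.1f} GB")
```

Under these assumptions an 8-bit 30B model nearly fills 32GB with weights alone, while a 4-bit one leaves room for context, so the 32GB figure is plausible only with fairly aggressive quantization.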

zozbot234 · 2mo ago

> With a fine-tuned, quantized 30B model you can serve a large portion of requests with around 32GB of RAM. Free users will likely only get these kinds of models.

These are exactly the kinds of models that you can easily run locally by repurposing existing hardware. Depending on how long you're willing to wait for the answer, running locally even gives you strictly better outcomes for simple Q&A queries.

(Long-context and agentic use cases are admittedly much harder to fit under that model, since non-AI uses for the high-end hardware you'd realistically need for those are rather more limited, and they're hit by the ongoing hardware shortage.)

mixermachine · 2mo ago

For programmers maybe. I do this too. But think about all the regular users out there. Your dad and your mum, maybe even your grandparents. This is a huge market too, and for that we can use these special chips at scale.

tomrod · 2mo ago

Does the cost scale linearly or superlinearly? What does the $300-$400 price data point tell us in relation to the parameter density?

No gotchas here. I genuinely don't know whether 8B parameters is in a zone with significantly decreasing marginal returns; it's too far outside my knowledge area, but I'm genuinely curious.

avidiax · 2mo ago

Die size increases cost exponentially, by decreasing chips per wafer and decreasing yield.

I expect that this kind of burned-in model is also very difficult to verify (how do you know if some of the weights are off?), and not amenable to partial disablement to increase yield. For CPUs, you just laser-disable bad cores. You can't forgo part of a neural net.
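To put rough numbers on the die-size point, here is a minimal sketch using a textbook dies-per-wafer approximation and a simple Poisson yield model. The 300 mm wafer, the defect density, and the wafer cost are all assumed placeholder values, not figures from this thread or from any foundry, and real yield models are more involved.

```python
import math


def dies_per_wafer(die_area_mm2: float, wafer_diameter_mm: float = 300.0) -> int:
    """Common approximation: gross area ratio minus an edge-loss term."""
    radius = wafer_diameter_mm / 2.0
    gross = math.pi * radius ** 2 / die_area_mm2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2.0 * die_area_mm2)
    return max(0, int(gross - edge_loss))


def poisson_yield(die_area_mm2: float, defects_per_cm2: float) -> float:
    """Poisson model: fraction of dies with zero defects, exp(-area * density)."""
    area_cm2 = die_area_mm2 / 100.0
    return math.exp(-area_cm2 * defects_per_cm2)


def cost_per_good_die(die_area_mm2: float, wafer_cost: float,
                      defects_per_cm2: float = 0.1) -> float:
    """Wafer cost spread over the expected number of defect-free dies."""
    good_dies = dies_per_wafer(die_area_mm2) * poisson_yield(die_area_mm2, defects_per_cm2)
    return wafer_cost / good_dies


if __name__ == "__main__":
    WAFER_COST = 10_000.0  # placeholder value, not a real quote
    for area in (100, 200, 400, 800):
        print(area, "mm^2 ->", round(cost_per_good_die(area, WAFER_COST), 2))
```

In this model the yield term falls off exponentially with area while the die count falls roughly linearly, so cost per good die grows faster than die area. Salvaging partially defective dies, as discussed below, would soften that curve.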

robkop · 2mo ago

You can ablate surprisingly large chunks of a model with next to no effect. You can try this easily: download an open-weight model and load it in torch.

Obviously it’s not ideal, but you could likely have a single-digit percentage of all weights affected and still have a useful model (many caveats here: e.g. the locality of the damaged weights matters, the distribution of errors matters, whether they fail high or low matters, …)
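For anyone who wants to see the shape of such an experiment without downloading anything, here is a toy sketch in plain Python. It is a stand-in only: the layer is a random Gaussian matrix rather than a trained model, the helper names are made up for illustration, and the result says nothing about how a real network's redundancy behaves. With a real model you would zero weights in the loaded tensors and compare outputs or benchmark scores instead.

```python
import math
import random


def ablate(weights, fraction, rng):
    """Return a copy with each entry independently zeroed with probability `fraction`."""
    return [[0.0 if rng.random() < fraction else w for w in row] for row in weights]


def matvec(weights, x):
    """Plain matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


def ablation_similarity(fraction, size=64, seed=0):
    """Cosine similarity of one random layer's output before and after ablation."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(size)] for _ in range(size)]
    x = [rng.gauss(0.0, 1.0) for _ in range(size)]
    return cosine(matvec(weights, x), matvec(ablate(weights, fraction, rng), x))


if __name__ == "__main__":
    for fraction in (0.0, 0.01, 0.05, 0.10):
        print(f"{fraction:.0%} zeroed -> cosine {ablation_similarity(fraction):.3f}")
```

This only models the "fail to zero, uniformly at random" case. Clustered damage or weights stuck at large values, the caveats listed above, would need a different `ablate`.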

hdndjsbbs · 2mo ago

I mean, you probably can just turn off defective parts of the network. You better believe if this becomes popular they would salvage yields by selling "dumber" chips at a discount.


robkop · 2mo ago

There are a lot of tradeoffs to play with. Those inference ASICs may not carry gradients, but they are still optimised for large batches and for running any model. They need enough memory for the weights, wide batch inference, and ideally leftovers for KV cache efficiency.

For personal inference you’re given a lot more room to play in, much of it poorly explored today, and enough to cast doubt on the argument that the cost advantages evaporate.

margalabargala · 2mo ago

You mean the person saying "I won't tell you why" might not know what they're talking about?! Say it ain't so.

pindab0ter · 2mo ago

I just tried chatjimmy.ai for a bit and while it is absolutely blazingly fast, it's also not a very strong model. I suppose that with time, stronger models will be able to run on such hardware, too.
