No, but if you are thinking about edge compute for LLMs, you quantize. Models are getting more efficient, and there are plenty of SLMs and smaller LLMs (like Phi-2 or Phi-3) that are perfectly capable even on a tiny ARM device like the current range of RPi "clones".
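For anyone curious what that looks like in practice, here is a minimal llama-cpp-python sketch. The GGUF filename, thread count, and context size are assumptions on my part; point it at whatever quant you actually have:

```python
# Minimal sketch: running a quantized SLM on a Pi-class ARM board with
# llama-cpp-python. The model path is hypothetical; any Q4 quant of
# Phi-3 mini in GGUF format will do.
from llama_cpp import Llama

llm = Llama(
    model_path="./phi-3-mini-4k-instruct.Q4_K_M.gguf",  # hypothetical local file
    n_ctx=2048,   # keep the window small on low-RAM boards
    n_threads=4,  # match the board's core count
)

out = llm("Q: What does 4-bit quantization trade away? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```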
I have done experiments with 8B Llama 3 Q8 models on an M3 MBP. They run faster than I can read, and only occasionally go off the rails.
The 3.8B Phi-3 mini is near-instantaneous for simple responses on my MBP.
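If you want to put a number on "faster than I can read", a rough tokens/sec check is a few lines; again the model path here is hypothetical:

```python
# Rough throughput check for a local quantized model, a sketch assuming
# llama-cpp-python and a Phi-3 mini GGUF at this (hypothetical) path.
import time
from llama_cpp import Llama

llm = Llama(model_path="./phi-3-mini-4k-instruct.Q4_K_M.gguf", n_ctx=4096, verbose=False)

prompt = "Explain what quantization does to an LLM in two sentences."
start = time.perf_counter()
out = llm(prompt, max_tokens=128)
elapsed = time.perf_counter() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```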
When I want longer context windows, I use a hosted service, but if I only need 8,000 tokens (99% of the time that is MORE than I need), any of my computers from the last 3 years work just fine.
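The local-vs-hosted split can be as simple as counting prompt tokens before dispatching. A sketch, assuming the same llama-cpp-python setup and a hypothetical Q8 GGUF; the routing logic is just my own illustration:

```python
# Route to local inference when the request fits an 8k window,
# otherwise hand off to a hosted service (not shown here).
from llama_cpp import Llama

llm = Llama(model_path="./llama-3-8b-instruct.Q8_0.gguf", n_ctx=8192)  # hypothetical file

def complete_locally(prompt: str, max_tokens: int = 256) -> str:
    # Count prompt tokens first; bail out if prompt + generation would blow past 8k.
    n_prompt = len(llm.tokenize(prompt.encode("utf-8")))
    if n_prompt + max_tokens > 8192:
        raise RuntimeError("Too long for the local window; route to a hosted service.")
    out = llm(prompt, max_tokens=max_tokens)
    return out["choices"][0]["text"]
```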