undefined | Better HN

0 pointsunderlines1y ago0 comments

llama 3.2 3b, qwen2.5 3B quantized to 4bit runs CPU inference quite fast. You can get a beefier VM and still save a ton of money. Depending on the context token length of this soluion, it's either fast or slow. If it's below 1024 tokens per request, you get around 10 sec delay, if you are at around 128 tokens I guess you would be somewhere at 1 sec for time to first token...

0 comments

1 comments · 1 top-level

Jabrov1y ago

My point still stands … 10 seconds is an eternity, and a 3B model isn’t that performant.

I’d rather not offer a demo at all than offer it with these parameters.

j / k navigate · click thread line to collapse