
0 points · syntaxing · 3y ago · 0 comments

Supposedly the new update with GPU offloading will bring that up to 10 tokens per second! 1 token per second is painfully slow; that's about 30s for a sentence.
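To put numbers on that, here is a rough sketch of the arithmetic. It assumes a sentence is about 30 tokens, which is what the 30s figure implies at 1 token per second but is not stated outright; the function name is made up for illustration.

```python
def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Time to emit num_tokens at a steady generation rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# Assumption: ~30 tokens per sentence, implied by "30s at 1 token per second".
SENTENCE_TOKENS = 30

for rate in (1.0, 10.0):
    secs = seconds_to_generate(SENTENCE_TOKENS, rate)
    print(f"{rate:g} tok/s: about {secs:.0f}s per sentence")
```

So if the 10 tokens per second claim holds, the same sentence would take about 3s instead of 30s.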


1 comment · 1 top-level

int_19h · 3y ago

10 tokens / second is what you get running llama-30b entirely on the GPU. A 65b model will be slower than that since there's more compute involved.
