
WASDx · 8d ago · 0 points

> distributed LLM inference

This seems extremely inefficient: if the model is distributed, activations have to be transferred between model layers over the network. I found a project called Petals that claims up to 4 tok/s for a 180B model, although its repository hasn't been updated in two years.

https://petals.dev/


1 comment · 1 top-level

stymaar · 8d ago

For token generation, yes: because current-gen LLMs are autoregressive, you pay the inter-node latency for every single token.

For prompt processing it would work, though, since the whole prompt can move through the nodes as one batch, and it could work for diffusion LLMs as well.
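
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes a simple pipeline where the model is split across `num_hops` network hops and nothing overlaps; the function names and all the figures are made up for illustration and are not measurements from Petals or any other system.

```python
def decode_tokens_per_second(num_hops, hop_latency_s, compute_s_per_token):
    """Autoregressive decode: token N+1 cannot start until token N has
    crossed every hop, so the hop latency is paid once per token."""
    return 1.0 / (compute_s_per_token + num_hops * hop_latency_s)


def prefill_seconds(prompt_tokens, num_hops, hop_latency_s, batched_compute_s_per_token):
    """Prompt processing: all prompt tokens travel through the pipeline
    together, so the hop latency is paid once for the whole prompt."""
    return prompt_tokens * batched_compute_s_per_token + num_hops * hop_latency_s


# Hypothetical numbers: 5 hops at 50 ms each, 50 ms of compute per decoded token.
print(decode_tokens_per_second(num_hops=5, hop_latency_s=0.05, compute_s_per_token=0.05))

# Hypothetical numbers: 1000-token prompt, 1 ms of batched compute per token.
print(prefill_seconds(prompt_tokens=1000, num_hops=5, hop_latency_s=0.05,
                      batched_compute_s_per_token=0.001))
```

With those made-up inputs, decode is dominated by network latency (0.25 s of the 0.3 s per token), while for prefill the same 0.25 s is a one-time cost on top of the compute.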
