The smaller the model, the less data has to be read from RAM for every single token: at batch size 1, generating a token means streaming essentially all of the model's weights through the processor, so memory bandwidth sets the speed limit.
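To make that concrete, here is a rough back-of-envelope sketch in Python. The helper `decode_tokens_per_second` and all hardware numbers below are illustrative assumptions, not measurements from any particular device:

```python
# At batch size 1, decode is memory-bandwidth-bound: tokens/s is roughly
# (memory bandwidth) / (bytes of weights streamed per token).

def decode_tokens_per_second(params_billions: float,
                             bytes_per_param: float,
                             bandwidth_gb_per_s: float) -> float:
    """Rough upper bound on single-stream decode speed for a dense model."""
    model_bytes = params_billions * 1e9 * bytes_per_param
    return bandwidth_gb_per_s * 1e9 / model_bytes

# Assumed examples: fp16 weights (2 bytes/param) on ~1 TB/s of bandwidth.
print(decode_tokens_per_second(7, 2.0, 1000))    # 7B model:  ~71 tokens/s
print(decode_tokens_per_second(70, 2.0, 1000))   # 70B model: ~7 tokens/s
```

Halve the bytes per parameter (say, with 8-bit quantization) and this ceiling roughly doubles.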
Batching mixes up this calculus a bit: a single pass over the weights can serve a whole batch of sequences at once, so the read cost is amortized across the batch and the bottleneck gradually shifts from memory bandwidth to compute.
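A minimal extension of the sketch above, assuming each decode step costs one full read of the weights plus roughly 2 FLOPs per parameter per token; KV-cache traffic is ignored for simplicity, and every number is again an assumption:

```python
def batched_tokens_per_second(batch_size: int,
                              model_bytes: float,
                              bandwidth_bytes_per_s: float,
                              peak_flops: float,
                              flops_per_token: float) -> float:
    """Tokens/s limited by whichever binds first: weight reads or compute."""
    time_memory = model_bytes / bandwidth_bytes_per_s          # weights read once per step
    time_compute = batch_size * flops_per_token / peak_flops   # work grows with the batch
    return batch_size / max(time_memory, time_compute)

# Assumed 7B fp16 model (~14 GB), ~1 TB/s bandwidth, ~300 TFLOP/s peak,
# ~14 GFLOPs per token (2 * params for a dense model).
for b in (1, 8, 64, 512):
    print(b, round(batched_tokens_per_second(b, 14e9, 1e12, 3e14, 14e9)))
```

Under these assumptions, throughput scales almost linearly with batch size until around batch ~300, where the step flips from memory-bound to compute-bound and the curve flattens out.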