
0 points · edg5000 · 3mo ago · 0 comments

So limiting the max context length also reduces VRAM needs a bit? If the KV cache is 20% of total memory, capping context at 1/10th of the original length would cut the cache by 90%, which works out to an 18% reduction in total memory.
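A quick sketch of that arithmetic, assuming the cache grows linearly with context length and everything else (weights, activations) stays the same size. The function name and parameters are made up for illustration:

```python
def total_memory_reduction(cache_fraction: float, context_scale: float) -> float:
    """Fraction of total memory saved when the KV cache shrinks with context.

    cache_fraction: share of total memory used by the KV cache at full context.
    context_scale:  new max context divided by the original (0.1 means 1/10th).
    Assumes the cache scales linearly with context and nothing else changes.
    """
    return cache_fraction * (1.0 - context_scale)


# The numbers from the comment: cache is 20% of total, context capped at 1/10th.
saving = total_memory_reduction(cache_fraction=0.20, context_scale=0.10)
print(f"{saving:.0%}")  # prints 18%
```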


1 comment · 1 top-level

valine · 3mo ago

Yup, exactly. In principle it helps on both fronts: it speeds up inference by reducing memory bandwidth usage, and it shrinks the memory footprint of your KV cache.
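To put rough numbers on it, here is a minimal sketch of the usual KV cache size estimate for standard multi-head or grouped-query attention. The model shape below is hypothetical, not taken from any model discussed here, and the estimate ignores framework overhead and cache quantization:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2,
                   batch_size: int = 1) -> int:
    """Approximate KV cache size in bytes.

    The leading 2 accounts for storing both keys and values in every layer.
    bytes_per_elem=2 assumes fp16/bf16 cache entries.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * bytes_per_elem * batch_size)


# Hypothetical model shape, chosen only for illustration.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

full = kv_cache_bytes(context_len=32768, **shape)
capped = kv_cache_bytes(context_len=4096, **shape)

print(f"full context:   {full / 2**30:.2f} GiB")
print(f"capped context: {capped / 2**30:.2f} GiB")
```

The size is linear in `context_len`, so capping context at 1/8th cuts the cache to 1/8th. During decoding, each new token's attention reads the cached keys and values, so a smaller cache also means fewer bytes moved per token.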
