
0 points · andy_ppp · 6mo ago · 0 comments

How does this work with anything but trivially small context sizes!?


Tensor parallelism: the attention heads are split across GPUs, so each GPU only needs to store a fraction of the KV cache.
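As a rough back-of-the-envelope sketch of that point: the model dimensions below are made up for illustration (not taken from the thread), and the function name `kv_cache_bytes_per_gpu` is hypothetical. It assumes the KV heads divide evenly across the tensor-parallel group and ignores any replication or paging overhead a real serving stack might add.

```python
def kv_cache_bytes_per_gpu(layers, kv_heads, head_dim, seq_len,
                           batch=1, bytes_per_elem=2, tp=1):
    """Estimate KV cache bytes held by one GPU under tensor parallelism.

    Assumes KV heads are sharded evenly across `tp` GPUs (an assumption,
    not a description of any specific framework).
    """
    if kv_heads % tp != 0:
        raise ValueError("kv_heads must be divisible by the TP degree")
    heads_per_gpu = kv_heads // tp
    # Factor of 2 covers keys and values, stored for every layer and token.
    return 2 * layers * heads_per_gpu * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical model: 80 layers, 8 KV heads of dim 128, fp16, 32k context.
total = kv_cache_bytes_per_gpu(80, 8, 128, 32768, tp=1)
sharded = kv_cache_bytes_per_gpu(80, 8, 128, 32768, tp=8)
print(total / 2**30, "GiB on a single GPU")
print(sharded / 2**30, "GiB per GPU with 8-way tensor parallelism")
```

Under these assumed numbers the per-GPU cache shrinks linearly with the tensor-parallel degree, which is what leaves room for longer contexts.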
