As I understand it, this advancement doesn't let you run bigger models; it lets you maintain more chat context. So Anthropic and OpenAI won't need as much hardware running inference to serve their users, but it doesn't do much toward making bigger models work on smaller hardware.
I'm not an expert, though, so maybe my understanding of the memory allocation is wrong.
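The memory split I'm picturing can be sketched with a rough back-of-envelope. The numbers here are illustrative assumptions (a hypothetical 70B-parameter model in fp16, with layer count, KV-head count, and head dimension loosely modeled on common grouped-query-attention configs), not figures from any specific deployment:

```python
# Back-of-envelope: inference memory = fixed model weights + per-token KV cache.
# All model dimensions below are illustrative assumptions, not real specs.

params = 70e9            # assumed parameter count
bytes_per_param = 2      # fp16
weights_gb = params * bytes_per_param / 1e9
print(f"weights: {weights_gb:.0f} GB (fixed, regardless of context)")

layers = 80              # assumed
kv_heads = 8             # assumed (grouped-query attention)
head_dim = 128           # assumed
# 2x for storing both keys and values at each layer
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_param

context_tokens = 100_000
cache_gb = kv_bytes_per_token * context_tokens / 1e9
print(f"KV cache at {context_tokens} tokens: {cache_gb:.1f} GB (grows with context)")
```

Under these assumptions, shrinking the KV cache lets the same hardware hold longer chats or more concurrent users, but the weights term is untouched, so it doesn't help fit a bigger model.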