Better HN

0 points · cjbprime · 3d ago · 0 comments

You still have a core misunderstanding. Only one layer's weights need to be in memory at a time. Oversimplified, a forward pass is one matrix multiplication per layer, performed one layer at a time.

There is no swapping of working RAM. We're just talking about loading read-only weight data into RAM on demand, layer by layer. It is only as slow as your storage interface.
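The layer-at-a-time idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the tiny layer size, the one-file-per-layer layout, and the helper names (`write_layers`, `load_layer`, `matvec`, `forward`) are all hypothetical and do not reflect how any particular inference engine stores or loads weights.

```python
import array
import os
import tempfile

DIM = 4        # hypothetical layer width
N_LAYERS = 3   # hypothetical layer count


def write_layers(directory):
    """Write one small weight file per layer (stand-in for a real checkpoint)."""
    paths = []
    for i in range(N_LAYERS):
        # Diagonal matrix scaled by (i + 1), stored row-major as float32.
        weights = array.array("f", [
            float(i + 1) if r == c else 0.0
            for r in range(DIM) for c in range(DIM)
        ])
        path = os.path.join(directory, f"layer_{i}.bin")
        with open(path, "wb") as f:
            weights.tofile(f)
        paths.append(path)
    return paths


def load_layer(path):
    """Read one layer's weights from storage; this is the I/O-bound step."""
    weights = array.array("f")
    with open(path, "rb") as f:
        weights.fromfile(f, DIM * DIM)
    return weights


def matvec(weights, x):
    """Multiply a row-major DIM x DIM matrix by a vector."""
    return [
        sum(weights[r * DIM + c] * x[c] for c in range(DIM))
        for r in range(DIM)
    ]


def forward(paths, x):
    """Run the layers in order, holding only one layer's weights at a time."""
    for path in paths:
        weights = load_layer(path)   # read-only load from disk
        x = matvec(weights, x)
        del weights                  # drop the layer before loading the next
    return x


with tempfile.TemporaryDirectory() as tmp:
    layer_paths = write_layers(tmp)
    result = forward(layer_paths, [1.0, 1.0, 1.0, 1.0])
    print(result)  # [6.0, 6.0, 6.0, 6.0]
```

Note that every call to `forward` re-reads every layer from storage, so the time per pass is bounded below by the time needed to read all the weights.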


walrus01 · 3d ago

I don't think I have a core misunderstanding. I've seen the abysmal tokens-per-second (tps) rate that results from being unable to hold an entire model in actual system RAM (not swap space) at once, no matter how fast your NVMe drive's sequential read speed is.

Yes, doing that would avoid destroying an SSD by using it as swap space, but it will not be a good experience, or usable at all. Note that the original post I was replying to referred to "swapping", which generally means using system swap space as RAM.

The standard term for loading only portions of a model from disk as needed is memory mapping, not "swapping". See https://www.google.com/search?client=firefox-b-d&q=llama-ser... or search for "safetensors file memory mapping".
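For readers unfamiliar with the term, a minimal sketch of memory mapping using Python's standard `mmap` module might look like this. The file here is a throwaway stand-in created for the example, not a real model or safetensors file, and `N_FLOATS` is a hypothetical size.

```python
import mmap
import os
import struct
import tempfile

N_FLOATS = 1024  # hypothetical tensor size

# Write a stand-in weight file of little-endian float32 values 0.0, 1.0, 2.0, ...
fd, path = tempfile.mkstemp(suffix=".bin")
with os.fdopen(fd, "wb") as f:
    f.write(struct.pack(f"<{N_FLOATS}f", *[float(i) for i in range(N_FLOATS)]))

with open(path, "rb") as f:
    # Map the file read-only; no weight data is copied up front.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        # Reading a value touches only the pages backing it; the OS pages
        # them in from disk on demand and can drop them again under memory
        # pressure without writing anything to swap.
        offset = 1000 * 4  # byte offset of element 1000 (4 bytes per float32)
        (value,) = struct.unpack_from("<f", mapped, offset)
        mapped_size = len(mapped)

os.remove(path)
print(value, mapped_size)  # 1000.0 4096
```

Because the mapping is read-only and file-backed, evicted pages are simply re-read from the file when touched again, which is the distinction from swap being drawn here.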

With a model this large that can't be held in RAM? Even at the most aggressive quantization you'd be looking at 1 tps or worse.
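A back-of-the-envelope version of that estimate: if the weights have to be streamed from storage for every generated token, throughput cannot exceed read bandwidth divided by the bytes read per token. The figures below are hypothetical placeholders, not numbers from this thread, and the function name is made up for the sketch.

```python
def max_tokens_per_second(bytes_read_per_token, read_bandwidth_bytes_per_s):
    """Upper bound on decode speed when weights stream from storage per token."""
    return read_bandwidth_bytes_per_s / bytes_read_per_token


GB = 10**9
# Hypothetical figures, not taken from the thread:
weights_read_per_token = 400 * GB   # weights touched for each generated token
nvme_read_bandwidth = 7 * GB        # sequential read speed in bytes per second

estimate = max_tokens_per_second(weights_read_per_token, nvme_read_bandwidth)
print(f"{estimate:.4f} tokens/s")  # 0.0175 tokens/s
```

This ignores caching of whatever portion does fit in RAM and any compute time, so it is only a rough ceiling for the fully storage-bound case.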
