Yeah, something people don't appreciate is that models have become so big it can take minutes to load them from an SSD. If you need to restart a CUDA process for whatever reason, you'd much rather load the model files from RAM. This means for every GB of VRAM, you also want a GB of system RAM. Then there are things like prefix caching and multi-user KV caching. Users generally don't send all their requests back-to-back in a short window of time, so you're better off freeing the VRAM after a minute of inactivity. If the user then sends another request, restoring the KV cache from system RAM is still more energy-, time- and VRAM-efficient than recomputing it. And compared to VRAM, DDR5-based DRAM is incredibly cheap.
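To make the idle-eviction idea concrete, here's a minimal sketch of what a serving loop could do, assuming a torch-based stack. The names (KVCacheEntry, IDLE_TIMEOUT) and the 60-second threshold are made up for illustration, not taken from any particular server:

    import time
    import torch

    IDLE_TIMEOUT = 60.0  # seconds of inactivity before evicting to system RAM

    class KVCacheEntry:
        def __init__(self, kv: torch.Tensor):
            self.kv = kv  # starts out on the GPU
            self.last_used = time.monotonic()

        def touch(self):
            self.last_used = time.monotonic()

        def maybe_offload(self):
            # Move the KV tensors to pinned host memory once the user goes idle.
            # Pinned memory makes the eventual copy back to VRAM much faster.
            if self.kv.is_cuda and time.monotonic() - self.last_used > IDLE_TIMEOUT:
                self.kv = self.kv.to("cpu").pin_memory()

        def restore(self) -> torch.Tensor:
            # Copying a few GB back over PCIe takes tens of milliseconds;
            # recomputing a long prefix would take seconds of GPU time.
            if not self.kv.is_cuda:
                self.kv = self.kv.to("cuda", non_blocking=True)
            self.touch()
            return self.kv

A real server would run maybe_offload() from a background sweep over all sessions, but the trade-off is the same: PCIe copy versus prefill recompute.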
> Regarding the hype of the day: AI specifically, part of it is the rise of wrappers and agents and inference in general that can run on CPUs / leverage system RAM.
It has more to do with the dominance of mixture-of-experts models. Due to expert sparsity, the required memory bandwidth per token drops significantly. It is possible to run gpt-oss-20b on a computer with 32 GB of RAM, and that segment, which used to be reserved for enthusiasts and developers, has become the mainstream amount of RAM on desktop PCs and mini PCs.
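Rough numbers, to show why the sparsity matters: gpt-oss-20b has about 21B total parameters but only about 3.6B active per token (OpenAI's published figures), and decode speed on a CPU is roughly memory-bandwidth-bound. The ~4.25 bits/weight and 80 GB/s below are my own ballpark assumptions (MXFP4-ish quantization, dual-channel DDR5):

    # Back-of-the-envelope decode speed, assuming bandwidth-bound inference.
    def tokens_per_sec(active_params_b: float, bits_per_weight: float,
                       bandwidth_gb_s: float) -> float:
        # Each decoded token has to stream every *active* weight once.
        bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
        return bandwidth_gb_s * 1e9 / bytes_per_token

    dense_20b = tokens_per_sec(21.0, 4.25, 80)  # all 21B params active
    moe_20b   = tokens_per_sec(3.6, 4.25, 80)   # only 3.6B active per token

    print(f"dense-style: {dense_20b:.1f} tok/s, MoE: {moe_20b:.1f} tok/s")
    # roughly 7 vs 42 tok/s on the same memory bus

Same hardware, same total model size, but the MoE routing turns an unusable ~7 tok/s into a perfectly interactive ~40 tok/s.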
Yeah, so if I had to summarize: the problem is that DDR4 is EOL, shifting demand to DDR5. AI demand means people want more than 16 GB of RAM (there is actually a flood of used 2x8 GB kits on the laptop DDR5 market). DRAM manufacturers switched their supply to AI data centers and stopped resupplying retailers, so retailers are running out of inventory, leading to a sharp rise in prices. Early DDR6 production is expected in 2026 with consumer availability in 2027, so there is zero incentive to expand DDR5 production.