I'm aware that SRAM is 3-6x less dense, but it isn't uncommon these days to see people with >3x the DRAM they need, so this doesn't strike me as a terribly convincing justification.
I'm also aware that $/GB is insanely high for on-CPU SRAM, but that would also be the case for on-CPU DRAM, which is why DRAM is typically put on a separate die so that its process can be optimized independently. Does the SRAM process just not optimize as well? Does it have insane power/heat requirements? What goes wrong?
Or (puts on tinfoil hat) is JEDEC full of people who design DRAM memory controllers for a living?
Avoiding a single virtual memory access (particularly one serviced from a spinning disk) is worth an enormous speed-up in the mean access time.
http://www.sisoftware.co.uk/?d=qa&f=mem_hsw
L1: 4 clocks <-- SRAM
L2: 12 clocks <-- SRAM
L3: 36 clocks <-- SRAM
L4: 136 clocks (55ns) <-- eDRAM
DRAM: 193 clocks (80ns) <-- off-die DRAM
Clock: 2.5GHz (dynamic overclocking was disabled)
5cm travel: 1 clock
With SRAM you just have to open the right gate, whereas with DRAM you have to precharge the bitlines, open the word line, wait for the tiny signal to amplify up to logic level, and only then do you get to read it out. Worse, you need tons of logic to reorder memory accesses so they can exploit multiple hits on the same word line, or proceed simultaneously in different banks. And you need to refresh each word line periodically, which requires even more logic. There is a reason why the memory controller (not the cache, the controller) is a huge chunk of the die, roughly the size of 2 cores!

If we assume that L3 and L4 have similar management overhead, then all of this takes ~100 clock cycles in the comparison above, which dominates the other costs even if we disregard the savings from simpler logic in off-die SRAM (which, combined with travel time, accounts for 60 cycles).
I still don't understand why off-die SRAM isn't sensible.
But I think the real reason SRAM is not used is that it's trapped in an expensive/low-volume local maximum, and there's not enough demand to push it into a cheaper/high-volume state. Caches actually work pretty well.
From that point of view, when the array is large enough, the particular technology you adopt has only minor influence on the access time.