960 MiB is prodigiously large for such a microscopic chip, but if it "only" gains a 3-5x latency reduction over external DRAM, it's still very far from a proper L3 implementation, and far behind L2.
Make DRAM work at 1 GHz+, and then you will see miracles. Imagine a fully synchronous on-die DRAM that can sit just behind L1, or even be connected directly to the load registers.
The problem is that effective frequencies for a memory round trip haven't gone up much since the nineties. If you work with 100% cache misses, your memory will still be working at an effective frequency of around 100 to 200 MHz.
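
To put a number on "effective frequency": with 100% cache misses and each load depending on the previous one, you complete one access per round trip, so the effective rate is just the inverse of the round-trip latency. A rough sketch of that arithmetic follows. The function name is made up for illustration, and the latencies are simply the ones implied by the 100 to 200 MHz figure, not measurements.

```python
def effective_mhz(round_trip_ns: float) -> float:
    """Effective access rate in MHz when every access waits a full round trip.

    One access per round_trip_ns nanoseconds is 1000 / round_trip_ns
    million accesses per second.
    """
    if round_trip_ns <= 0:
        raise ValueError("round-trip latency must be positive")
    return 1e3 / round_trip_ns


# Illustrative latencies only (assumed, chosen to match the 100-200 MHz range).
for ns in (10.0, 5.0):
    print(f"{ns:g} ns round trip: about {effective_mhz(ns):g} MHz effective")
```

The same conversion works in the other direction: whatever round trip you actually measure on your own hardware, divide 1000 by the nanoseconds to get the effective MHz.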