Ha; there's a little architecture grognard subthread on this unrelated topic.
The Pentium 4 L1 cache was a miracle for its time, and once the P4 was clocked to Peak Netburst levels the 2-cycle latency looked really good.
Tradeoffs on modern systems are different - a Skylake L1 cache may have 4-5 cycle latency on access, but it is 4 times bigger (32 KiB rather than 8 KiB), can execute twice as many loads per cycle, and is write-back rather than write-through (more complex to design, but more scalable with lots of cores).
You can still get your ass kicked by an ancient system if you pick just the right pointer-chasing microbenchmark. This had some real implications for regex implementation, given that a straightforward DFA implementation (like many string matching algorithms, e.g. Aho-Corasick) is really just pointer-chasing: each table lookup needs the state produced by the previous one, so the loop runs at L1 load latency rather than load throughput.
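
To make the "DFA is just pointer-chasing" point concrete, here is a rough sketch of a table-driven matcher. Everything in it is an assumption for illustration: the names `build_dfa` and `dfa_search` are made up, and the table is a textbook KMP-style substring DFA rather than anything from a real regex engine. Python obviously won't show the cache latency itself; the thing to look at is the shape of the inner loop, where every lookup needs the state produced by the previous lookup.

```python
def build_dfa(pattern: bytes) -> list[int]:
    """Build a flat KMP-style transition table for substring search.

    Layout (an assumption of this sketch): next = table[state * 256 + byte].
    State len(pattern) is the accepting state and is made sticky.
    """
    m = len(pattern)
    table = [0] * ((m + 1) * 256)
    if m > 0:
        table[pattern[0]] = 1
    restart = 0  # state we would be in after a mismatch-and-retry
    for j in range(1, m):
        # On a mismatch, behave like the restart state does.
        for c in range(256):
            table[j * 256 + c] = table[restart * 256 + c]
        # On a match, advance.
        table[j * 256 + pattern[j]] = j + 1
        restart = table[restart * 256 + pattern[j]]
    # Once the pattern has been seen, stay in the accepting state.
    for c in range(256):
        table[m * 256 + c] = m
    return table


def dfa_search(table: list[int], accept: int, data: bytes) -> bool:
    """Return True if the pattern the table was built from occurs in data."""
    state = 0
    for byte in data:
        # The dependent load: the address of this lookup is not known
        # until the previous lookup has returned its state.
        state = table[state * 256 + byte]
    return state == accept
```

Usage would look like `dfa_search(build_dfa(b"abab"), 4, some_bytes)`. In a compiled implementation that inner loop is roughly one dependent load per input byte, so extra load ports don't help; only lower load-to-use latency (or restructuring the algorithm) does.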