The A5 still uses 2 Cortex A9 cores. Whatever optimizations Apple has done to it will have a minimal impact on performance. The biggest gains will still be obtained on the software side.
I agree in general, though I think Apple is going to continue adding custom DSPs to their package. And that will provide an advantage in those targeted areas that software alone will not be able to bridge.
Though the shift from A8 to A9, from single to dual core, and the emphasis on the GPU mean that CPU speed isn't the key metric it might once have been.
(EDIT: I know that a JavaScript benchmark isn't an example of this; this is more a general comment on how people talk about phone specs at the moment.)
It is indeed an SMP core (1-4 CPUs) that does out-of-order execution and register renaming. And it outperforms the Cortex-A8 (an in-order CPU) on a per-clock basis on many workloads.
But it's not an Apple part. Everyone is shipping these things now.
These days it feels like all the major sites rush out their reviews right before launch day with less than a day of actual testing. The reviews often sound like just a re-hash of the press release specs.
I think Brian Lam's Wirecutter[0] takes a step in the right direction. Gdgt[1], which has been around for a while, seems interesting, too, as it's user-generated content, so the experiences rendered are those of self-chosen peers.
P.S. For some reason, I always read your username as the first word of the comment, so it always sounds in my head like you're saying, "Ugh, Anandtech does awesome..." In my head, you're quite an irritable fellow. There's someone else named "yawn" who always seems bored.
[0]: http://thewirecutter.com/
[1]: http://gdgt.com
A DRAM package is then stacked on top of the SoC. Avoiding having to route high-speed DRAM lines on the PCB itself not only saves space but it further reduces memory latency.
Uh... come again? DRAM latency is a function of the analog circuit inside the chip, not the wires you connect to it. At best, you might be able to drive the chips at a higher transfer frequency (but even then, the limit is probably on-package in the DRAM, not due to board trace problems). But that has at best a minimal effect on latency (you're shrinking the handful of transfer cycles at the end of the read).
The advantage of the PoP configuration is precisely that it saves space -- quite a bit of space, and it's a great trick. But this bit is just way off.
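To put rough numbers on the latency point above, here is a back-of-the-envelope sketch. Every figure in it is an assumption picked for illustration (a 40 ns internal array access time, an 8-transfer burst, and two made-up transfer rates), not a measurement of the A5 or of any real DRAM part. It treats read latency as a fixed array access time plus the time to clock the burst out, and only the second term changes when the interface runs faster.

```python
# Toy model: read latency = fixed array access time + burst transfer time.
# All numbers below are hypothetical, chosen only to illustrate the argument.

def read_latency_ns(array_ns, burst_transfers, transfers_per_sec):
    """Return total read latency in nanoseconds.

    array_ns          -- time spent inside the DRAM array (row activate plus
                         column access); unaffected by the external wiring
    burst_transfers   -- number of data transfers in one burst
    transfers_per_sec -- interface transfer rate
    """
    burst_ns = burst_transfers / transfers_per_sec * 1e9
    return array_ns + burst_ns

ARRAY_NS = 40.0   # assumed internal access time
BURST = 8         # assumed burst length

slow = read_latency_ns(ARRAY_NS, BURST, 800e6)    # assumed board-routed rate
fast = read_latency_ns(ARRAY_NS, BURST, 1066e6)   # assumed faster stacked rate

saving = slow - fast
print(f"slow interface: {slow:.1f} ns")
print(f"fast interface: {fast:.1f} ns")
print(f"saving: {saving:.1f} ns ({saving / slow:.1%} of total)")
```

With these made-up inputs, a roughly one-third faster interface shaves only a few nanoseconds off a read of about 50 ns, because the array access term stays put. Different assumed timings would shift the exact figures but not the shape of the result.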