> Each individual iteration: ~4x slower (register spilling)
> Cache pressure: ~2-3x additional penalty (instructions don't fit in L1/L2 cache)
> Combined over a billion iterations: 158,000x total slowdown

I think that "2-3x additional penalty" refers to this:

> The 2.78x code bloat means more instruction cache misses, which compounds the register spilling penalty.

Also, the analysis refers elsewhere to other factors that weren't included in this part.
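
For what it's worth, here is a quick arithmetic sketch of how far the two quoted factors get on their own. It assumes the register-spilling and cache penalties simply multiply per iteration, which the quoted text doesn't actually state, and all the names below are made up for illustration. Under that assumption the two factors give roughly 8-12x, and a constant per-iteration slowdown gives the same ratio whether you run ten iterations or a billion, so the rest of the 158,000x would have to come from those other factors.

```python
# Figures taken from the quoted analysis; treating them as
# multiplicative per-iteration factors is an assumption.
spill_factor = 4.0                # "~4x slower (register spilling)"
cache_factor_range = (2.0, 3.0)   # "~2-3x additional penalty"
quoted_total = 158_000.0          # "158,000x total slowdown"


def combined_slowdown(spill, cache):
    """Per-iteration slowdown if the two penalties multiply."""
    return spill * cache


def total_time_ratio(iterations, per_iter_slowdown, base_cost=1.0):
    """Ratio of slow to fast total time when every iteration costs the same."""
    slow = iterations * base_cost * per_iter_slowdown
    fast = iterations * base_cost
    return slow / fast


low = combined_slowdown(spill_factor, cache_factor_range[0])
high = combined_slowdown(spill_factor, cache_factor_range[1])

# Factor the "other factors" would need to supply to reach the quoted total.
residual_low = quoted_total / high
residual_high = quoted_total / low

print(f"combined per-iteration slowdown: {low:.0f}x to {high:.0f}x")
print(f"ratio over 10 iterations:   {total_time_ratio(10, low):.0f}x")
print(f"ratio over 10**9 iterations: {total_time_ratio(10**9, low):.0f}x")
print(f"unexplained factor: about {residual_low:,.0f}x to {residual_high:,.0f}x")
```

This is only a model where each iteration has a fixed cost; it says nothing about what those other factors are or whether they justify the remainder.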