sfink · 3y ago

> Even just XORing the address with a hardcoded, pseudorandom bit pattern would have alleviated the issue.

Erm, no it wouldn't. Two addresses with identical index bits will still collide after you XOR those identical indexes with the (same) random pattern.
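To make that objection concrete, here is a minimal sketch. It assumes the cache geometry quoted from the article further down the thread (6 offset bits, 12 index bits); the helper names and the XOR constant are made up for illustration and do not describe any real hardware.

```python
OFFSET_BITS = 6   # assumption: 64-byte cache lines, per the article's quoted geometry
INDEX_BITS = 12   # assumption: 2^12 sets, per the article's quoted geometry

def set_index(addr: int) -> int:
    """Cache set selected by the index bits of addr."""
    return (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)

PATTERN = 0x2F5A9C3  # arbitrary, hypothetical "pseudorandom" constant

def xor_set_index(addr: int) -> int:
    """Set index after XORing the whole address with a fixed pattern."""
    return set_index(addr ^ PATTERN)

a = 0x40000
b = a + (1 << (OFFSET_BITS + INDEX_BITS))  # same index bits, different tag

collide_plain = set_index(a) == set_index(b)
collide_xor = xor_set_index(a) == xor_set_index(b)
print(collide_plain, collide_xor)
```

XOR with a constant only permutes the set numbers: equal indexes stay equal and distinct indexes stay distinct, so the pattern of collisions is unchanged.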

You would need to do some bit mixing. The simplest would be to multiply the non-offset bits by a prime. But a multiply is still a lot slower than physically wiring the index bits to the mux.
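A rough sketch of what multiplicative mixing could look like, again assuming 6 offset bits and 12 index bits. The multiplier is just a well-known odd constant from multiplicative hashing (not claimed to be prime), and the function names are hypothetical. One detail matters: the index has to come from the high bits of the product, because the low 12 bits of a product depend only on the low 12 bits of the input and would collide exactly as before.

```python
OFFSET_BITS = 6    # assumption: 64-byte cache lines
INDEX_BITS = 12    # assumption: 2^12 sets
WORD_BITS = 64
# Assumption: an odd 64-bit constant (roughly 2^64 / golden ratio), a common
# choice for multiplicative hashing. Any well-mixed odd constant would do.
MULTIPLIER = 0x9E3779B97F4A7C15

def plain_index(addr: int) -> int:
    """Index bits wired straight through, as in the thread's cache model."""
    return (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)

def mixed_index(addr: int) -> int:
    """Multiply the non-offset bits, then take the TOP bits of the product."""
    line = addr >> OFFSET_BITS
    product = (line * MULTIPLIER) & ((1 << WORD_BITS) - 1)
    return product >> (WORD_BITS - INDEX_BITS)

# Sixteen addresses that all share the same index bits.
addrs = [k << (OFFSET_BITS + INDEX_BITS) for k in range(16)]
plain_sets = {plain_index(a) for a in addrs}
mixed_sets = {mixed_index(a) for a in addrs}
print(len(plain_sets), len(mixed_sets))
```

With the plain index every address lands in one set; with the mixed index they spread over several sets, at the cost of a multiply on the lookup path.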

The problem is a little more nuanced than the article suggests. Notice that you don't actually need any of the data to stay in the cache during the loop—you read and write an address, then never return to it again. So the fact that you overflow your associativity doesn't specifically matter; you're happy to have it evict any or every entry.

I often get these things wrong, but I believe what's happening is that it's getting blocked on the writeback from the cache evictions (because they're now dirty from the fresh write). Which means that the 256 vs 257 distinction is a tiny bit fake; if you have a large enough array, both will eventually slow down to at best the speed of the next lower cache layer. The 257 stride will postpone the reckoning longer if the cache starts out partly clean, since the writes will just pile up in the cache for a while. (Shorter version: the 257 stride will indeed probably go faster, but it's probably better to think of the effective size of the cache rather than thinking of it as just faster vs slower.)
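The "effective size" framing can be illustrated with a toy model. Everything here is an assumption: a single cache with 6 offset bits and 12 index bits, an invented 16-way associativity, 4-byte elements, and an array starting at address 0. It ignores timing, write buffers and any real L3 slice hashing; it only counts how many stores fit before some set has to evict a line, which in the write loop would be a dirty line.

```python
def accesses_before_first_eviction(stride_bytes, n_accesses, ways=16,
                                   offset_bits=6, index_bits=12):
    """Count accesses until a set is full and needs to evict.

    Toy model: `ways` is a made-up associativity, not a measured one.
    """
    mask = (1 << index_bits) - 1
    lines_in_set = {}
    for k in range(n_accesses):
        line = (k * stride_bytes) >> offset_bits
        resident = lines_in_set.setdefault(line & mask, set())
        if line in resident:
            continue          # already cached, nothing to evict
        if len(resident) == ways:
            return k          # this access would force an eviction
        resident.add(line)
    return n_accesses         # no eviction within the simulated range

# Strides in bytes, assuming 4-byte elements.
pow2 = accesses_before_first_eviction(256 * 4, 8192)
odd = accesses_before_first_eviction(257 * 4, 8192)
print(pow2, odd)
```

In this model the power-of-two stride fills its few sets and starts evicting much earlier than the 257 stride does, which is the sense in which the cache behaves as if it were smaller rather than slower.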

The article carefully sets N to 2^21 to demonstrate the behavior. (The overall conclusion is still correct!)

robocat · 3y ago

I suspect that the article is using a hot cache, and that the slowdown wouldn’t occur with a cold cache. The article implies that it is using a hot cache by saying “The array stops fitting into the L3 cache”:

  Since the cache system uses the lower 6 bits for the offset and the next 12 for the cache line index, we are essentially using just 2^(12−(10−6)) = 2^8 different sets in the L3 cache instead of 2^12, which has the effect of shrinking our L3 cache by a factor of 2^4 = 16. The array stops fitting into the L3 cache (N = 2^21) and spills into the order-of-magnitude slower RAM, which causes the performance to decrease.
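The set arithmetic in that quote can be checked with a short sketch. The 6 offset bits, 12 index bits and N = 2^21 come from the quote; the 4-byte element size (implied by a 256-element stride being 2^10 bytes) and an array that starts at address 0 are assumptions, and `sets_touched` is a made-up helper.

```python
OFFSET_BITS = 6    # from the quote: lower 6 bits are the offset
INDEX_BITS = 12    # from the quote: next 12 bits are the set index
ELEM_BYTES = 4     # assumption: 4-byte elements, so stride 256 = 2^10 bytes
N = 1 << 21        # from the quote: N = 2^21 elements

def sets_touched(stride_elems: int) -> int:
    """Number of distinct cache sets hit by one strided pass over the array."""
    mask = (1 << INDEX_BITS) - 1
    return len({((i * ELEM_BYTES) >> OFFSET_BITS) & mask
                for i in range(0, N, stride_elems)})

sets_256 = sets_touched(256)
sets_257 = sets_touched(257)
print(sets_256, sets_257)
```

For the 256 stride the count comes out as 2^(12−(10−6)) = 2^8, matching the quote; the 257 stride spreads the same pass over more sets.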
