There are some (very primitive by software standards) "hash functions" implemented in hardware to make that work. IIRC at that time they treat the relevant bits in an address as a vector in GF(2) and multiply by a certain hardcoded invertible matrix. (Of course there are properties for that matrix besides being invertible.) The idea is that if a single bit in the input address changes, all output bits should have equal probability of flipping and not flipping (the avalanche effect).
https://people.freebsd.org/~lstewart/articles/cpumemory.pdf
DRAM is using hardware technology that is hard. Each dram bit can be addressed by ranks, banks, chips rows and columns. Depending on stuff, you typically read/load a whole cache line from DRAM. This is nice and easy. However, AFAIU there is delay needed in the chip after reading bytes. This means a read to subsequent cache line would be slow. Therefore, there is a strong reason to lay out subsequent data far away, ideally on different chips.
So you can read cache line #0 that can be on some physical chip #A and if you read line #1 it should not be on chip #A but ideally on a "random" other physical die.
This mapping, of "physical memory address" into actually where the data is on the DDRAM, is very much a hash function. This is usually done by the CPU manufacturer I guess. See the reversing of sandy bridge algo:
https://lackingrhoticity.blogspot.com/2015/04/l3-cache-mappi...
Rowhammer is an attack, where the attacker targets a single physical chip of DRAM and cause its misbehaviour due to keeping it active in unusal patterns. RowHammer is basically about reversing that hashing and exploiting the shortcuts dram manufacturers took when producing the chips.
https://blog.cloudflare.com/every-7-8us-your-computers-memor...
But this is about DRAM. I don't know how it works in Cache, but I don't think there is a special hashing algo. It just uses middle bits to address the cache line, as the parent article says.
Just pointing this out (there are several generations of Power's with improved design) as you mentioned memory controllers. The problem in the blog post is different (aliasing) and I guess that all I can say is that unless some randomness is introduced, every design will have a "worst" access pattern.
Also, main memory has a different hash than L3 cache does, and this hash is designed for sequential access.
Computers are designed almost exclusively around the idea of spatial and temporal locality of access, so it turns out this kind of access pattern doesn't happen all that much, and when it does it's often suboptimal code that's not worth a CPU designer trying to optimize much for anyway. At least not in the primary caches where indexing delay is critical so anything but power of two is prohibitive. It is good to be aware of though, and occasionally it can be noticeable and come in use.
I remember I found an issue years ago we had a supercomputer and the kernel was handing out memory to each process in powers of two when they were allocating memory, because it's memory allocator had a per-CPU magazine that would refill from a larger shared pool in batches, and those batch sizes were powers of two (with powers of two allocation sizes).
So each process would only be given memory that ever indexed into like 1/2 or 1/4 of the cache on its CPU. Other styles of workloads never noticed because you would have lots of irregularity, things allocating and freeing and context switching and all different activity on different CPUs. But on supercomputers you often have a pretty clean memory space with little other activity, and work kicks off by starting off ~identical processes on all CPUs and each allocate large chunks of memory.
Strided access has always been slow, on CPUs, on GPUs, on everything, for the last 30 years. Its a known issue. "Fixing" strided memory means breaking other things (such as in-order memory access). And that's just not worth it.
-------
This access pattern (strided memory access) leads to channel conflicts, bank conflicts, and other uneven memory accesses in GPU space.
This entire behavior around associative caches (CPUs) or channel/bank conflicts (GPUs) is the convention, and has been for decades. I don't see any good reason to change.
If it really is a problem for your code, then you can layout the memory differently in your code. (Ex: GPU programmers layout their structures to be non-power of 2, with innocuous padding in their structures to help mitigate this channel conflict problem)
------
At this point, you're gonna end up breaking all the optimized code for the last 30 years to fix slow code (that always has been slow, and therefore probably doesn't need to be optimized). That's just a bad move.
Yes and no. It's not as if hardware engineers completely stuck our heads in the sand and just let the performance be shitty, they've actually developed very sophisticated prefetchers (in this case, literally called a "stride prefetcher"). Many modern stride prefetchers can detect and launch concurrent lines faults at strides of over a megabyte. Here, we're explicitly exploiting that line faults _do not_ need to happen in order thanks to memory ordering rules provided by the ISA.
While you will still get crumby cache utilization due to hammering a single set, if you are legitimately just going once through a loop and operating on single items (i.e. you're streaming and not returning to items once accessed like in the article), the prefetcher will make a huge difference and get you a decent hit rate.
If the cache is vulnerable to "bad access patterns" when the "bad" access patterns are the obvious go-to access patterns that anyone would go for, then it's the cache's fault. Even just XORing the address with a hardcoded, pseudorandom bit pattern would have alleviated the issue.
Erm, no it wouldn't. Two addresses with identical index bits will still collide after you XOR those identical indexes with the (same) random pattern.
You would need to do some bit mixing. The simplest would be to multiply the non-offset bits by a prime. But a multiply is still a lot slower than physically wiring the index bits to the mux.
The problem is a little more nuanced than the article suggests. Notice that you don't actually need any of the data to stay in the cache during the loop—you read and write an address, then never return to it again. So the fact that you overflow your associativity doesn't specifically matter; you're happy to have it evict any or every entry.
I often get these things wrong, but I believe what's happening is that it's getting blocked on the writeback from the cache evictions (because they're now dirty from the fresh write). Which means that the 256 vs 257 distinction is a tiny bit fake; if you have a large enough array, both will eventually slow down to at best the speed of the next lower cache layer. The 257 stride will postpone the reckoning longer if the cache starts out partly clean, since the writes will just pile up in the cache for a while. (Shorter version: the 257 stride will indeed probably go faster, but it's probably better to think of the effective size of the cache rather than thinking of it as just faster vs slower.)
The article carefully sets N to 2*21 to demonstrate the behavior. (The overall conclusion is still correct!)
Yeah, that's in the same vein as Seznec's 'skew-associative' cache where a cache line would be either direct mapped or mapped based on some hash of its address
From what I understand, a hardware cache wouldn’t store a line multiple times in main memory.
So in that direct mapped example blocks 0, 64, and 256 are all mapped to cache line 0, and you can only have one of them in cache at any given moment. If block 0 is in cache and block 256 gets accessed, block 0 will be evicted from cache because block 256 will replace it in cache line 0.
I suppose you're already familiar with this one: https://archive.org/details/computerarchitectureaquantitativ...
(It might have been HP PA-RISC, from an article in Byte Magazine - but that's digging deep into the old memory murk.)
Might anyone recall more?