M2 processors have 128 byte wide cache lines?? That's a big deal. We've been at 64 bytes since what, the Pentium?
64-byte cache lines are there as part of other alignment boundaries for things like atomics, but a memory access pulls down two cache lines at a time.
Watching the Andrew Kelley video mentioned above really drives home the point that even if your compiler automatically optimizes structure ordering to minimize padding and alignment issues, it can't fix other higher-level decisions. One example is using two separate lists of structs to encode state, rather than a single list in which each struct carries an enum recording its state.
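To put rough numbers on that last example, here is a minimal sketch that uses Python's ctypes to mimic C struct layout. The Monster field names and the 64-byte line size are made up for illustration and are not taken from the video.

```python
import ctypes

CACHE_LINE = 64  # assumed line size in bytes; some CPUs use 128


class MonsterWithState(ctypes.Structure):
    """Single-list design: every element records its own state."""
    _fields_ = [
        ("hp", ctypes.c_uint32),
        ("x", ctypes.c_float),
        ("y", ctypes.c_float),
        ("alive", ctypes.c_uint8),  # stand-in for a one-byte state enum
    ]


class Monster(ctypes.Structure):
    """Two-list design: state is implied by which list holds the element."""
    _fields_ = [
        ("hp", ctypes.c_uint32),
        ("x", ctypes.c_float),
        ("y", ctypes.c_float),
    ]


# 13 bytes of data get padded up to a multiple of the 4-byte alignment.
size_with_state = ctypes.sizeof(MonsterWithState)
size_without = ctypes.sizeof(Monster)

# Roughly how many elements share one cache line in each layout.
per_line_with_state = CACHE_LINE // size_with_state
per_line_without = CACHE_LINE // size_without

# One array per state replaces the per-element field.
alive_monsters = (Monster * 3)()
dead_monsters = (Monster * 2)()

print(size_with_state, size_without, per_line_with_state, per_line_without)
```

The one-byte state field costs four bytes per element once padding is counted. No field reordering can recover that, because the saving comes from changing the data structure, not the layout.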
AOT-compiled languages could rearrange a struct for better locality; however, the majority of languages (if not all) rigidly require that fields be laid out in the order defined, for various reasons.
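For a feel of what such a rearrangement buys, here is a small sketch using ctypes, which follows the platform's C layout rules. The field names are hypothetical, and sorting by descending alignment is just one simple heuristic, not what any particular compiler does.

```python
import ctypes

# Hypothetical fields, declared in an order that forces padding.
FIELDS = [
    ("tag", ctypes.c_uint8),
    ("id", ctypes.c_uint64),
    ("flag", ctypes.c_uint8),
    ("count", ctypes.c_uint32),
]

# Heuristic: largest alignment first, so small fields fill the tail.
by_alignment = sorted(FIELDS, key=lambda f: ctypes.alignment(f[1]), reverse=True)


class Declared(ctypes.Structure):
    """Fields laid out exactly in declaration order."""
    _fields_ = FIELDS


class Reordered(ctypes.Structure):
    """Same fields, sorted by descending alignment."""
    _fields_ = by_alignment


declared_size = ctypes.sizeof(Declared)
reordered_size = ctypes.sizeof(Reordered)
print(declared_size, reordered_size)
```

On a typical 64-bit target I'd expect this to print 24 and 16, but the exact numbers depend on the platform ABI.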
The as-if rule gives an escape hatch, although in practice it is hard to take advantage of, especially without whole program optimization.
All of these things boil down to combinatorial optimization problems (bin packing ring a bell?). And there are no widely available compilers or JITs or whatever that bundle ILP solvers. Thus, what you're really getting with every compiler is a heuristic/approximate solution (to many many combinatorial optimization problems). Decide for yourself whether you're comfortable with your code just being approximately good or if you need to actually understand how your system works.
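A toy illustration of the heuristic-versus-exact gap, using bin packing since it came up. This is only a sketch: the item sizes and capacity are invented, and the function names are mine. First-fit decreasing stands in for "what a compiler might do", and an exhaustive search stands in for an exact solver.

```python
def first_fit_decreasing(sizes, capacity):
    """Greedy heuristic: largest item first, into the first bin with room."""
    bins = []
    for size in sorted(sizes, reverse=True):
        for b in bins:
            if sum(b) + size <= capacity:
                b.append(size)
                break
        else:
            bins.append([size])  # no existing bin had room
    return bins


def optimal_bin_count(sizes, capacity):
    """Exhaustive search; exponential time, so tiny inputs only.

    Assumes every item fits in an empty bin.
    """
    items = sorted(sizes, reverse=True)
    best = len(items)  # one bin per item always works

    def place(i, loads):
        nonlocal best
        if len(loads) >= best:
            return  # cannot improve on the best known packing
        if i == len(items):
            best = len(loads)
            return
        for j in range(len(loads)):
            if loads[j] + items[i] <= capacity:
                loads[j] += items[i]
                place(i + 1, loads)
                loads[j] -= items[i]
        loads.append(items[i])  # also try opening a new bin
        place(i + 1, loads)
        loads.pop()

    place(0, [])
    return best


sizes = [5, 4, 4, 3, 2, 2]
greedy_bins = first_fit_decreasing(sizes, capacity=10)
print(len(greedy_bins), optimal_bin_count(sizes, capacity=10))
```

Here the greedy packing needs three bins while two suffice ({5, 3, 2} and {4, 4, 2}). The greedy answer is cheap and close, just not optimal.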
cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
grep . /sys/devices/system/cpu/cpu*/cache/index*/coherency_line_size would be better, but lscpu -C is more useful.
While it's more portable, this approach has some drawbacks: (1) the statically known cache line size might not be right if you deploy your code to a target CPU different from the one you compiled for, and (2) using this value to define structure and member alignment in headers could end up with interesting bugs if different source files including the header are built with different flags. Also: your toolchain might not support it yet.
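Drawback (1) goes away if you ask the machine at run time instead of baking the number in at build time. A minimal sketch, assuming Linux: it reads the same sysfs file as the cat command, then tries os.sysconf, then falls back to a guessed 64. The function name and the default are mine, not from any library.

```python
import os

SYSFS_PATH = "/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size"


def cache_line_size(default=64):
    """Best-effort cache line size in bytes, detected at run time."""
    # Linux sysfs: the same file the cat command reads.
    try:
        with open(SYSFS_PATH) as f:
            value = int(f.read().strip())
        if value > 0:
            return value
    except (OSError, ValueError):
        pass

    # Some platforms expose the figure through sysconf; many do not.
    sysconf = getattr(os, "sysconf", None)
    if sysconf is not None:
        try:
            value = sysconf("SC_LEVEL1_DCACHE_LINESIZE")
            if value > 0:
                return value
        except (OSError, ValueError):
            pass

    # Assumed fallback; wrong on CPUs with 128-byte lines.
    return default


print(cache_line_size())
```

This only reports cpu0's first cache index, so it is no substitute for lscpu -C on machines with mixed core types.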
Not correct. Prefetching has been around for a while, and is rather important in optimization.
https://www.kernel.org/doc/html/latest/core-api/unaligned-me... has a lot more general context on alignment and why it's important
[1] https://people.freebsd.org/~lstewart/articles/cpumemory.pdf
[2] https://samueleresca.net/analysis-of-what-every-programmer-s...
It forced you to think in terms of: [array of input data -> operation -> array of intermediate data -> operation -> array of final output data]
Our OOP game engine had to transform its OOP data into arrays of input data before feeding them into an operation, which meant a lot of unnecessary memory copies. We had to break objects into "operations", which was not intuitive, but that got rid of a lot of memory copies. Only then did we manage to get decent performance.
The good thing is that by doing this we also got an automatic performance increase on the Xbox 360, because we were (consciously? unconsciously?) optimizing for cache usage.
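For anyone who hasn't seen that style, here is a tiny sketch of the [array -> operation -> array] shape using Python's array module for contiguous storage. The integrate and clamp operations and the sample numbers are invented for illustration; they're not from the engine described above.

```python
from array import array


def integrate(positions, velocities, dt):
    """Operation 1: input arrays -> array of intermediate positions."""
    return array("d", (p + v * dt for p, v in zip(positions, velocities)))


def clamp(values, lo, hi):
    """Operation 2: intermediate array -> array of final output data."""
    return array("d", (min(max(v, lo), hi) for v in values))


# Each attribute lives in its own contiguous array, not inside objects.
positions = array("d", [0.0, 1.0, 2.0])
velocities = array("d", [1.0, -4.0, 0.5])

moved = integrate(positions, velocities, dt=0.5)
final = clamp(moved, lo=0.0, hi=2.0)
print(list(moved), list(final))
```

Each operation walks its inputs front to back and touches nothing else, which is the access pattern caches and prefetchers like best.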
A while back I had to create a high-speed streaming data processor: not a Spark cluster or similar creature, but a C program that could sit in-line in a high-speed data stream, match specific patterns, and take actions based on the type of pattern that hit. As part of optimizing for speed and throughput, a colleague and I did an obnoxious level of experimentation with read sizes (slurps of data) to minimize I/O wait queues and memory pressure. Aligning reads with the cache-line size, at either 1x or 2x, was the winner. Good low-level, close-to-the-hardware C fun for sure.
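The read-sizing idea can be sketched like this. It is not the original C program: the 64-byte line size, the function name, and the 300-byte dummy input are all assumptions, and Python makes no promise about how the buffer itself is aligned in memory, so this shows only the sizing.

```python
import io

LINE_SIZE = 64  # assumed cache line size in bytes


def read_chunks(stream, lines_per_read=2, line_size=LINE_SIZE):
    """Yield successive reads sized to a whole number of cache lines.

    The buffer is reused between reads, so consume each chunk before
    requesting the next one.
    """
    buf = bytearray(lines_per_read * line_size)
    view = memoryview(buf)
    while True:
        n = stream.readinto(view)
        if not n:
            return  # end of stream
        yield view[:n]


# Hypothetical input: 300 zero bytes standing in for the data stream.
stream = io.BytesIO(bytes(300))
chunk_sizes = [len(chunk) for chunk in read_chunks(stream)]
print(chunk_sizes)
```

With an in-memory stream the reads come back full until the tail. Pipes and sockets may return short reads, so a real processor would have to handle those.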
But otherwise this is a good general overview of how caching is useful.