Exploring How Cache Memory Works (opens in new tab)

(pikuma.com)

129 points_xivi2y ago33 comments

33 comments

27 comments · 12 top-level

hinkley1y ago· 4 in thread

Wait wait wait.

M2 processors have 128 byte wide cache lines?? That's a big deal. We've been at 64 bytes since what, the Pentium?

CyberDildonics1y ago

In practicality intel CPUs have pulled down 128 bytes at a minimum when you access memory for a very long time.

64 byte cache lines are there an part of other alignment boundaries for things like atomics, but accessing memory pull down two cache lines at time.

Tuna-Fish1y ago

Memory fetch length and coherency unit size are different things, and the latter matters much more.

monocasa1y ago

Yeah, 64 bytes is kind of an unstated x86 thing. It'd be hell for them to change that, a lot of perf conscious code aligns to 64 byte boundaries to combat false sharing.

kllrnohj1y ago

all ARM-designed cores are also 64-bytes. It's not just an x86 thing

2 more replies

dangoldin1y ago· 4 in thread

Really cool stuff and a nice introduction but curious how much modern compilers do for you already. Especially if you shift to the JIT world - what ends up being the difference between code where people optimize for this vs write in a style optimized around code readability/reuse/etc.

tux19681y ago

JIT compilers can't compensate for poorly organized data. Ultimately, understanding these low-level concepts, affect high-level algorithm design and selection.

Watching the Andrew Kelly video mentioned above, really drives home the point that even if your compiler automatically optimizes structure ordering, to minimize padding and alignment issues, it can't fix other higher-level decisions. An example being, using two separate lists of structs to maintain their state data, rather than a single list with each struct having an enum to record its state.

kllrnohj1y ago

JIT languages tend to have the worst language-provided locality as they are often accompanied by GCs and lack of value types (there are exceptions to this, but it's broadly the case). And a JIT cannot re-arrange heap memory layout of objects as it must be hot-swappable. This is why despite incredibly huge investments in them such languages just never reach aot performance despite how much theoretical advantage a jit could have.

AOT'd languages could re-arrange a struct for better locality however the majority (if not all) languages rigidly require the fields are laid out in the order defined for various reasons.

gpderetta1y ago

> the majority (if not all) languages rigidly require the fields are laid out in the order defined for various reasons.

The as-if rule gives an escape hatch, although in practice it is hard to take advantage of, especially without whole program optimization.

almostgotcaught1y ago

I wish people would stop saying this. It's like CS-woo the idea that some magical solution exists that saves you from having to think real hard about the hardware, that magical abstractions save the day.

All of these things boil down to combinatorial optimization problems (bin packing ring a bell?). And there are no widely available compilers or JITs or whatever that bundle ILP solvers. Thus, what you're really getting with every compiler is a heuristic/approximate solution (to many many combinatorial optimization problems). Decide for yourself whether you're comfortable with your code just being approximately good or if you need to actually understand how your system works.

eikenberry1y ago· 3 in thread

In case you are wondering about your cache-line size on a Linux box, you can find it in sysfs.. something like..

    cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size

Hello711y ago

  grep . /sys/devices/system/cpu/cpu*/cache/index*/coherency_line_size

would be better, but

  lscpu -C

is more useful.

eikenberry1y ago

Didn't know about 'lscpu -C'.. thanks!

wyldfire1y ago

Another (more portable) way to get the cache line size is to use std::hardware_(destructive|constructive)_interference_size from C++17.

While it's more portable, there are some drawbacks. Drawbacks with this approach are: (1) the statically known cache line size might not be right if you deploy your code to a target CPU different from the one you compiled for, (2) using this value to define structure + member alignment in headers could end up with interesting bugs if different source files including the header are built with different flags. Also: your toolchain might not support it yet.

slashdave1y ago· 2 in thread

"On the other hand, data coming from main memory cannot be assumed to be sequential and the data cache implementation will try to only fetch the data that was asked for."

Not correct. Prefetching has been around for a while, and rather important in optimization.

gustavopezzi1y ago

Hi. Gustavo Pezzi here (from pikuma.com). You're absolutely right; I'll make sure to pass this to Gabriel and I'm sure we'll add this detail to the blog post soon. Thank you for the heads up.

slashdave1y ago

Lots of fun stuff here. Last I checked, Intel's prefetching only works in the positive direction! A reason to have loops move forward. Also, number of simultaneous prefetching streams depends on architecture.

branko_d1y ago· 2 in thread

Why is the natural alignment of structs equal to the size of their largest member?

kllrnohj1y ago

To ensure that member is itself still aligned properly in "global space". The start of the struct is assumed to be universally aligned (malloc, etc.. make that a requirement in fact) or aligned for the requirements of the struct itself (eg, array). Thus any offset into the struct only needs to be aligned to the requirements of the largest type.

https://www.kernel.org/doc/html/latest/core-api/unaligned-me... has a lot more general context on alignment and why it's important

jcranmer1y ago

It's not. It's equal the maximum alignment of their members. For primitive types (like integers, floating-point types and pointers), size == alignment on most machines nowadays (although on 32-bit machines, it can be a toss-up whether a 64-bit integer is 64-bit aligned or 32-bit aligned), so it can look like it's based on size though.

wyldfire1y ago

Drepper's "What Every Programmer Should Know About Memory" [1] is a good resource on a similar topic. Not so long ago, there was an analysis done on it in a series of blog posts [2] from a more modern perspective.

[1] https://people.freebsd.org/~lstewart/articles/cpumemory.pdf

[2] https://samueleresca.net/analysis-of-what-every-programmer-s...

emschwartz1y ago

In a similar vein, Andrew Kelly, the creator of Zig, gave a nice talk about how to make use of the different speeds of different CPU operations in designing programs: Practical Data-Oriented Design https://vimeo.com/649009599

ptelomere1y ago

Something I've experienced first hand. Programming the ps3 forced you to manually do what CPU caches does in the background, which is why the ps3 was a pain in the butt for programmers who were so used to object-oriented style programming.

It forced you to think in terms of: [array of input data -> operation -> array of intermediate data -> operation -> array of final output data]

Our OOP game engine had to transform their OOP data to array of input data before feeding it into operation, basically a lot of unnecessary memory copies. We had to break objects into "operations", which was not intuitive. But, that got rid a lot of memory copies. Only then we managed to get decent performance.

The good thing, by doing this we also get automatic performance increase on the xbox360 because we were consciously ? unconsciously ? optimizing for cache usage.

DLA1y ago

I learned so much from this blog and from the discussion. HN is so awesome. +1 for learning about lacpu -C here.

A while back I had to create a high speed steaming data processor (not a spark cluster and similar creatures), but a c program that could sit in-line in a high speed data stream and match specific patterns and take actions based on the type of pattern that hit. As part of optimizing for speed and throughput a colleague and I did an obnoxious level of experimentation with read sizes (slurps of data) to minimize io wait queues and memory pressure. Being aligned with the cache-line size, either 1x or 2x was the winner. Good low level close to the hardware c fun for sure.

boshalfoshal1y ago

I think cache coherency protocols are less intuitive and less talked about when people discuss about caching, so it would be nice to have some discussion on that too.

But otherwise this is a good general overview of how caching is useful.

ThatNiceGuyy1y ago

Great article. I have always had an open question in my mind about struct alignment and this explained it very succinctly.

seany621y ago

Super interesting. Thank you!

j / k navigate · click thread line to collapse

33 comments

27 comments · 12 top-level

hinkley1y ago· 4 in thread

Wait wait wait.

M2 processors have 128 byte wide cache lines?? That's a big deal. We've been at 64 bytes since what, the Pentium?

CyberDildonics1y ago

In practicality intel CPUs have pulled down 128 bytes at a minimum when you access memory for a very long time.

64 byte cache lines are there an part of other alignment boundaries for things like atomics, but accessing memory pull down two cache lines at time.

Tuna-Fish1y ago

Memory fetch length and coherency unit size are different things, and the latter matters much more.

monocasa1y ago

Yeah, 64 bytes is kind of an unstated x86 thing. It'd be hell for them to change that, a lot of perf conscious code aligns to 64 byte boundaries to combat false sharing.

kllrnohj1y ago

all ARM-designed cores are also 64-bytes. It's not just an x86 thing

2 more replies

dangoldin1y ago· 4 in thread

tux19681y ago

JIT compilers can't compensate for poorly organized data. Ultimately, understanding these low-level concepts, affect high-level algorithm design and selection.

kllrnohj1y ago

AOT'd languages could re-arrange a struct for better locality however the majority (if not all) languages rigidly require the fields are laid out in the order defined for various reasons.

gpderetta1y ago

> the majority (if not all) languages rigidly require the fields are laid out in the order defined for various reasons.

The as-if rule gives an escape hatch, although in practice it is hard to take advantage of, especially without whole program optimization.

almostgotcaught1y ago

eikenberry1y ago· 3 in thread

In case you are wondering about your cache-line size on a Linux box, you can find it in sysfs.. something like..

    cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size

Hello711y ago

  grep . /sys/devices/system/cpu/cpu*/cache/index*/coherency_line_size

would be better, but

  lscpu -C

is more useful.

eikenberry1y ago

Didn't know about 'lscpu -C'.. thanks!

wyldfire1y ago

Another (more portable) way to get the cache line size is to use std::hardware_(destructive|constructive)_interference_size from C++17.

slashdave1y ago· 2 in thread

"On the other hand, data coming from main memory cannot be assumed to be sequential and the data cache implementation will try to only fetch the data that was asked for."

Not correct. Prefetching has been around for a while, and rather important in optimization.

gustavopezzi1y ago

Hi. Gustavo Pezzi here (from pikuma.com). You're absolutely right; I'll make sure to pass this to Gabriel and I'm sure we'll add this detail to the blog post soon. Thank you for the heads up.

slashdave1y ago

branko_d1y ago· 2 in thread

Why is the natural alignment of structs equal to the size of their largest member?

kllrnohj1y ago

https://www.kernel.org/doc/html/latest/core-api/unaligned-me... has a lot more general context on alignment and why it's important

jcranmer1y ago

wyldfire1y ago

[1] https://people.freebsd.org/~lstewart/articles/cpumemory.pdf

[2] https://samueleresca.net/analysis-of-what-every-programmer-s...

emschwartz1y ago

ptelomere1y ago

It forced you to think in terms of: [array of input data -> operation -> array of intermediate data -> operation -> array of final output data]

The good thing, by doing this we also get automatic performance increase on the xbox360 because we were consciously ? unconsciously ? optimizing for cache usage.

DLA1y ago

I learned so much from this blog and from the discussion. HN is so awesome. +1 for learning about lacpu -C here.

boshalfoshal1y ago

I think cache coherency protocols are less intuitive and less talked about when people discuss about caching, so it would be nice to have some discussion on that too.

But otherwise this is a good general overview of how caching is useful.

ThatNiceGuyy1y ago

Great article. I have always had an open question in my mind about struct alignment and this explained it very succinctly.

seany621y ago

Super interesting. Thank you!

j / k navigate · click thread line to collapse