
dragontamer 6y ago

True, but in practice, throughput is what you should be thinking about as a low-level performance programmer.

CPUs can execute many, many, many instructions in parallel. If all your data fits inside of L1 cache (4 clocks of latency), it's actually pretty easy to achieve 2 instructions per clock or more!

Furthermore, modern CPUs are out-of-order processors. So the processor will automatically execute independent instructions to "fill up your latency", at least to some extent.

CPUs even have enough buffering to help cover main memory fetches (Skylake's reorder buffer has 200+ entries, against the 200+ clocks of latency on a DDR4 memory read or write). As long as you have "enough independent work to do," it's not too bad. Compilers usually figure out independent chunks of work as they unroll loops, for example.

In my experience, the loop accounting (for (int i = 0; i < 100; i++)) executes inside of that latency, in parallel with the work inside of the loop. So there's almost always work to do, at least enough to fill the ~5 to 10 clocks of latency from the "misc" functions in any bit of code.
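A rough sketch of what "independent work" looks like in C, as opposed to a serial chain. The function names and the multiplier K are made up for illustration, and the two functions deliberately compute different things. I haven't measured either loop; this is only about the shape of the dependencies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical odd multiplier, chosen only for illustration. */
#define K 0x9E3779B97F4A7C15ULL

/* Independent iterations: each multiply depends only on a[i], so an
   out-of-order core is free to overlap several multiplies, plus the
   loop accounting (i++, i < n). Only the add is carried between
   iterations. */
uint64_t sum_scaled(const uint64_t *a, size_t n) {
    assert(a != NULL || n == 0);
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += a[i] * K;
    }
    return sum;
}

/* Dependent iterations: each multiply needs the previous h, so the
   multiplies form one serial chain and cannot overlap each other. */
uint64_t fold_chained(const uint64_t *a, size_t n) {
    assert(a != NULL || n == 0);
    uint64_t h = 0;
    for (size_t i = 0; i < n; i++) {
        h = (h ^ a[i]) * K;
    }
    return h;
}
```

In the first loop the multiplies can be in flight together, so throughput is what matters. In the second, the latency of each multiply sits directly on the critical path.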

The hard part is coming up with work to do for ~50ns of latency (e.g., DDR4 reads or writes).


gpderetta 6y ago

Actually no, latency is usually a bottleneck before throughput is.

Edit: for example, when accessing a hash table, the hash computation is in the critical path.

dragontamer (OP) 6y ago

> Edit: for example, when accessing a hash table, the hash computation is in the critical path.

Hmmm... I think I'm a bit biased because of something I've been writing recently, where different iterations of a loop were independent.

In this case, you're right. The hash calculation is on the critical path and therefore is latency bound.
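For concreteness, a minimal sketch of a lookup where the hash sits on the critical path. Everything here (slot_t, mix, table_put, table_get, SLOTS, the mixing constant) is hypothetical and only for illustration, not any particular library's hash table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SLOTS 1024u /* hypothetical capacity; must be a power of two */

typedef struct { uint64_t key; uint64_t value; bool used; } slot_t;

/* Shift, multiply, shift: each step needs the previous result. */
static uint64_t mix(uint64_t k) {
    k ^= k >> 33;
    k *= 0xFF51AFD7ED558CCDULL; /* odd constant, chosen for illustration */
    k ^= k >> 33;
    return k;
}

/* Insert or overwrite with linear probing; false if the table is full. */
bool table_put(slot_t *t, uint64_t key, uint64_t value) {
    assert(t != NULL);
    size_t i = mix(key) & (SLOTS - 1);
    for (size_t probes = 0; probes < SLOTS; probes++) {
        if (!t[i].used || t[i].key == key) {
            t[i] = (slot_t){ key, value, true };
            return true;
        }
        i = (i + 1) & (SLOTS - 1);
    }
    return false;
}

/* The address of the first load, t[i], is unknown until mix() finishes,
   so the hash latency is added in front of the memory latency. */
bool table_get(const slot_t *t, uint64_t key, uint64_t *out) {
    assert(t != NULL && out != NULL);
    size_t i = mix(key) & (SLOTS - 1);
    for (size_t probes = 0; probes < SLOTS; probes++) {
        if (!t[i].used) return false;
        if (t[i].key == key) { *out = t[i].value; return true; }
        i = (i + 1) & (SLOTS - 1);
    }
    return false;
}
```

Within a single lookup there is nothing independent to overlap with mix(). Overlap is only possible if the caller has several unrelated lookups in flight.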

gpderetta 6y ago

> Hmmm... I think I'm a bit biased because of something I've been writing recently, where different iterations of a loop were independent.

That's a great place to be in :D.

BTW, I haven't tried to implement a hash function in a while (I remember playing with carryless multiplication), but IIRC 6 clock cycles is not too bad.
