dragontamer 5y ago

> At best, your CPU can think a couple dozen cycles ahead of what it is currently executing.

The 200+ entry reorder buffer says otherwise.

Loads/stores can be reordered across a window of 200+ in-flight micro-ops on modern Intel Skylake (2015 through 2020) CPUs. And it's about to get a bump to a 300+ entry reorder buffer in Ice Lake.

Modern CPUs are designed to "think ahead" across almost the entirety of DDR4 RAM latency, reordering instructions to keep the CPU pipes as full as possible (at least, if the underlying assembly code has enough ILP to fill the pipelines while waiting for RAM).
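To make the ILP caveat concrete, here is a minimal C sketch (not from the original comment; the function names are made up for illustration). The first loop is one long dependency chain, so the reorder buffer has little independent work to overlap. The second splits the sum into four independent chains that an out-of-order core is free to run in parallel. Whether this actually helps depends on the compiler, flags, and CPU, so treat it as an illustration rather than a benchmark.

```c
#include <stddef.h>

/* One accumulator: every add depends on the previous add, so the
   additions form a single serial dependency chain. */
double sum_chained(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four accumulators: the four chains do not depend on each other,
   so an out-of-order core may overlap them. */
double sum_split(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* leftover elements when n is not a multiple of 4 */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

Because floating-point addition is not associative, the two functions can differ in the last bits for general inputs. That is also why a compiler will not make this transformation on its own without flags that relax floating-point semantics.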

> Something like Link Time Optimization can be done trivially with a compiler, but it would take an army of engineers decades of work to be able to implement in hardware.

You might be surprised at what a modern branch predictor is doing.

If your "call rax" indirect call constantly calls the same location, the branch predictor will remember that location these days.
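For readers who want to see where such a call comes from, here is a small C sketch (hypothetical names, not from the original comment). A call through a function pointer typically compiles to an indirect call on x86-64, although which register holds the target is up to the compiler, and an optimizer that can see the caller may inline it away entirely.

```c
#include <stddef.h>

typedef int (*op_fn)(int);

int add_one(int x) { return x + 1; }
int twice(int x)   { return x * 2; }

/* The target of op is only known at run time, so unless the compiler
   inlines or constant-propagates it, this is an indirect call. */
long apply_sum(op_fn op, const int *v, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += op(v[i]);   /* same target on every iteration */
    return total;
}
```

Within one call to `apply_sum` the target never changes, which is the easy case described above: the predictor only has to remember a single destination.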

KMag 5y ago

With proper profiling (say, reservoir sampling of instructions causing pipeline stalls), and dynamic recompilation/reoptimization like IBM's project DAISY / HP's Dynamo, you may get performance near a modern out-of-order desktop processor at the power budget of a modern in-order low-power chip.

You get instructions scheduled based on actual dynamically measured usage patterns, but you don't pay for dedicated circuits to do it, and you don't re-do those calculations in hardware for every single instruction executed.

It's not a guaranteed win, but I think it's worth exploring.
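As a rough sketch of the reservoir sampling idea mentioned above (an illustration only, not how DAISY or Dynamo were implemented), the C code below keeps a fixed-size uniform sample from a stream of stall addresses using the classic Algorithm R. The `next_random` helper and the idea of feeding it an array of addresses are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Small xorshift64 generator so the sketch has no external dependencies.
   Hypothetical helper; *state must be nonzero. */
static uint64_t next_random(uint64_t *state) {
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Algorithm R: keep a uniform random sample of up to k items from a
   stream of n items, using O(k) memory. The "stream" here stands in
   for addresses of instructions that caused pipeline stalls.
   Returns the number of reservoir slots filled, i.e. min(n, k). */
size_t reservoir_sample(const uint64_t *stream, size_t n,
                        uint64_t *reservoir, size_t k, uint64_t *rng) {
    size_t filled = 0;
    for (size_t i = 0; i < n; i++) {
        if (filled < k) {
            reservoir[filled++] = stream[i];       /* fill phase */
        } else {
            /* Pick j in [0, i]; item i replaces slot j if j < k.
               The modulo introduces a tiny bias, ignored in this sketch. */
            size_t j = (size_t)(next_random(rng) % (i + 1));
            if (j < k)
                reservoir[j] = stream[i];
        }
    }
    return filled;
}
```

The appeal for profiling is that memory use stays fixed no matter how many stalls are observed, so the sampler can run for the whole lifetime of the program.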

dragontamer (OP) 5y ago

But once you do that, then you hardware-optimize the interpreter, and then it's no longer called a "dynamic recompiler", but instead a "frontend to the microcode". :-)

KMag 5y ago

No doubt there is still room for a power-hungry out-of-order speed demon of an implementation, but you need to leave the door open for something with approximately the TDP of a very-low-power in-order processor and performance closer to an out-of-order machine.

branko_d 5y ago

Neo: What are you trying to tell me? That I can dodge "call rax"?

Morpheus: No, Neo. I'm trying to tell you that when you're ready, you won't need "call rax".

---

A compiler has access to optimizations at a higher level of abstraction than what a CPU can do. For example, the compiler can eliminate the call completely (i.e. inline the function), convert a dynamic dispatch into a static one (if it can prove that an object will always have a specific type at the call site), decide where to favor small code over fast code (via profile-guided optimization), switch from non-optimized code (with a short start-up time) to optimized code mid-execution (tiered compilation in JITs), move computation outside loops (if it can prove that the result is the same in all iterations), and many other things...
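To illustrate the last item in that list, here is a hand-written before/after of moving a computation out of a loop (a minimal C sketch with made-up function names; a real compiler does this on its intermediate representation, not on source text).

```c
#include <stddef.h>

/* As written: scale * scale + bias is recomputed on every iteration,
   even though its value never changes inside the loop. */
void fill_naive(int *out, size_t n, int scale, int bias) {
    for (size_t i = 0; i < n; i++)
        out[i] = (int)i + (scale * scale + bias);
}

/* After hoisting: the loop-invariant expression is computed once.
   This is legal because scale and bias are not modified in the loop. */
void fill_hoisted(int *out, size_t n, int scale, int bias) {
    const int k = scale * scale + bias;
    for (size_t i = 0; i < n; i++)
        out[i] = (int)i + k;
}
```

The point is that the proof ("same result in all iterations") requires a view of the whole loop, which the compiler has and the CPU, looking at a window of instructions, does not.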

saagarjha 5y ago

There is no way a compiler can do anything for an indirect call that goes one way for a while and the other afterwards. A branch predictor can get both with, if not 100% accuracy, about as close to it as you can possibly get.

branko_d 5y ago

Sure.

My point was simply that the compiler may be in a position to disprove the assumption that this call is in fact dynamic (it may actually be static) or that it has to be a call in the first place (it can inline the function instead).

I'm certainly not arguing against branch predictors.
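A hand-written sketch of that idea in C (the `shape` type and function names are hypothetical, chosen only for illustration): the first function dispatches through a function pointer, while the second is what the code can collapse to if the compiler can prove every element is a rectangle, so the call becomes static and is then inlined.

```c
#include <stddef.h>

typedef struct shape {
    int (*area)(const struct shape *self);   /* dynamic dispatch slot */
    int w, h;
} shape;

int rect_area(const shape *s) { return s->w * s->h; }

/* Dynamic: the callee is looked up through the pointer at run time. */
int total_area_dynamic(const shape *shapes, size_t n) {
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += shapes[i].area(&shapes[i]);
    return total;
}

/* Static and inlined: valid only under the assumption that every
   shape's area pointer is rect_area. No call remains at all. */
int total_area_static(const shape *shapes, size_t n) {
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += shapes[i].w * shapes[i].h;
    return total;
}
```

When that proof is not available, the dynamic version stays, and the branch predictor is what keeps it fast.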
