0 points · d110af5ccf · 5y ago · 0 comments

You seem to have several fairly fundamental misunderstandings about CPUs at a low level.

> include stack locations in your register renaming scheme

Registers aren't related to the stack. "The" stack is just RAM being accessed in a specific cache friendly pattern, with additional optimizations (if you use specific registers) from the hardware in the form of the stack engine. The compiler explicitly loads and stores to and from the registers named by the ISA. Register renaming has absolutely nothing to do with the stack.

When the CPU can tell that a later instruction doesn't depend on the previous value of a register, it's free to rename it. The result is that two independent physical registers get used even though only one name was ever directly referenced. In reality, there are a _huge_ number of registers available on modern processors. Estimates place Skylake, Zen, and Cortex-X1 at 200+, with the M1 at 600+. The ISA just doesn't provide a way to access them directly. (If you want to read about this, the terms to look up are physical register file and reorder buffer.)
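To make the renaming idea concrete, here is a minimal Python sketch of a rename table. It is a toy model, not how any particular core implements it: the `rename` function, the `p0`, `p1`, ... physical register names, and the unlimited supply of physical registers are all assumptions made for illustration.

```python
from itertools import count

def rename(instructions, arch_regs):
    """Map architectural register names to fresh physical registers.

    Each instruction is (dest, [sources]). Every write gets a new physical
    register, so a later write to the same architectural name no longer
    conflicts with earlier readers of the old value.
    """
    fresh = count()
    # Hypothetical starting state: one physical register per architectural one.
    table = {reg: f"p{next(fresh)}" for reg in arch_regs}
    renamed = []
    for dest, sources in instructions:
        phys_sources = [table[s] for s in sources]  # read current mappings first
        table[dest] = f"p{next(fresh)}"             # then allocate a new destination
        renamed.append((table[dest], phys_sources))
    return renamed

# Two independent chains that both reuse the name "eax".
program = [
    ("eax", ["ebx"]),  # eax = f(ebx)
    ("ecx", ["eax"]),  # ecx = g(eax), depends on the first write
    ("eax", ["edx"]),  # eax = f(edx), reuses the name but is independent
    ("esi", ["eax"]),  # esi = g(eax), depends on the second write
]
renamed = rename(program, ["eax", "ebx", "ecx", "edx", "esi"])
for dest, sources in renamed:
    print(dest, "<-", sources)
```

After renaming, the two writes to `eax` land in different physical registers, so the second chain no longer has to wait for the first one to finish with the name.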

Also, there is a giant out of order buffer for stores waiting to be written back to L1. That buffer does indeed have to keep track of cache locations, which directly map to memory addresses, which sometimes happen to refer to stack locations. So in a sense, what you suggested already exists. (If you want to read about this, the term to look up is store buffer.)

> it seems much easier to find parallelism if a bunch of adjacent instructions are explicitly independent

That would indeed make things simpler in some cases. However, many operations such as loading a value into a register (e.g. mov eax, [addr]) or zeroing it (e.g. xor eax, eax) explicitly break the dependency chain by definition. Cases where the CPU fails to properly account for this are documented as false dependencies.

> the compiler can then help the cpu keeping all those instruction units fed

The "compiler handles ordering" thing was tried with Itanium. It seems it didn't go so well.

The CPU is free to simultaneously load two different pieces of data into the "same" register and execute two independent instruction streams on that "single" register thanks to renaming. Speculative execution helps when the CPU can't be completely certain that there isn't a dependency.

For particularly complicated sequences, the compiler spilling due to running out of named registers could indeed pose an issue. However, the CPU is free to elide a store followed by a load if it determines that the address is the same. (If you want to read about this, terms to look up include store-to-load forwarding and load-hit-store.)
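As a rough illustration of store-to-load forwarding, the following Python sketch models a store buffer that answers a load from a pending store when the addresses match. It is a deliberately simplified model: the `StoreBuffer` class and its methods are hypothetical names, and it ignores everything that makes the real thing hard (partially overlapping accesses, addresses that are not yet known, and the ISA's memory ordering rules).

```python
class StoreBuffer:
    """Toy store buffer: stores wait here before being written to memory."""

    def __init__(self, memory):
        self.memory = memory   # dict: address -> value, stands in for L1/RAM
        self.pending = []      # (address, value) pairs, oldest first

    def store(self, address, value):
        self.pending.append((address, value))

    def load(self, address):
        # Search youngest to oldest so the most recent matching store wins.
        for pending_address, value in reversed(self.pending):
            if pending_address == address:
                return value, True                 # forwarded from the buffer
        return self.memory.get(address, 0), False  # served from memory

    def drain(self):
        # Commit pending stores to memory in program order.
        for address, value in self.pending:
            self.memory[address] = value
        self.pending.clear()

memory = {0x1000: 7}
buffer = StoreBuffer(memory)
buffer.store(0x2000, 42)               # e.g. a spill to a stack slot
spilled = buffer.load(0x2000)          # reload hits the pending store
untouched = buffer.load(0x1000)        # no pending store, goes to memory
visible_before_drain = 0x2000 in memory
buffer.drain()
```

In this model the reload of the spilled value never touches `memory`, which is the sense in which a store followed by a load to the same address can be short-circuited.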

4 comments · 1 top-level

ant6n · 5y ago · 3 in thread

If you elide a store followed by a load, you can effectively treat memory as registers and include them in your renaming scheme.

I know Itanium didn't work, but that's because there the compiler is supposed to do all the reordering work. That's different from allowing the compiler to explicitly define that instructions are independent by having more registers.

d110af5ccf (OP) · 5y ago

The operations are somewhat different though. Store-to-load forwarding is more complicated and doesn't completely eliminate the operation, it just significantly reduces the cycle count when successful.

Although apparently Zen 2 changed this and can pull off zero latency. (https://www.agner.org/forum/viewtopic.php?t=41)

Some general background: (https://travisdowns.github.io/blog/2019/06/11/speed-limits.h...)

ant6n · 5y ago

Let's just pick a simple example, for an inner loop:

    a = m[i+1] + b
    c = m[i+3] + c
    e = m[i+7] + d

Assume you only have 3 registers, in a RISC-y architecture. Every statement becomes something like

    r1 = *pb      // load b
    r2 = r0[1]    // m[i+1]
    r1 = r1 + r2  // a = ...
    *pa = r1      // store a

Since all registers are used, and all but two instructions are dependent, the blocks have to follow one another in the assembly. There's also spilling of the b, c, d variables: they have to be read from memory (a load which could be elided). Assuming no reorder buffer, each block of instructions runs in three cycles (the first two instructions are independent), even though the top-level statements are independent.

If you want to run all the statements four instructions at a time, you need a reorder buffer that covers the whole sequence (12 instructions). (Imagine if b,c,d get modified inside the inner loop and spilled into memory, you have to track memory locations in order to do register renaming.)

Now let's assume you have 6 registers. Now all variables fit in registers and the compiler can easily interleave the code, giving a sequence of 3 or 4 independent instructions at a time. If you want to run 4 instructions at the same time, you need no reorder buffer.
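The comparison can be sketched with a toy in-order issue model in Python. Everything here is an assumption made for illustration: the `in_order_cycles` and `statement` helpers are hypothetical, every instruction (including loads) takes one cycle, memory dependencies are ignored, and the interleaved version simply gives each statement its own pair of registers next to the base register `r0`, which is not necessarily the exact register count discussed above.

```python
def in_order_cycles(program, width=4):
    """Count cycles on a toy in-order core with no renaming and unit latency.

    Each instruction is (dest, sources); dest is None for a store.
    Instructions issue in program order, up to `width` per cycle, and a cycle
    ends at the first instruction that conflicts with one already issued in it.
    """
    cycles = 0
    i = 0
    while i < len(program):
        written, read = set(), set()
        issued = 0
        while i < len(program) and issued < width:
            dest, sources = program[i]
            raw = any(s in written for s in sources)  # needs a result from this cycle
            reuse = dest is not None and (dest in written or dest in read)  # WAW or WAR
            if raw or reuse:
                break
            if dest is not None:
                written.add(dest)
            read.update(sources)
            issued += 1
            i += 1
        cycles += 1
    return cycles

def statement(t1, t2):
    # load, load, add, store: the shape of one statement in the example above
    return [(t1, []), (t2, ["r0"]), (t1, [t1, t2]), (None, [t1])]

# Same two temporaries reused by every statement: blocks must run back to back.
few_registers = statement("r1", "r2") * 3

# One pair of temporaries per statement, interleaved step by step.
blocks = [statement("r1", "r2"), statement("r3", "r4"), statement("r5", "r6")]
interleaved = [block[step] for step in range(4) for block in blocks]

print(in_order_cycles(few_registers), in_order_cycles(interleaved))
```

Under these assumptions the register-starved sequence takes three cycles per statement, while the interleaved one issues groups of 4, 3, 3 and 2 instructions. The unit-latency assumption is doing a lot of work here; real load latency varies at run time.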

This is a kind of specific example, but it shows that if you have more registers (i.e. ARM vs x86), the compiler can more easily interleave instructions, which can help reduce the number of instructions that need to be in the reorder buffer. Or with the same size reorder buffer, it's easier to find more independent instructions and keep all the execution units fed. Or, when jumping to some code that's not in the pipeline or icache, it allows more instructions to run in parallel sooner, when only a small number of instructions are decoded and in the reorder buffer.

d110af5ccf (OP) · 5y ago

I really don't see what you're getting at here. Even limited to only three named registers I don't think the example you provided would pose an issue on x86. (I'm not very familiar with ARM but I don't think it would pose any issue there either.)

In practice, x86_64 works just fine for HPC number crunching code. Outside of some serious number crunching, when are you going to have more live values than named registers, have instruction streams whose output depends on _all_ of those values (which is why they would be live), and also those streams complete so quickly that you stall on the next set of loads? And you have absolutely no other useful work to do? Honestly I think you're being silly.

Historically, I understand that the 32 bit version of x86 did have scheduling challenges surrounding function calls. The 64 bit version of the ISA expanded the number of named registers and (as far as I understand things) it largely resolved the issue.

Also note that typical hardware can sustain a surprisingly large number of loads per clock. You just need to find something useful to do while you wait for the load to complete. In case you really can't there's also SMT. Really though, the PRF and ROB are only so large.

> If you want to run 4 instructions at the same time, you need no reorder buffer.

You always need a reorder buffer if you want to achieve good performance. Among other issues, the compiler can't predict the latency for each load in advance due to caching behavior depending on the runtime state of the full computer system. I previously mentioned Itanium. It's directly relevant here.

> Imagine if b,c,d get modified inside the inner loop and spilled into memory, you have to track memory locations in order to do register renaming.

No. You can't just rename registers any longer. A store to memory means the memory model for the ISA gets involved. Things become significantly more complicated. The store buffer exists specifically to deal with such issues efficiently on an OoO core. Seriously, go read about it. It's astoundingly complicated for any OoO core regardless of the ISA.

> the compiler can more easily interleave instructions, which can help reduce the number of instructions that need to be in the reorder buffer

Unless I have a serious misunderstanding (I don't design hardware, so I might) everything passes through the reorder buffer. Every instruction is speculative until all previous instructions have retired. (https://news.ycombinator.com/item?id=20165289)
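A minimal Python sketch of the "execute out of order, retire in order" behaviour, under loud assumptions: `retire_order` is a hypothetical helper, the completion cycles are made-up inputs standing in for unpredictable load latency, and the model has unlimited retire width and an unbounded buffer.

```python
from collections import deque

def retire_order(finish_cycles):
    """Toy reorder buffer: instructions finish out of order, retire in order.

    finish_cycles[i] is the (assumed) cycle at which instruction i finishes
    executing. Returns a list of (instruction, retire_cycle) pairs.
    """
    rob = deque(enumerate(finish_cycles))  # entries stay in program order
    retired = []
    cycle = 0
    while rob:
        cycle += 1
        # Only the oldest entry may retire, and only once it has finished.
        while rob and rob[0][1] <= cycle:
            index, _ = rob.popleft()
            retired.append((index, cycle))
    return retired

# Instruction 0 is a slow load (say, a cache miss); the rest finish early.
slow_head = retire_order([5, 1, 1, 2])
mixed = retire_order([1, 3, 2])
print(slow_head, mixed)
```

The younger instructions in `slow_head` finish long before the load at the head, but none of them can retire until it does, which is why they all sit in the buffer regardless of how the compiler ordered them.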
