adrian_b 4y ago

Any hardware adder provides almost for free the overflow detection output (at less than the cost of an extra bit, so less than 1/64 of a 64-bit adder).

So anyone who thinks about an efficient hardware implementation would expose the overflow bit to the software.

A hardware implementation that requires multiple additions to provide the complete result of a single addition can be called in many ways, but certainly not "efficient".
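To make the cost claim concrete, here is a minimal bit-level sketch of a ripple-carry adder in Python. It is a model for illustration, not a description of any particular CPU; the function name `ripple_add` and the 8-bit default width are assumptions. The signed-overflow output is just the XOR of the carry into and the carry out of the top bit, both of which the adder computes anyway.

```python
def ripple_add(a, b, width=8):
    """Add two width-bit values bit by bit, as a ripple-carry adder would.

    Returns (sum, carry_out, overflow): carry_out is the unsigned overflow
    flag, overflow is the signed (two's complement) overflow flag.
    """
    carry = 0
    carry_into_msb = 0
    result = 0
    for i in range(width):
        if i == width - 1:
            carry_into_msb = carry  # remember the carry entering the top bit
        x = (a >> i) & 1
        y = (b >> i) & 1
        s = x ^ y ^ carry                    # full-adder sum bit
        carry = (x & y) | (carry & (x ^ y))  # full-adder carry out
        result |= s << i
    # Signed overflow: one extra XOR on carries the adder already produces.
    overflow = carry_into_msb ^ carry
    return result, carry, overflow
```

For example, `ripple_add(0x7F, 0x01)` sets only the signed overflow flag, while `ripple_add(0xFF, 0x01)` sets only the carry.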


ansible 4y ago

> Any hardware adder provides almost for free the overflow detection output (at less than the cost of an extra bit, so less than 1/64 of a 64-bit adder). So anyone who thinks about an efficient hardware implementation would expose the overflow bit to the software.

Ah, but where do you put that bit that you got for free?

A condition-code register, global to the processor / core state? That worked terrifically for single-issue microcontrollers back in the 1980s. Now you need register renaming, and all the expensive logic around it, to track which overflow bit follows which earlier add operation. That's what's done now for old ISAs, and it is generally disliked for several reasons (complexity being chief among them).

Well, you could stuff that bit into another general-purpose register, but then you kind of want to specify 4 registers for the add instruction. Now where are the bits to encode a 4th register in a new instruction format? RISC-V has room to grow for extensions, but another 5 bits for another register is a big ask.

imtringued 4y ago

I have no clue about flags but why not just store the flags with the register? Each register would have 32+r bits where r is the number of flags.

ansible 4y ago

> I have no clue about flags but why not just store the flags with the register? Each register would have 32+r bits where r is the number of flags.

That sort of design can be done, but that just pushes the problem around. Let's look at the original ARM 64 bit code:

    adds  x12, x6, x10
    adcs  x13, x7, x11

The second add-with-carry uses the global carry bit; it isn't passed as an argument to the adcs instruction. So if you store the carry bit with the x12 register, you would then need to name x12 in the adcs instruction on the next line. That means a new instruction format for adcs that can specify four registers.

You could change the semantics, where the add instructions use one register as the source and the destination, like on x86-64, but that's a whole 'nother discussion on why that is and isn't done on various architectures.
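A rough Python sketch of the "flags stored with the register" idea shows where the fourth operand comes from. Everything here is hypothetical: the `TaggedRegs` class, its method signatures, and the `add128` helper are illustrations only, not any real ISA or library.

```python
MASK64 = (1 << 64) - 1


class TaggedRegs:
    """Hypothetical register file: each register carries its own carry bit."""

    def __init__(self):
        self.value = {}
        self.carry = {}

    def adds(self, rd, rn, rm):
        """rd = rn + rm; the carry out is stored alongside rd."""
        total = self.value[rn] + self.value[rm]
        self.value[rd] = total & MASK64
        self.carry[rd] = total >> 64

    def adcs(self, rd, rn, rm, rc):
        """rd = rn + rm + carry(rc). Note the fourth operand, rc."""
        total = self.value[rn] + self.value[rm] + self.carry[rc]
        self.value[rd] = total & MASK64
        self.carry[rd] = total >> 64


def add128(a, b):
    """Add two 128-bit values as two 64-bit limbs, wrapping at 2**128."""
    regs = TaggedRegs()
    regs.value.update({"x6": a & MASK64, "x7": a >> 64,
                       "x10": b & MASK64, "x11": b >> 64})
    regs.adds("x12", "x6", "x10")
    regs.adcs("x13", "x7", "x11", "x12")  # must name x12 to find the carry
    return (regs.value["x13"] << 64) | regs.value["x12"]
```

With a global flags register the `rc` argument disappears, which is exactly the trade-off: three register fields plus hidden shared state, or four register fields and no shared state.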

mlyle 4y ago

> A hardware implementation that requires multiple additions to provide the complete result of a single addition can be called in many ways, but certainly not "efficient".

1. There aren't multiple additions in the recommended sequences. Unsigned is add, bltu; signed with one known sign is add, blt; signed in general is add, slt, slti, bne.

2. These instruction sequences are specified so that an instruction decoder can treat the instructions following the add as part of one "very wide" instruction that checks an overflow flag, if a hardware implementation so chooses.
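The three sequences named above can be modeled in Python as follows. This is a sketch assuming 64-bit registers holding raw bit patterns in the range 0 to 2**64 - 1; the function names are made up, and the register names in the comments are only illustrative.

```python
XLEN = 64
MASK = (1 << XLEN) - 1


def to_signed(x):
    """Interpret an XLEN-bit pattern as a two's complement integer."""
    return x - (1 << XLEN) if x >> (XLEN - 1) else x


def add(a, b):
    """Model of a plain add: wraps silently and produces no flags."""
    return (a + b) & MASK


def unsigned_add_overflows(t1, t2):
    t0 = add(t1, t2)   # add  t0, t1, t2
    return t0 < t1     # bltu t0, t1, overflow


def signed_add_positive_overflows(t1, imm):
    # Valid only when the addend is known to be non-negative.
    t0 = add(t1, imm)                     # addi t0, t1, +imm
    return to_signed(t0) < to_signed(t1)  # blt  t0, t1, overflow


def signed_add_overflows(t1, t2):
    t0 = add(t1, t2)                      # add  t0, t1, t2
    t4 = to_signed(t0) < to_signed(t1)    # slt  t4, t0, t1
    t3 = to_signed(t2) < 0                # slti t3, t2, 0
    return t3 != t4                       # bne  t3, t4, overflow
```

The general signed check relies on one fact: the sum is less than one operand exactly when the other operand is negative, unless the addition overflowed.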

adrian_b (OP) 4y ago

Even if a dedicated comparator can be a little cheaper than a full adder, all CPUs already have full adders/subtractors, so all the comparisons are done by subtraction in the same adder/subtractor.

So your recommended sequence has 4 additions done in the adder/subtractor of the ALU, because all comparisons, including the compare-and-branch instructions, count as additions, from the point-of-view of the energy consumption and execution time.
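As a sketch of why a comparison counts as a pass through the adder: an unsigned `a < b` can be read off the borrow of `a - b`, which the same adder computes as `a + ~b + 1`. The Python model below assumes a 64-bit datapath; the names `sub_with_borrow` and `less_than_unsigned` are hypothetical.

```python
WIDTH = 64
MASK = (1 << WIDTH) - 1


def sub_with_borrow(a, b):
    """Compute a - b on a WIDTH-bit adder as a + ~b + 1.

    Returns (difference, borrow), where borrow is 1 when a < b (unsigned).
    """
    total = a + ((~b) & MASK) + 1  # same adder, inverted operand, carry-in 1
    carry_out = total >> WIDTH
    return total & MASK, 1 - carry_out  # no carry out means a borrow occurred


def less_than_unsigned(a, b):
    """Unsigned a < b, decided from the borrow of a - b."""
    _, borrow = sub_with_borrow(a, b)
    return borrow == 1
```

Under this model, each compare or compare-and-branch in a sequence exercises the full carry chain once, just as an add does.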

mlyle 4y ago

> So your recommended sequence has 4 additions done in the adder/subtractor of the ALU, because all comparisons, including the compare-and-branch instructions, count as additions, from the point-of-view of the energy consumption and execution time.

That holds only for the signed overflow case with an addend unknown at compile time, and only if the processor does nothing fancy (fusion or substitution of operations) to elide the recommended sequence. It does not hold for the other two cases, and not with fusion.
