So anyone who thinks about an efficient hardware implementation would expose the overflow bit to the software.
A hardware implementation that requires multiple additions to provide the complete result of a single addition can be called many things, but certainly not "efficient".
Ah, but where do you put that bit that you got for free?
A condition codes register, global to the processor / core state? That worked terrifically for single-issue microcontrollers back in the 1980s. Now you need register renaming, and all the expensive logic around that to track which overflow bit follows which previous add operation. That's what's being done now for old ISAs, and it is generally disliked for several reasons (complexity being chief among them).
Well, you could stuff that bit into another general purpose register, but then you kind of want to specify 4 registers for the add command. Now where are the bits to encode a 4th register in a new instruction format? RISC-V has room to grow for extensions, but another 5 bits for another register is a big ask.
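To put a rough number on that "big ask", here is a small bit-budget sketch for the standard 32-bit RISC-V R-type format. The field widths are those of the base ISA; the names `R_TYPE` and `bits_left_for_function_codes` are made up for this illustration.

```python
# Field widths (in bits) of the 32-bit RISC-V R-type instruction format.
R_TYPE = {"opcode": 7, "rd": 5, "funct3": 3, "rs1": 5, "rs2": 5, "funct7": 7}

# 32 integer registers, so each register specifier costs 5 bits.
REGISTER_FIELD_BITS = 5


def bits_left_for_function_codes(extra_registers: int) -> int:
    """Bits of funct7 remaining after carving out extra register fields.

    Hypothetical helper for this example only.
    """
    return R_TYPE["funct7"] - extra_registers * REGISTER_FIELD_BITS


# The six fields fill the 32-bit instruction word exactly.
assert sum(R_TYPE.values()) == 32

print(bits_left_for_function_codes(0))  # three-register form keeps all of funct7
print(bits_left_for_function_codes(1))  # a fourth register leaves very little
```

The existing R4-type format (used for the fused multiply-add instructions) makes exactly this trade: rs3 takes 5 bits of what would be funct7, and only a 2-bit funct2 remains to tell operations apart.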
That sort of design can be done, but that just pushes the problem around. Let's look at the original ARM 64 bit code:
adds x12, x6, x10
adcs x13, x7, x11
The second add with carry uses the global carry bit; it isn't passed as an argument to the adcs instruction. So if you store the carry bit with the x12 register, you would then need to specify x12 in the adcs instruction on the next line. So you need a new instruction format for adcs that can specify four registers. You could change the semantics, where the add instructions use one register as both the source and the destination, like on x86-64, but that's a whole 'nother discussion on why that is and isn't done on various architectures.
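A minimal Python model of that two-instruction sequence, to make the hidden dependency visible. It assumes x6/x7 hold the low/high halves of one 128-bit operand and x10/x11 the other (the thread doesn't say so explicitly), and the function names are just for illustration. The carry is returned as an explicit value here, which is precisely the extra operand the adcs encoding has no field for.

```python
MASK64 = (1 << 64) - 1


def adds(a: int, b: int) -> tuple[int, int]:
    """Model of `adds`: 64-bit add returning (result, carry_out)."""
    total = (a & MASK64) + (b & MASK64)
    return total & MASK64, total >> 64


def adcs(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """Model of `adcs`: 64-bit add that also consumes the carry flag."""
    total = (a & MASK64) + (b & MASK64) + carry_in
    return total & MASK64, total >> 64


def add128(lo_a: int, hi_a: int, lo_b: int, hi_b: int) -> tuple[int, int, int]:
    """128-bit add built from two 64-bit adds chained through the carry."""
    lo, carry = adds(lo_a, lo_b)         # adds x12, x6, x10
    hi, carry = adcs(hi_a, hi_b, carry)  # adcs x13, x7, x11 (carry is implicit in the ISA)
    return lo, hi, carry


# Low halves wrap around, so the carry must propagate into the high half.
print(add128(MASK64, 0, 1, 0))
```

In the model, adcs takes three inputs and produces two outputs; the real instruction names only three registers because the carry travels through the flags register.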
1. There aren't multiple additions in the recommended sequences. Unsigned is add, bltu; signed with one known sign is add, blt; signed in general is add, slt, slti, bne.
2. These instruction sequences are specified so that an instruction decoder can treat these sequences following the add as a "very wide" instruction specifying to check an overflow flag, if a hardware implementation so chooses.
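For reference, the three sequences from point 1 can be modelled in Python roughly as follows. This is a sketch assuming 64-bit registers; the temporaries t0, t3, t4 and the function names are illustrative, and the order of the two compares in the general signed case doesn't affect the result.

```python
XLEN = 64
MASK = (1 << XLEN) - 1


def to_signed(x: int) -> int:
    """Reinterpret an XLEN-bit pattern as a two's-complement integer."""
    x &= MASK
    return x - (1 << XLEN) if x >> (XLEN - 1) else x


def add(a: int, b: int) -> int:
    """Wrapping XLEN-bit add, like the RISC-V add instruction."""
    return (a + b) & MASK


def unsigned_add_overflows(a: int, b: int) -> bool:
    # add t0, a, b ; bltu t0, a, overflow
    t0 = add(a, b)
    return t0 < (a & MASK)


def signed_add_known_positive_overflows(a: int, positive: int) -> bool:
    # add t0, a, positive ; blt t0, a, overflow
    # Only valid when the second operand is known to be positive.
    t0 = add(a, positive)
    return to_signed(t0) < to_signed(a)


def signed_add_overflows(a: int, b: int) -> bool:
    # add t0, a, b ; slti t3, b, 0 ; slt t4, t0, a ; bne t3, t4, overflow
    t0 = add(a, b)
    t3 = int(to_signed(b) < 0)               # was the addend negative?
    t4 = int(to_signed(t0) < to_signed(a))   # did the sum go down?
    return t3 != t4                          # they must agree unless it wrapped


INT_MAX = (1 << (XLEN - 1)) - 1
print(unsigned_add_overflows(MASK, 1))
print(signed_add_overflows(INT_MAX, 1))
print(signed_add_overflows(1, -1))
```

Counted as ALU operations, the unsigned and known-sign cases are one add plus one compare-and-branch, while the general signed case is one add, two compares, and a compare-and-branch.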
So your recommended sequence has 4 additions done in the adder/subtractor of the ALU, because all comparisons, including the compare-and-branch instructions, count as additions from the point of view of energy consumption and execution time.
Only for the signed overflow case where the value added is unknown at compile time, and only if the processor doesn't do anything fancy to elide the recommended sequence (fusion or substitution of operations). Not for the other two cases, and not with fusion.