That's pretty neat. Is it actually any faster than just doing four 8-bit adds, though?
Presumably it would take 4 logical instructions to do the vectored math, vs. 4 logical instructions to do the scalar additions.
I suppose you're looking at a minimum of two registers for the vectored approach, vs. 8 for the scalar approach. Having the results in separate registers does make them immediately usable by later instructions, though.
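For concreteness, here's roughly what the in-register (SWAR) byte-wise add looks like on a 32-bit word — the function name and the carry-suppression masks are just the standard trick, not anything from a particular library:

```c
#include <stdint.h>

// Add four packed 8-bit lanes, suppressing carries between lanes:
// sum the low 7 bits of each byte normally, then fix up the top bits.
static uint32_t swar_add(uint32_t a, uint32_t b)
{
    uint32_t low  = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu); // per-lane sum of bits 0..6, carry lands in bit 7
    uint32_t high = (a ^ b) & 0x80808080u;                 // per-lane XOR of the top bits
    return low ^ high;                                     // combine; no carry ever crosses a lane boundary
}
```

Each byte wraps modulo 256 independently: swar_add(0x01FF0203, 0x01010101) gives 0x02000304 — the 0xFF lane wraps to 0x00 without disturbing its neighbors. So it's three logical ops plus one add, in line with the instruction counts above.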
There's also the overhead of getting the numbers in and out of memory. Loading and storing one word is obviously going to be way better than loading 4 bytes individually.
It seems to me like the vectored approach would be better for algorithms that require iterating through a large dataset in memory. The scalar approach would be better for algorithms that have a bunch of dependent calculations. Perhaps that's an obvious conclusion!
That's pretty neat though. For the large-dataset scenario, perhaps you could get a significant speedup on relatively simple architectures such as Cortex-M microcontrollers. I suspect that sufficiently modern high-end CPUs/compilers wouldn't benefit so much from it, though? All the pipelining, superscalar execution, caching and whatnot could mask the memory-access latencies to the point of irrelevance. Also, a sufficiently clever compiler could auto-vectorize the loop with actual SIMD instructions and achieve significantly higher performance than the manual in-register optimization.
This would be a fun way to compute a basic 8-bit checksum on a binary blob in a microcontroller... Not that it would be practically useful, because any non-trivially sized blob would be better served by at least a Fletcher checksum if not a full CRC — and both of those are position-dependent, so they seemingly lack the associativity/commutativity this trick relies on.
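A plain sum-of-bytes-mod-256 checksum is associative and commutative, so each lane can independently accumulate every fourth byte and the four lanes can be folded together at the end. A rough sketch of that idea (checksum8 and add_bytes are made-up names; the memcpy keeps the word load alignment-safe, and byte order within the word doesn't matter since every lane gets folded in anyway):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

// SWAR add: four independent 8-bit lanes, no carries between lanes.
static uint32_t add_bytes(uint32_t a, uint32_t b)
{
    uint32_t low  = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    uint32_t high = (a ^ b) & 0x80808080u;
    return low ^ high;
}

// 8-bit checksum = sum of all bytes mod 256, consuming 4 bytes per iteration.
uint8_t checksum8(const uint8_t *data, size_t len)
{
    uint32_t acc = 0;
    size_t i = 0;
    for (; i + 4 <= len; i += 4) {
        uint32_t w;
        memcpy(&w, data + i, 4);  // one word load instead of four byte loads
        acc = add_bytes(acc, w);
    }
    // Fold the four lane sums together (still mod 256), then mop up the tail.
    uint8_t sum = (uint8_t)((acc & 0xFF) + ((acc >> 8) & 0xFF)
                          + ((acc >> 16) & 0xFF) + ((acc >> 24) & 0xFF));
    for (; i < len; i++)
        sum = (uint8_t)(sum + data[i]);
    return sum;
}
```

Fletcher or CRC wouldn't decompose this way, since reordering the bytes changes their result — which is exactly why they catch transposition errors that this simple sum misses.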