undefined | Better HN

0 pointsrustybolt4y ago0 comments

If you, for example, want to do addition of four 8-bit integers within a 32-bit register, you have to use similar techniques to stop the carry from propagating.

For example, when x and y are 32-bit integers holding 4 8-bit integers, you can do

    z = (x ^ y) + (x & y) & 0x7f7f7f7f;

Now z holds four 8-bit integers which hold the sum (modulo 256) of the integers of x and y. The bit mask is to stop the carry from propagating.

0 comments

nonsequitur4y ago

This doesn't work because you're not left-shifting (doubling) the carry. But when adding the shifted carry to (x ^ y) we're back to potentially overflowing the highest bits. The solution is to add the highest and the lower bits separately:

  lower = 0x7f7f7f7f;
  highest = ~lower;
  z = ((x & lower) + (y & lower)) ^ ((x ^ y) & highest);

Note this only improves performance for larger container integers.

rustyboltOP4y ago

You're completely right! Typically you don't care about overflow though (and you should use unsigned ints to avoid undefined behavior).

sgtnoodle4y ago

That's pretty neat. Is it actually any faster than just doing four 8-bit adds, though?

Presumably it would take 4 logical instructions to do the vectored math, vs. 4 logical instructions to do the scalar additions.

I suppose you're looking at a minimum of two registers for the vectored approach, vs. 8 for the scalar approach. Having the result available in separate registers makes them immediately available for use, though.

There's also the overhead of getting the numbers in and out of memory. Loading and storing one word is obviously going to be way better than loading 4 bytes individually.

It seems to me like the vectored approach would be better for algorithms that require iterating through a large dataset in memory. The scalar approach would be better for algorithms that have a bunch of dependent calculations. Perhaps that's an obvious conclusion!

That's pretty neat though. For the large dataset scenario, perhaps you could get a significant speedup on relatively simple architectures such as cortex-m microcontrollers. I suspect that sufficiently modern high end CPUs/compilers wouldn't benefit so much from it, though? All the pipelining, superscaling and caching and whatnot could sufficiently mask the latencies of the memory accesses to the point of being irrelevant. Also, a sufficiently clever compiler could implement the loop with actual SIMD instructions and achieve significantly higher performance than the manual in-register optimization.

This would be a fun way to compute a basic 8-bit checksum on a binary blob in a microcontroller... Not that it would be practically useful because any non-trivially sized blob would be better served with at least a Fletcher checksum if not a full CRC, both of which seemingly lack the necessary associativity.

rustyboltOP4y ago

> That's pretty neat. Is it actually any faster than just doing four 8-bit adds, though?

The operation itself is not necessarily faster (it could be faster due to pipelining, I think, and it would probably ne faster when storing 8 8-bit ints in a 64-bit int), but it can save the hassle and runtime of packing and unpacking into four variables.

djmips4y ago

This technique was useful on a 68000 in a 4 voice software PCM sampled instrument music player running on interrupts on the Atari ST back in the day.

djmips4y ago

Found the game where I worked. Default hardware bleeps and bloops in first version https://www.youtube.com/watch?v=RGOdHT29Jpc

Four voice digi player using SWAR in second version https://www.youtube.com/watch?v=1GrdvcghDXE

2 more replies

j / k navigate · click thread line to collapse

0 comments

nonsequitur4y ago

  lower = 0x7f7f7f7f;
  highest = ~lower;
  z = ((x & lower) + (y & lower)) ^ ((x ^ y) & highest);

Note this only improves performance for larger container integers.

rustyboltOP4y ago

You're completely right! Typically you don't care about overflow though (and you should use unsigned ints to avoid undefined behavior).

sgtnoodle4y ago

That's pretty neat. Is it actually any faster than just doing four 8-bit adds, though?

Presumably it would take 4 logical instructions to do the vectored math, vs. 4 logical instructions to do the scalar additions.

There's also the overhead of getting the numbers in and out of memory. Loading and storing one word is obviously going to be way better than loading 4 bytes individually.

rustyboltOP4y ago

> That's pretty neat. Is it actually any faster than just doing four 8-bit adds, though?

djmips4y ago

This technique was useful on a 68000 in a 4 voice software PCM sampled instrument music player running on interrupts on the Atari ST back in the day.

djmips4y ago

Found the game where I worked. Default hardware bleeps and bloops in first version https://www.youtube.com/watch?v=RGOdHT29Jpc

Four voice digi player using SWAR in second version https://www.youtube.com/watch?v=1GrdvcghDXE

2 more replies

j / k navigate · click thread line to collapse