n2_new = (n_old + (1 << bit))^2 = n2_old + (n_old << (bit + 1)) + (1 << (bit*2))
Then compare it with the source operand, and if it's greater or equal: 1) set the bit in the result 2) update n2_old with n2_new
It can be done in n/2 or perhaps n clock cycles with a suitable microcode instruction set and ALU. With some effort it can be optimized to reduce n to the index of the leftmost set bit in the operand.
[1] https://www.masswerk.at/spacewar/inside/insidespacewar-pt6-g...
do lookups in large tables ever (practically, not theoretically) take one clock cycle?
If there's a large lookup table, it would have to come from memory, which might mean cache and memory hierarchy delays, right?
IIRC IBM z/Arch processors (AFAIK they are internally similar to POWER) have clock limited to around 5 GHz or so, so that L1 cache lookup costs only one cycle (a design requirement).
For example, z14 has 5.2 GHz clock rate and 2x128 kB data and instruction L1 caches.
Such tables may be infeasible, though. While a int8 -> int8 table only needs 256 bytes, an int32 -> int32 needs 16 gigabytes.
https://github.com/serprex/openEtG/blob/2011007dec2616d1a24d...
Tho I could probably save binary size more by importing Math.sqrt from JS
You're right this isn't practical, but fun to think about. Good refresher for those out of school for a while.
Combined with a cutting tool attached to a worm drive we will precisely count our turns (big radius crank for extra precision!) and begin manufacture of slide rules. Can never have too many scales and this is just one we shall etch into them!
The super rough approximation for some uses can be approximately as good as an accurate answer. When you just need a decent starting place for some further Newton-Raphson iteration, for example. (Of course the right-shift trick is a nice way to seed a more accurate square root calculation. :P)
This right-shift thing is far simpler, not very clever, doesn’t involve magic numbers, and is much more well known than the “Quake trick”. There are easy ways to see it. One would be that multiplying a number by itself approximately doubles the number of digits. Therefore halving the number of digits is approximately the square root. You can get more technical and precise by noting that FFS(n) =~ log2(n), and if you remember logs, you know that exp(log(n)/2) = n^(1/2), so shifting right by FFS(n)/2 is just mathematically approximately a square root.
Another nice CUDA hardware intrinsic I like is log2.
The related FRSQRTE instruction is much more conventional, operating on 32-bit floats, again giving approximate inverse square root.
https://en.m.wikipedia.org/wiki/Fast_inverse_square_root#Mot...
A little reading shows the opposite. Most of our smart ideas were already used in 1940s/50s/60s computers, and are recycled on our fab new chips!! Pipelining, out of order exec, multiple cores, etc.
That old-time hardware might have been a bit "chunky" but the architectures used some very smart techniques.
Video: https://youtu.be/o44a1ao5h8w
> My implementation of square root using binary search, that doesn't depend on a multiplier. Only basic ALU instructions are used. It is vigorously undocumented. I have no idea what I wrote but it seems to work.
A fine reminder that if we write clever code, we're probably not going to remember how it works.
You can get a really really rough approximation if you replace Log2(x) with 'count leading zeroes'. With a better approximation of Log(2), you can get closer to the answer.
https://developer.arm.com/documentation/dui0473/m/vfp-instru...