

Did any processor implement an integer square root instruction?

(retrocomputing.stackexchange.com)

251 points · rwallace · 2y ago · 72 comments


42 comments · 12 top-level

fjfaase2y ago· 12 in thread

Is it possible in a single clock cycle? Yes, with a very large lookup table. It is probably possible to reduce the table size, depending on how many serial logic gates can be evaluated within the clock cycle. Consider that the square root of binary 10000 is very similar to that of binary 100; they differ only in the number of zeros.

Findecanor2y ago

Floating-point reciprocal square root estimate (`frsqrte`) instructions are typically implemented as just such a table lookup, indexed by a few bits of the fraction and the LSB of the exponent. The precision is typically similar to that of bf16 (ARM, RISC-V) or fp16 (x86), so programs are expected to do a few Newton-Raphson iterations afterwards if they want more.
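
A rough sketch of that refinement step, not any particular ISA's table: start from a deliberately coarse estimate and apply the Newton-Raphson update y <- y * (1.5 - 0.5 * x * y^2). The `coarse_rsqrt` helper is hypothetical; it just rounds the true value to about 8 significant bits to stand in for the hardware lookup.

```python
import math

def coarse_rsqrt(x: float, bits: int = 8) -> float:
    """Hypothetical stand-in for a hardware estimate: keep about `bits` significant bits."""
    exact = 1.0 / math.sqrt(x)
    mantissa, exponent = math.frexp(exact)   # exact == mantissa * 2**exponent
    scale = 1 << bits
    return math.ldexp(round(mantissa * scale) / scale, exponent)

def refine_rsqrt(x: float, y: float, steps: int = 2) -> float:
    """Newton-Raphson on f(y) = 1/y**2 - x; each step roughly doubles the correct bits."""
    for _ in range(steps):
        y = y * (1.5 - 0.5 * x * y * y)
    return y
```

Each step costs only multiplies and a subtract, which is why a cheap table estimate plus a couple of iterations is an attractive split.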

You can compute the integer square root in n/2 iterations where n is the number of bits in the source using just shifts and adds. For each step, check if a new bit has to be set in the result n_old by computing

n2_new = (n_old + (1 << bit))^2 = n2_old + (n_old << (bit + 1)) + (1 << (bit*2))

Then compare n2_new with the source operand, and if the source operand is greater than or equal to n2_new: 1) set the bit in the result, 2) update n2_old with n2_new.

It can be done in n/2 or perhaps n clock cycles with a suitable microcode instruction set and ALU. With some effort it can be optimized to reduce n to the index of the leftmost set bit in the operand.
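
The steps above can be sketched in software as follows. This is a minimal sketch that assumes an unsigned operand of `nbits` bits with `nbits` even; the function and variable names are mine, with `root` playing the role of n_old and `root_sq` the role of n2_old.

```python
def isqrt_shift_add(x: int, nbits: int = 32) -> int:
    """Integer square root using only shifts, adds and compares (nbits assumed even)."""
    if not 0 <= x < (1 << nbits):
        raise ValueError("x must fit in nbits unsigned bits")
    root = 0      # n_old: result bits decided so far
    root_sq = 0   # n2_old: square of root, kept up to date incrementally
    for bit in range(nbits // 2 - 1, -1, -1):
        # (root + 2**bit)**2 expanded so that no multiplication is needed
        cand_sq = root_sq + (root << (bit + 1)) + (1 << (2 * bit))
        if cand_sq <= x:          # the candidate still fits: keep this bit
            root |= 1 << bit
            root_sq = cand_sq
    return root
```

The loop runs nbits/2 times, one result bit per iteration, which matches the n/2 figure.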

Compare the integer square root algorithm used in "Spacewar!" [1]. So even by 1960 it should have been possible to implement a square-root step instruction for each bit, much like division or multiplication shifts, and to progress from this to full-fledged automatic operations by use of a sub-timing network. (I guess it really depends on the economics of the individual use case whether the effort pays off, as you would need a few additional hardware modules to accomplish this.)

[1] https://www.masswerk.at/spacewar/inside/insidespacewar-pt6-g...

m4632y ago

so, dumb question.

do lookups in large tables ever (practically, not theoretically) take one clock cycle?

If there's a large lookup table, it would have to come from memory, which might mean cache and memory hierarchy delays, right?

bagels2y ago

If the table is in silicon, you can avoid this. Not sure if that is done in practice though.

js82y ago

There definitely is a trade-off between memory size and how quickly it can be accessed.

IIRC IBM z/Arch processors (AFAIK they are internally similar to POWER) have clock limited to around 5 GHz or so, so that L1 cache lookup costs only one cycle (a design requirement).

For example, z14 has 5.2 GHz clock rate and 2x128 kB data and instruction L1 caches.

WithinReason2y ago

Sounds like it's possible to run any algorithm in the world in 1 clock cycle.

retrac2y ago

Yes. In theory, any pure function can be turned into a lookup table. And any lookup table that isn't just random numbers can be turned into a more compact algorithm that spends compute to save space.

Such tables may be infeasible, though. While an int8 -> int8 table only needs 256 bytes, an int32 -> int32 table needs 16 gigabytes.
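
A quick back-of-the-envelope check of those sizes, assuming one table entry per possible input and outputs stored in whole bytes (`table_bytes` is just an illustrative helper):

```python
def table_bytes(input_bits: int, output_bits: int) -> int:
    """Size of a full lookup table: one entry for every possible input value."""
    entries = 1 << input_bits            # 2**input_bits distinct inputs
    return entries * (output_bits // 8)  # output_bits assumed to be a multiple of 8

print(table_bytes(8, 8))             # 256 bytes
print(table_bytes(32, 32) // 2**30)  # 16 (GiB)
```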

ajb2y ago

It isn't, because eventually the size of your logic or table becomes larger than the distance a signal can propagate in one clock tick. Before that, it likely presents practical issues (e.g., is it worth dedicating that much silicon?).

With a sufficiently large chip and a sufficiently slow clock, sure.

robinduckett2y ago

I like the Quantum BogoSort as a proof of this /s

benlivengood2y ago

It's not as bad for integer square root; for an input range of size N you only need to store N^0.5 entries in a greater/lesser-than lookup table: the square k^2 of every possible answer k. Feasible for 16-bit integers, maybe for 32-bit, not for 64-bit.
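
A small sketch of that table for 16-bit inputs. Hardware could compare against all entries in parallel; here `bisect_right` from the standard library stands in for the comparators, and the names are illustrative only.

```python
from bisect import bisect_right

# One entry per possible answer: 256 squares cover every 16-bit input.
SQUARES_16 = [k * k for k in range(256)]

def isqrt_table16(x: int) -> int:
    """Largest k with k*k <= x, found by comparing x against the stored squares."""
    if not 0 <= x < (1 << 16):
        raise ValueError("x must be a 16-bit unsigned value")
    # bisect_right counts the entries <= x; the last of them is the answer.
    return bisect_right(SQUARES_16, x) - 1
```

The same layout for 32-bit inputs needs 65536 entries, and for 64-bit inputs about 4 billion, which is where it stops being attractive.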

treprinum2y ago· 7 in thread

Can't you use the sequence 1 + 3 + 5 + ... + (2k + 1) to get the integer square root of any integer? The root is the number of terms in the largest partial sum of this sequence that does not exceed your number.
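
A minimal sketch of this idea, assuming a non-negative integer input: subtract successive odd numbers and count how many fit. The function name is mine.

```python
def isqrt_odd_sums(x: int) -> int:
    """Count how many of 1, 3, 5, ... can be subtracted from x without going negative."""
    if x < 0:
        raise ValueError("x must be non-negative")
    root = 0
    odd = 1
    remaining = x
    while remaining >= odd:
        remaining -= odd   # 1 + 3 + ... + (2k - 1) == k**2
        odd += 2
        root += 1
    return root
```

It needs about sqrt(x) iterations, so it is simple rather than fast.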

maxcoder42y ago

Can you explain your idea? Your algorithm is correct by definition, but doing this naively would be very slow (even for a 32-bit number). At this point it would be much faster to just binsearch it.
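
For comparison, a binary search over the candidate roots might look like the sketch below. It assumes a non-negative integer and uses a multiply per step; the name is illustrative.

```python
def isqrt_bisect(x: int) -> int:
    """Binary search for the largest r with r*r <= x."""
    if x < 0:
        raise ValueError("x must be non-negative")
    lo, hi = 0, x + 1          # invariant: lo*lo <= x < hi*hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if mid * mid <= x:
            lo = mid
        else:
            hi = mid
    return lo
```

This takes roughly log2(x) iterations, about 32 for a 32-bit input, versus up to 65535 subtractions for the odd-number sum.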

__s2y ago

For an example of binsearch algo, I recently dipped into this while switching some code from floats to fixed point arithmetic (reducing overall wasm blob size)

https://github.com/serprex/openEtG/blob/2011007dec2616d1a24d...

Tho I could probably save binary size more by importing Math.sqrt from JS

tomatocracy2y ago

Better might be to use the expansion (x+y)^2 = x^2 + 2xy + y^2 along with the observation that in any base, the square root of a 2n-digit number has at most n digits, as in the common method for calculating a square root "by hand" with pen and paper. If you did this 8 bits at a time then you would only need a lookup table for roots of 8-bit numbers.

And you would iterate through that sequence? That's exponential time in the bit length of the input...

sublinear2y ago

It's sqrt(n) - 1 additions for the n you're trying to get the integer square root of. Memoization would make it constant time for any lesser n than the greatest n you've done this for. For greater n it's sqrt(new_big_n - prev_big_n) - 1 more additions to memoize.

You're right this isn't practical, but fun to think about. Good refresher for those out of school for a while.

HPsquared2y ago

And this is one of those "embarrassingly parallel" tasks.

sublinear2y ago

Yes. On a desert island we can have the whole village construct this table for newton-raphson guesses.

Combined with a cutting tool attached to a worm drive we will precisely count our turns (big radius crank for extra precision!) and begin manufacture of slide rules. Can never have too many scales and this is just one we shall etch into them!

dahart2y ago· 4 in thread

For an approximate (very rough) answer, as opposed to one accurate to the nearest integer, a right shift by half the number of bits of the leading 1’s position will do, and of course nearly every processor has a shift instruction. I’m not sure how often processors haven’t had something like FLO (Find Leading One) or FFS (Find First Set) instruction, those seem ubiquitous as well.

The super rough approximation for some uses can be approximately as good as an accurate answer. When you just need a decent starting place for some further Newton-Raphson iteration, for example. (Of course the right-shift trick is a nice way to seed a more accurate square root calculation. :P)
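
A sketch of the seeding idea, assuming non-negative Python integers: take the right-shift estimate as the starting point and refine it with integer Newton-Raphson (Heron's method). The shift estimate never falls below the true integer root, so the iteration only moves downward and stops as soon as it no longer decreases. The function name is mine.

```python
def isqrt_newton(x: int) -> int:
    """Integer square root: shift-based seed, then Newton-Raphson refinement."""
    if x < 0:
        raise ValueError("x must be non-negative")
    if x < 2:
        return x
    leading = x.bit_length() - 1      # position of the leading 1
    r = x >> (leading // 2)           # rough estimate, never below the true root
    while True:
        nxt = (r + x // r) // 2       # Newton step for r*r = x in integers
        if nxt >= r:                  # no longer decreasing: r is the answer
            return r
        r = nxt
```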

lordnacho2y ago

Is this where the DOOM reference comes in? Somewhat famous Internet story by now featuring Carmack and a magic 32 bit number.

epcoa2y ago

You mean Quake 3 and fast inverse square root? No. And it wasn’t Carmack. https://www.beyond3d.com/content/articles/15/

https://en.wikipedia.org/wiki/Fast_inverse_square_root

dahart2y ago

Not really, that’s a very clever trick used on floating point numbers, and does the approximate reciprocal square root.

This right-shift thing is far simpler, not very clever, doesn’t involve magic numbers, and is much more well known than the “Quake trick”. There are easy ways to see it. One would be that multiplying a number by itself approximately doubles the number of digits. Therefore halving the number of digits is approximately the square root. You can get more technical and precise by noting that FFS(n) =~ log2(n), and if you remember logs, you know that exp(log(n)/2) = n^(1/2), so shifting right by FFS(n)/2 is just mathematically approximately a square root.
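
The rough estimate on its own can be sketched like this. I use the position of the leading one (what `int.bit_length()` gives, i.e. floor(log2(n))) rather than the lowest set bit; the function name is illustrative.

```python
def rough_sqrt(x: int) -> int:
    """Very rough square root: shift right by half the leading-one position."""
    if x <= 0:
        return 0
    leading = x.bit_length() - 1   # floor(log2(x))
    return x >> (leading // 2)

print(rough_sqrt(16))       # 4
print(rough_sqrt(100))      # 12 (true root is 10)
print(rough_sqrt(1000000))  # 1953 (true root is 1000)
```

The result is never below the true integer root and always less than twice the real square root, which is good enough for a seed but not much else.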

Fun fact, FFS (and its generalization, FNS) is in CUDA: https://docs.nvidia.com/cuda/cuda-math-api/index.html#group_...

Another nice CUDA hardware intrinsic I like is log2.

corsix2y ago· 3 in thread

AArch64 NEON has the URSQRTE instruction, which gets closer to the OP's question than you might think; view a 32-bit value as a fixed-precision integer with 32 fractional bits (so the representable range is evenly spaced 0 through 1-ε, where ε=2^-32), then URSQRTE computes the approximate inverse square root, halves it, then clamps it to the range 0 through 1-ε. Fixed-precision integers aren't quite integers, and approximate inverse square root isn't quite square root, but it might get you somewhere close.

The related FRSQRTE instruction is much more conventional, operating on 32-bit floats, again giving approximate inverse square root.

What task benefits from such a complex instruction, one so easily divisible into simpler ones, enough for it to be present in AArch64?

colechristensen2y ago

Inverse square root is used for normalizing vectors, particularly in computer graphics calculations, where it needs to run a great many times, very quickly.

https://en.m.wikipedia.org/wiki/Fast_inverse_square_root#Mot...

Neon is SIMD so I would presume these instructions let you vectorize those calculations and do them in parallel on a lot of data more efficiently than if you broke it down into simpler operations and did them one by one.
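
To make the normalization use case concrete, here is a scalar sketch (plain Python, not NEON): one reciprocal square root per vector, followed by a multiply per component instead of a divide per component. The function name is mine, and a SIMD version would do the same arithmetic on several vectors at once.

```python
import math

def normalize(v: list[float]) -> list[float]:
    """Scale v to unit length using a single reciprocal square root."""
    length_sq = sum(c * c for c in v)
    if length_sq == 0.0:
        raise ValueError("cannot normalize the zero vector")
    inv_len = 1.0 / math.sqrt(length_sq)   # the value an rsqrt instruction approximates
    return [c * inv_len for c in v]
```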

moomin2y ago· 3 in thread

You need to read down a bit, but the answer “ENIAC” is hilarious.

So many people assume that everything that came before they were at school was primitive, and barely chugged along :)

A little reading shows the opposite. Most of our smart ideas were already used in 1940s/50s/60s computers, and are recycled on our fab new chips!! Pipelining, out of order exec, multiple cores, etc.

That old-time hardware might have been a bit "chunky" but the architectures used some very smart techniques.

Another example is virtualization, which pretty much enabled the whole "cloud" thing and came from mainframe architecture in the 60s. Intel and others brought it to consumer-grade CPUs.

xyst2y ago

hardware engineer humor lol

Pet_Ant2y ago· 1 in thread

I'm sure the VAX must have? (if we are including microcode)

linksnapzz2y ago

IIRC, VAX had a "factor this quadratic" instruction...

If you wanted to expand the definition of “processor” to electromechanical contraptions, the Friden SRQ could perform square roots using just additions and shifts, with not a single electronic component other than a motor. And since you had to position the decimal points manually, it would _technically_ be an integer operation…

Video: https://youtu.be/o44a1ao5h8w

kelnos2y ago

This bit in an answer further down made me chuckle:

> My implementation of square root using binary search, that doesn't depend on a multiplier. Only basic ALU instructions are used. It is vigorously undocumented. I have no idea what I wrote but it seems to work.

A fine reminder that if we write clever code, we're probably not going to remember how it works.

Dwedit2y ago

2 ^ (1/2 * Log2(X)) = sqrt(X)

You can get a really, really rough approximation if you replace Log2(X) with a value derived from 'count leading zeroes'. With a better approximation of Log2(X), you can get closer to the answer.

bryanlarsen2y ago

IIRC, most (all?) fixed-point DSPs have a square root instruction and/or helper instructions.

jlarcombe2y ago

Semi-related and of interest to 6502 fans, exhaustive analysis of square root algorithms: https://github.com/TobyLobster/sqrt_test

pajko2y ago

ARM VFP has VSQRT

https://developer.arm.com/documentation/dui0473/m/vfp-instru...
