
Hextinium 3y ago

My naive guess is that most floating-point code uses FP32, and that an FP64 unit takes at least double the die area. So optimize for FP32 and keep some FP64 around for the rare equations that need it.
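As a rough sanity check on that guess, here is a back-of-the-envelope sketch. It assumes multiplier area grows roughly with the square of the significand width, which is a common approximation for array multipliers and not a measurement of any real chip. The function name is made up for illustration.

```python
# Significand widths from IEEE 754, including the implicit leading bit.
FP32_SIGNIFICAND_BITS = 24
FP64_SIGNIFICAND_BITS = 53


def relative_multiplier_area(bits: int, baseline_bits: int = FP32_SIGNIFICAND_BITS) -> float:
    """Area relative to a baseline multiplier, ASSUMING area scales as bits**2."""
    return (bits / baseline_bits) ** 2


ratio = relative_multiplier_area(FP64_SIGNIFICAND_BITS)
print(f"FP64 multiplier is roughly {ratio:.1f}x the area of an FP32 multiplier")
```

Under that assumption the ratio comes out a bit under 5x, so "at least double" looks conservative for the multiplier alone. Adders, exponent logic, and register files scale differently, so the whole-unit ratio may well be smaller.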

thesz 3y ago

These compute units are usually sliced: the same part of the die can perform either four FP32 multiplies or one FP64 multiply. The trick is at least as old as PA-RISC; from what I remember, it was HP who introduced the sliced ALU, capable of doing one large operation or several smaller ones on the same hardware.

I may be wrong about who did that first, but most FPUs are now built like that.
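To make the slicing idea concrete, here is a minimal sketch using plain integers. It is not HP's design or any real FPU: a real unit slices the significand multiplier and also handles exponents, rounding, and carry routing. The 32-bit slice width and all function names are assumptions chosen for illustration. The point is only that the same four narrow multipliers can serve either four independent small multiplies or one wide multiply.

```python
SLICE_BITS = 32                      # assumed width of one narrow multiplier
MASK = (1 << SLICE_BITS) - 1


def slice_multiply(a: int, b: int) -> int:
    """One narrow multiplier: 32 x 32 -> 64 bits."""
    return (a & MASK) * (b & MASK)


def four_narrow(pairs):
    """Narrow mode: four independent multiplies, one per slice."""
    assert len(pairs) == 4
    return [slice_multiply(a, b) for a, b in pairs]


def one_wide(a: int, b: int) -> int:
    """Wide mode: one 64 x 64 -> 128-bit multiply built from the same four slices."""
    a_lo, a_hi = a & MASK, (a >> SLICE_BITS) & MASK
    b_lo, b_hi = b & MASK, (b >> SLICE_BITS) & MASK
    # The four partial products reuse the narrow multipliers.
    lo_lo, lo_hi, hi_lo, hi_hi = four_narrow(
        [(a_lo, b_lo), (a_lo, b_hi), (a_hi, b_lo), (a_hi, b_hi)]
    )
    # Shift each partial product into place and sum them.
    return lo_lo + ((lo_hi + hi_lo) << SLICE_BITS) + (hi_hi << (2 * SLICE_BITS))
```

In wide mode the only extra hardware this sketch implies is the shifting and the final additions, which is where the small extra delay of a sliced design would come from.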

my123 3y ago

On GPUs, they haven't been sliced like this for quite a long time, in order to save die area.

thesz 3y ago

Slicing was introduced precisely to save die area. Not slicing gets you a slightly smaller computation delay at the cost of greater die area.
