In remote sensing | computation physicas applications it's rare to have a single FFT to compute (whatever algorithm is chosen).
Hence the practice of stuffing many FFT's through GPU grids in parallel and working to max out the hardware usage in order to increase application throughput.
What I mean is: where did you take that from? I program FFTs on GPUs, and I see no reason for the "inherently can't reach 100% utilization by any metric".
I interpret that comment as you're not going to be using every silicon block that the GPU provides, like video codecs and rasterizing. If you've maxed out compute without going over the power budget, for example, you'd likely still be able to decode video if the GPU has a separate block for it.