I'm aware of all of that. And yes, we're heavily synchronization-dependent. However, we also spent
a lot of time tuning the launch parameters and properly interleaving all synchronization events and fences, because we need low latency.
Find our original publication here: https://doi.org/10.1364/BOE.5.002963
Since then we've improved on that. For the resampling and complex tonemapping we determined empirically that a grid of 128 threads, each processing a whole line, achieves the best throughput; there's a 2D parameter space of possible launch configurations and we brute-force the whole thing (I haven't benchmarked the RTX20xx and RTX30xx GPUs yet, but the result was consistent from the GTX690 to the GTX1080). The FFT plan is whatever cufftPlan1d produces for a single-axis transform over a 2D array, usually a 2048-point FFT, but with up to 4096 lines (well, technically whatever the maximum dimension for 3D textures is).
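For illustration, a minimal sketch of what such a brute-force sweep over launch configurations could look like. This is not the code from the publication: the kernel `process_lines`, the struct `LaunchConfig`, and the helpers `sweep` and `time_process_lines` are hypothetical names, the tonemap is a placeholder, and it assumes the two axes of the parameter space are the number of blocks and the threads per block.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cfloat>

// Hypothetical stand-in for the per-line work: each thread walks whole
// lines, striding by the total number of threads in the grid.
__global__ void process_lines(const float* in, float* out, int n_lines, int line_len)
{
    int stride = gridDim.x * blockDim.x;
    for (int line = blockIdx.x * blockDim.x + threadIdx.x; line < n_lines; line += stride) {
        const float* src = in + (size_t)line * line_len;
        float* dst = out + (size_t)line * line_len;
        for (int i = 0; i < line_len; ++i)
            dst[i] = src[i] / (1.0f + fabsf(src[i]));  // placeholder tonemap
    }
}

struct LaunchConfig { int blocks; int threads; float ms; };

// Try every (blocks, threads) pair, with threads in warp-sized steps,
// and keep the fastest. time_ms(blocks, threads) returns the measured
// time in ms, or a negative value if that configuration failed.
template <typename TimeFn>
LaunchConfig sweep(int max_blocks, int max_threads, TimeFn time_ms)
{
    assert(max_blocks >= 1 && max_threads >= 32);
    LaunchConfig best{0, 0, FLT_MAX};
    for (int blocks = 1; blocks <= max_blocks; ++blocks)
        for (int threads = 32; threads <= max_threads; threads += 32) {
            float ms = time_ms(blocks, threads);
            if (ms >= 0.0f && ms < best.ms) best = {blocks, threads, ms};
        }
    return best;
}

// Average kernel time for one configuration, measured with CUDA events.
float time_process_lines(const float* d_in, float* d_out, int n_lines,
                         int line_len, int blocks, int threads, int reps = 20)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    process_lines<<<blocks, threads>>>(d_in, d_out, n_lines, line_len);  // warm-up
    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        process_lines<<<blocks, threads>>>(d_in, d_out, n_lines, line_len);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = -1.0f;  // stays negative if the launch was rejected
    if (cudaGetLastError() == cudaSuccess)
        cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms < 0.0f ? ms : ms / reps;
}
```

With device buffers `d_in` and `d_out` already allocated for 4096 lines of 2048 samples, a sweep might be run as `sweep(64, 1024, [&](int b, int t) { return time_process_lines(d_in, d_out, 4096, 2048, b, t); })`. The winning configuration depends on the GPU and on the actual kernel, so the numbers have to be measured per device.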
> Do you launch a big grid that consists of multiple samples combined in a matrix
Of course!
> or you launch each sample separately?
Of course not, that'd be stupid.