
DTolm · 5y ago

I average the data over 1000 merged launches and then average that result over a number of runs. Merging FFT calls is actually how I use VkFFT in Vulkan Spirit (with some other shaders in between), so this benchmark is fairly close to the real-life application use case. By design, my benchmark most likely averages out any multimodal distribution effects.
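The two-level averaging described above can be sketched roughly as follows. This is only an illustration in Python, not VkFFT's actual benchmark code: `time_merged_launches` and `dummy_work` are hypothetical names, and the dummy workload merely stands in for a single FFT launch.

```python
import statistics
import time


def time_merged_launches(work, launches_per_run=1000, runs=5):
    """Time `launches_per_run` back-to-back calls as one merged batch,
    convert to a per-launch time, and repeat that for `runs` runs."""
    per_launch_times = []
    for _ in range(runs):
        start = time.perf_counter()
        for _ in range(launches_per_run):
            work()  # hypothetical stand-in for one FFT launch
        elapsed = time.perf_counter() - start
        per_launch_times.append(elapsed / launches_per_run)
    # The reported figure is the mean of the per-run averages.
    return statistics.mean(per_launch_times), per_launch_times


def dummy_work():
    # Placeholder workload (assumption): not an FFT, just something to time.
    sum(i * i for i in range(100))


mean_time, per_run = time_merged_launches(dummy_work, launches_per_run=1000, runs=5)
print(f"mean per-launch time: {mean_time * 1e6:.2f} us over {len(per_run)} runs")
```

Note that only one number per run survives this procedure, so any per-launch variation inside a run is no longer visible in the result.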



datenwolf · 5y ago

The OCT data we process comes in at about 4 GSamples/s, and my benchmark covers ~5 ms of capture data; for the dataset considered, that is a 1D FFT with a length of 2048 points and a block size of 128. It is not a synthetic benchmark: I'm measuring real-life application behavior here (and to eliminate the runtime effects of the other parts, I can flip a flag that skips the DAQ code path and works on allocated but uninitialized buffers).

DTolm (OP) · 5y ago

Small FFTs like 2048 points only utilize one SM, and the way they are submitted to the GPU may produce some fluctuations. It also depends on how your code works. Synchronizations have a bigger impact in this case as well. Do you launch a big grid that consists of multiple samples combined in a matrix, or do you launch each sample separately?

datenwolf · 5y ago

I'm aware of all of that. And yes, we're very synchronization-dependent. However, we also spent a lot of time tinkering with the launch parameters and properly interleaving all synchronization events and fences, because of our low-latency requirements.

Find our original publication here: https://doi.org/10.1364/BOE.5.002963

Since then we have improved on that. For the resampling and complex tonemapping, we determined empirically that a grid of 128 threads, each processing a whole line, achieves the best throughput; there's a 2D parameter space of possible launch configurations and we brute-force the whole thing (so far I haven't benchmarked the RTX20xx and RTX30xx GPUs, but it was consistent from the GTX690 to the GTX1080). The FFT plan is what cufftPlan1d produces for a single-axis transform over a 2D array, usually a 2048-point FFT, but with up to 4096 lines (well, technically whatever the maximum dimension for 3D textures is).

> Do you launch a big grid that consists of multiple samples combined in a matrix

Of course!

> or you launch each sample separately?

Of course not, that'd be stupid.
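To illustrate the distinction being discussed (one batched single-axis transform over a 2D array versus one transform per line), here is a small CPU-side sketch. It uses NumPy rather than cuFFT, so it says nothing about GPU performance; the array sizes are illustrative assumptions, and it only shows that both formulations compute the same result.

```python
import numpy as np

rng = np.random.default_rng(0)
lines, points = 64, 2048  # illustrative sizes (assumption), one row per line
data = rng.standard_normal((lines, points)) + 1j * rng.standard_normal((lines, points))

# Batched: a single call transforms every line along the last axis.
batched = np.fft.fft(data, axis=1)

# Per-line: one call per row, which is the pattern to avoid on a GPU.
per_line = np.stack([np.fft.fft(row) for row in data])

same = np.allclose(batched, per_line)
print("batched and per-line results agree:", same)
```

On a GPU the batched form matters because it amortizes launch and synchronization overhead across all lines, which is the point both posters agree on.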


llukas · 5y ago

Please check out cuFFTDx - you may be able to fuse parts of your pipeline on-chip.

eximius · 5y ago

If it's multimodal, then averaging it out is the wrong thing to do. A histogram would be more appropriate to display the different modes.
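A quick sketch of why the mean can mislead here, using synthetic timings (the two modes at 1.25 ms and 3.25 ms are made-up values for illustration, not measurements from either benchmark):

```python
import random
import statistics

random.seed(0)

# Synthetic bimodal timings in ms (assumption): a fast mode and a slow mode.
timings = [random.gauss(1.25, 0.05) for _ in range(500)] + \
          [random.gauss(3.25, 0.05) for _ in range(500)]

mean = statistics.mean(timings)


def histogram(values, lo, hi, bins):
    """Count values into `bins` equal-width bins spanning [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(max(int((v - lo) / width), 0), bins - 1)
        counts[idx] += 1
    return counts


counts = histogram(timings, 0.0, 4.0, 8)
for i, c in enumerate(counts):
    print(f"{i * 0.5:.1f}-{(i + 1) * 0.5:.1f} ms | {'#' * (c // 20)}")
print(f"mean = {mean:.2f} ms")
```

The mean lands in a bin that contains no samples at all, so it describes a run time that never actually occurs, while the histogram shows both modes directly.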
