in the time domain, you're looking at what... for each sample, a pointwise multiply for each tap and then a sum, right? i'm guessing for most audio applications at commonly used sample rates, you're rarely going to have more than 24 taps at the very most? (with most only really needing 8 to 16?)
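to make the per-sample cost concrete, here's a minimal sketch of a direct-form fir in plain python. the names (`fir_filter`, `samples`, `taps`) are just made up for illustration, and a real implementation would be vectorized c or intrinsics rather than a python loop. the point is only that each output sample costs one multiply per tap plus the sum.

```python
def fir_filter(samples, taps):
    """Direct-form FIR: one multiply-accumulate per tap for every output sample."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, h in enumerate(taps):
            # Samples before the start of the buffer are treated as zero.
            if n - k >= 0:
                acc += h * samples[n - k]
        out.append(acc)
    return out


# Hypothetical 2-tap averaging filter, just to show the call.
smoothed = fir_filter([1.0, 2.0, 3.0, 4.0], [0.5, 0.5])
```

the inner loop is the part simd would chew through, since the taps are a short fixed-length dot product.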
with the fft, unless you're using super exotic multiwindow schemes, you're looking at a pointwise multiply just to do the windowing before the fft. then you're looking at n log n to compute the fft itself, then zeroing or applying an envelope (pointwise multiply), then another n log n for the inverse fft back to the time domain.
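a rough back-of-envelope model of those steps, counting real multiplies per output sample. everything here is an assumption for illustration: a textbook radix-2 complex fft at (n/2)·log2(n) complex multiplies, 4 real multiplies per complex multiply, a real-valued envelope, and a hypothetical `hop` parameter for how many new samples each frame produces. real-input ffts and better algorithms would cut these numbers down, so treat it as an order-of-magnitude sketch and not a benchmark.

```python
import math


def fir_mults_per_sample(num_taps):
    """Time domain: one real multiply per tap for each output sample."""
    return num_taps


def fft_mults_per_sample(n, hop):
    """Rough real-multiply count for window -> FFT -> envelope -> inverse FFT.

    Assumes n is a power of two and a plain radix-2 complex FFT.
    """
    # (n/2) * log2(n) complex multiplies, about 4 real multiplies each.
    fft_real_mults = 4 * (n // 2) * int(math.log2(n))
    window = n        # one real multiply per sample for the window
    envelope = 2 * n  # real gain on each complex bin: 2 real multiplies
    total = window + 2 * fft_real_mults + envelope
    # Each frame only yields `hop` new output samples when frames overlap.
    return total / hop


fir_cost = fir_mults_per_sample(16)
fft_cost = fft_mults_per_sample(1024, 256)
```

under this model a 1024-point frame with a 256-sample hop comes out to 172 real multiplies per sample, against 16 for a 16-tap filter. the gap shrinks with less overlap or smaller frames, but it doesn't obviously close for tap counts this small.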
i think with simd it's way faster to just stay in the time domain.
would be interesting to bench for sure though... small ffts and simd may be super fast and not that many instructions.