Couple of notes, I think you forgot to apply timestep when adding rocket exhaust velocity, pretty sure it should be
u[idx] += backward.x * flame_velocity_amount * falloff * delta
v[idx] += backward.y * flame_velocity_amount * falloff * delta
You need to compensate by scaling up flame_velocity_amount, I used 85, seemed about the same.Thank you
In my implementations I use 4th order for both and vortices stick around a lot longer.
So instead of reading 16 grid values and combining them to get the interpolated sample value, you can fetch 4 bilinearly filtered samples and combine those. And thanks to the hardware filtering, those bilinear samples cost basically the same as reading an unfiltered value.
[1]: https://developer.nvidia.com/gpugems/gpugems2/part-iii-high-...
I suppose because the fetches are generally to similar memory regions, there may not be a substantial performance improvement due to L1 and L2 hits on recent GPUs.
That'll perform even worse though, hopefully my CPU can handle it or I'm gonna need a lot of leftover time to make a shader
The 2001 Fedkiw/Stam/Jensen "Visual Simulation of Smoke" paper added it back as a correction force for exactly this reason. At N=16 it doesn't matter much because the grid itself can't represent fine vortices, but the moment you crank N up the missing confinement becomes visible.