I hope that now that generative AI is becoming mainstream, AMD steps up its game on both its consumer and professional lineups. If I were to buy a video card right now (mostly for gaming, ML hobby projects, and running Stable Diffusion), I wouldn't pick AMD, because only one of my three use cases (gaming) would work properly without headaches.
Versus a polyglot compiler infrastructure, IDE tooling that includes shader debugging, and a rich ecosystem of GPU-based libraries.
Even with SYCL and SPIR-V, that has hardly improved, and while Intel bases oneAPI on top of SYCL, that naturally also goes beyond the standard.
AMD claims to have HIP-RT working internally, but not yet suitable for posting publicly. Intel is planning it, I think. Both should land around Blender 3.6, if I'm not mistaken.
Going by raw FLOPS, CUDA (not OptiX) and HIP were nearly equivalent in performance, last I remember. I think RDNA2 just does "more with less", at least in terms of gaming performance per FLOP (e.g. due to the huge cache).
AIUI, what's in current git master is very different.
Openness is totally secondary to functionality. You know, the same kind of reason why Linux on the desktop has not been a mass-market thing compared to Windows for a very long time.
There's a lot of established ecosystem for CUDA, thanks to Nvidia's investment.
As an aside, I’ve been kinda surprised that this has existed for as long as it has, but I am probably biased and think ML acceleration is more important than most large businesses do today.
CUDA is only for general purpose compute.
Vulkan is primarily for graphics but does have options for GPGPU too. However, unlike OpenGL, Vulkan is fairly close to the hardware in terms of abstraction.
[1] https://github.com/xuhuisheng/rocm-build [2] https://github.com/RadeonOpenCompute/ROCm/issues/1587
https://www.travelneil.com/stable-diffusion-windows-amd.html
For 50 iterations:
* ONNX on Windows was 4-5 minutes
* ROCm on Arch Linux was ~2.5 minutes
* SHARK on Windows is ~30 seconds
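To put those totals on a common scale, here is a rough back-of-the-envelope sketch that converts them to seconds per iteration. The figures are the approximate ones quoted above; taking 4.5 minutes as the midpoint of "4-5 minutes" is my assumption, and the names in the code are just illustrative.

```python
# Approximate totals quoted above, each for 50 iterations.
# Assumption: "4-5 minutes" is represented by its 4.5-minute midpoint.
ITERATIONS = 50

total_seconds = {
    "ONNX on Windows": 4.5 * 60,
    "ROCm on Arch Linux": 2.5 * 60,
    "SHARK on Windows": 30.0,
}


def seconds_per_iteration(total, iterations=ITERATIONS):
    """Average wall-clock seconds spent on one iteration."""
    return total / iterations


per_iter = {name: seconds_per_iteration(t) for name, t in total_seconds.items()}
baseline = per_iter["SHARK on Windows"]

for name, secs in per_iter.items():
    # Ratio shows how many times longer than the SHARK run each backend took.
    print(f"{name}: {secs:.1f} s/it ({secs / baseline:.1f}x the SHARK time)")
```

With these rounded inputs, ROCm comes out at roughly five times the SHARK time per iteration and ONNX at roughly nine times, but the inputs are eyeballed numbers, so treat the ratios as ballpark only.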
I wasn't aware that Flash Attention trades accuracy for performance. Either I have a wrong understanding of what FA is, or this statement is not fully accurate.
Either way - looks like great work
We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method.
So I assume they are using the approximate version as they also have an exact version.
This is incorrect. Those optimizations do identical computations but use the GPU's memory bandwidth more effectively, so there is no accuracy tradeoff there.
That said, we (the Nod.ai team) will add support for xformers soon, so you can opt in to xformers anyway.
Google has been doing a good job advancing the IREE ML compiler project, which I think is what will bring other hardware platforms like AMD and Intel into the ML game. The industry can only benefit from increased hardware portability.
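A toy illustration of why block-wise processing does not have to cost accuracy: the sketch below computes softmax attention for a single query twice, once over all keys at once and once block by block with a running max and running sum (the "online softmax" trick). This is plain Python written for this comment, not the actual FlashAttention kernel; the function names and the block size are made up for the example.

```python
import math
import random


def naive_attention(q, keys, values):
    """Softmax attention for one query, using every score at once."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def tiled_attention(q, keys, values, block_size=2):
    """Same result, but keys/values are visited one block at a time."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old max (0.0 on the first block).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


random.seed(0)
q = [random.uniform(-1, 1) for _ in range(4)]
keys = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(7)]
values = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(7)]

exact = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, block_size=3)
print(max(abs(a - b) for a, b in zip(exact, tiled)))
```

The two outputs agree up to floating-point rounding, because the rescaling step is algebraically exact. The block-sparse variant mentioned in the quoted abstract is a different thing: it skips blocks entirely, which is where the approximation comes from.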