Stable Diffusion on AMD RDNA3

(nod.ai)

158 points · tomtomlapomme · 3y ago · 68 comments

30 comments · 8 top-level

AstixAndBelix · 3y ago · 7 in thread

Does anyone know the current state of AMD's tools for migrating from CUDA? There's so much untapped potential in these cards; it's crazy that basically only gamers can make use of their competitive prices.

marcyb5st · 3y ago

Last time I seriously checked (6 months ago or so), ROCm was still a far cry from CUDA. Setup was a mess, support was hit and miss, and some operations were not particularly performant compared to their CUDA counterparts. Additionally, there are TensorFlow and probably PyTorch forks that should work with it, but they lag behind the official repositories quite a bit.

I hope that now that generative AI is becoming mainstream, AMD steps up its game on both its consumer and professional lineups. If I were to buy a video card right now (mostly for gaming, ML hobby projects, and running Stable Diffusion), I wouldn't pick AMD, because I could do only one of my three use cases (gaming) properly without headaches.

rowanG077 · 3y ago

OpenCL works pretty well. Can't say I notice large performance gaps between CUDA and OpenCL for my HPC work.

epmaybe · 3y ago

I don’t think there’s truly a competitor, but OpenCL is the alternative to shoot for. Otherwise, for machine learning purposes, AMD helps develop ROCm.

pjmlp · 3y ago

OpenCL is hardly an alternative: plain old C, compiled from source at runtime, with very basic tooling available.

Versus a polyglot compiler infrastructure, IDE tooling that includes shader debugging, and a rich ecosystem of GPU-based libraries.

Even with SYCL and SPIR-V, that has hardly improved, and while Intel bases oneAPI on top of SYCL, that naturally also goes beyond the standard.

CodeArtisan · 3y ago

Performance in Blender is atrocious; the RX 7900 XTX is noticeably slower than an RTX 3060.

ColonelPhantom · 3y ago

A big part of the reason is that Blender on Nvidia supports hardware accelerated ray tracing using OptiX. HIP-RT exists, but is not used in Blender yet. I think the Intel oneAPI backend for Arc GPUs also misses RT acceleration.

AMD claims to have HIP-RT working internally, but not yet in a state suitable for posting publicly. Intel is planning it, I think. Both should land around Blender 3.6, if I'm not mistaken.

If you take the raw FLOPS, CUDA (not OptiX) and HIP are actually nearly equivalent in performance last I remember. I think RDNA2 just does "more with less", at least in terms of gaming performance per FLOP (e.g. due to the huge cache).

snvzz · 3y ago

Latest Blender release does not have the optimization work in yet.

AIUI, what's in current git master is very different.

lalaland1125 · 3y ago · 7 in thread

I really wish more GPU libraries had focused on Vulkan instead of CUDA ...

rowanG077 · 3y ago

It's one of the reasons Nvidia is basically untouchable at this time. The AI field willingly enslaved itself to Nvidia.

my123 · 3y ago

It's because NVIDIA actually cared and AMD does not where it matters (customer HW).

Openness is totally secondary to functionality. You know, the same kind of reason that Linux on the desktop has not been a mass-market thing compared to Windows for a very long time.

AceJohnny2 · 3y ago

CUDA predates Vulkan by over 8 years.

There's a lot of established ecosystem for CUDA, thanks to Nvidia's investment.

andy_ppp · 3y ago

I thought Vulkan was a graphics-specific layer and CUDA was specifically for machine learning?

mattnewton · 3y ago

CUDA is general-purpose compute, but Nvidia also releases cuDNN, which all the major libraries use because it is fast and good (if a little complex). There are efforts underway to build a comparable library on open-source general compute packages, but none is as mature or effective as cuDNN, so in practice people just pay Nvidia to use that, which lets Nvidia invest even more in pulling ahead.

As an aside, I’ve been kinda surprised that this has existed for as long as it has, but I am probably biased and think ML acceleration is more important than most large businesses do today.

lalaland1125 · 3y ago

Vulkan is designed for all GPU needs, from rendering to general purpose compute.

CUDA is only for general purpose compute.

mhh__ · 3y ago

CUDA is for GPGPU (general-purpose GPU computing), which includes machine learning.

Vulkan is primarily for graphics but does have options for GPGPU too. However, Vulkan is unlike OpenGL in that it sits fairly close to the hardware in terms of abstraction.

imhoguy · 3y ago · 5 in thread

Any chance of getting SD running on a mobile Ryzen APU, e.g. the Ryzen Pro 4750U (Renoir)?

delijati · 3y ago

Short answer: no. Long answer: "in theory", yes. I tried this [1] but gave up, as building ROCm + deps takes up to 6h :/ Official statement: [2]

[1] https://github.com/xuhuisheng/rocm-build
[2] https://github.com/RadeonOpenCompute/ROCm/issues/1587

nicolaslem · 3y ago

For anyone on Arch, there is a third-party repository called arch4edu[0] that provides up-to-date builds of ROCm and its dependencies. On my iGPU, OpenCL sometimes works and sometimes crashes. Even finding a list of supported hardware is close to impossible. The whole situation is just ridiculous and makes AMD look bad.

[0] https://github.com/arch4edu/arch4edu

magic_at_nodai · 3y ago

Can you give SHARK a try and let us know on our Discord? We can try to help. People have been using it on older AMD GPUs, going back to the Polaris architecture.

sosborn · 3y ago

This worked for me with a 5700 XT:

https://www.travelneil.com/stable-diffusion-windows-amd.html

negativegate · 3y ago

nod-ai/SHARK from the original submission is by far the fastest way I've found to run Stable Diffusion on a 5700 XT.

For 50 iterations:

* ONNX on Windows was 4-5 minutes

* ROCm on Arch Linux was ~2.5 minutes

* SHARK on Windows is ~30 seconds
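
To put the three timings on a common scale, here is a rough sketch that converts them to iterations per second. It treats the approximate figures above as exact (an assumption), and the dictionary and function names are made up for illustration.

```python
ITERATIONS = 50

# Wall-clock seconds for one 50-iteration run, as (low, high).
# Assumption: "~2.5 minutes" and "~30 seconds" are taken as exact values.
runs = {
    "ONNX on Windows": (240, 300),
    "ROCm on Arch Linux": (150, 150),
    "SHARK on Windows": (30, 30),
}


def iterations_per_second(seconds):
    """Throughput for a run of ITERATIONS steps that took `seconds`."""
    return ITERATIONS / seconds


for name, (low, high) in runs.items():
    # The longer run time gives the lower rate.
    slowest = iterations_per_second(high)
    fastest = iterations_per_second(low)
    print(f"{name}: {slowest:.2f}-{fastest:.2f} it/s")

# Ratio of run times, i.e. how many times faster SHARK was than ROCm here.
shark_vs_rocm = runs["ROCm on Arch Linux"][0] / runs["SHARK on Windows"][0]
print(f"SHARK vs ROCm: {shark_vs_rocm:.1f}x")
```

On these figures SHARK runs at roughly 1.67 it/s, about 5x the ROCm result and 8-10x the ONNX one.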

ggerganov · 3y ago · 2 in thread

> There has also been a wide variety of accuracy-degrading performance optimizations like Xformers and Flash Attention, which are great tools if you are open to trading accuracy for performance ..

I wasn't aware that Flash Attention trades accuracy for performance. Either I have a wrong understanding of what FA is, or this statement is not fully accurate.

Either way - looks like great work

marcyb5st · 3y ago

From the flash attention paper:

> We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method.

So I assume they are using the approximate version as they also have an exact version.
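
As a toy illustration of why a block-sparse variant is only approximate, the sketch below computes softmax attention for a single query and optionally skips whole blocks of keys. This is not the FlashAttention code; the function name, the `keep_block` parameter and the sample vectors are all made up for the example.

```python
import math


def attention(q, keys, values, keep_block=None, block=2):
    """Softmax attention for one query vector.

    keep_block[i] == False skips key/value block i entirely
    (a hypothetical, hand-picked sparsity pattern).
    """
    n_blocks = math.ceil(len(keys) / block)
    if keep_block is None:
        keep_block = [True] * n_blocks
    kept = [i for i in range(len(keys)) if keep_block[i // block]]
    scores = [sum(a * b for a, b in zip(q, keys[i])) for i in kept]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[i][d] for w, i in zip(weights, kept)) / total
        for d in range(dim)
    ]


q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

dense = attention(q, keys, values)
sparse = attention(q, keys, values, keep_block=[True, False])
error = max(abs(a - b) for a, b in zip(dense, sparse))
print(dense, sparse, error)
```

Dropping the second block changes the output, because the skipped keys no longer contribute to the softmax. Keeping every block reproduces the dense result.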

ggerganov · 3y ago

Thanks for that - I missed the block-sparse extension of the algorithm when I first read about it. And indeed this seems to be what the author means.

tehsauce · 3y ago · 1 in thread

“There has also been a wide variety of accuracy-degrading performance optimizations like Xformers and Flash Attention, which are great tools if you are open to trading accuracy for performance”

This is incorrect. Those optimizations perform identical computations but use the GPU's memory bandwidth more effectively, so there is no accuracy tradeoff there.
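
A minimal sketch of the idea, assuming the exact (non-sparse) variant: attention can be computed one block of keys at a time with a running softmax, and the result matches the all-at-once version up to floating-point rounding. This is plain Python for a single query, not the actual Xformers or FlashAttention kernels, and the function names are made up.

```python
import math
import random


def naive_attention(q, keys, values):
    """Softmax attention for one query, materialising every score at once."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def tiled_attention(q, keys, values, block=4):
    """Same maths, visiting keys/values one block at a time (online softmax)."""
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]


random.seed(0)
q = [random.gauss(0, 1) for _ in range(8)]
keys = [[random.gauss(0, 1) for _ in range(8)] for _ in range(10)]
values = [[random.gauss(0, 1) for _ in range(8)] for _ in range(10)]

max_diff = max(abs(a - b) for a, b in zip(naive_attention(q, keys, values),
                                          tiled_attention(q, keys, values)))
print(max_diff)
```

The tiled version never holds more than one block of scores at a time, which is the memory saving; the only difference in the output comes from the order of floating-point operations.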

magic_at_nodai · 3y ago

Here is a list of potential issues: https://github.com/AUTOMATIC1111/stable-diffusion-webui/disc...

That said, we (the Nod.ai team) will add support for xformers soon, so you can opt in to xformers anyway.

chem83 · 3y ago

> SHARK is an open source cross platform (Windows, macOS and Linux) Machine Learning Distribution packaged with torch-mlir (for seamless PyTorch integration), LLVM/MLIR for re-targetable compiler technologies along with IREE (for efficient codegen, compilation and runtime) and Nod.ai’s tuning. IREE is part of the OpenXLA Project

Google has been doing a good job advancing the IREE ML compiler project, which I think is what will bring other hardware platforms like AMD and Intel into the ML game. The industry only stands to benefit from increased hardware portability.

thunkshift1 · 3y ago

Can someone explain what exactly nod.ai does? It's not clear at all from their page.

brokenmachine · 3y ago

Can anyone point me to some examples of what I, as a techie, might want to actually use AI for? Some simple hobby projects?
