Nvidia has shown time and time again that they will royally fuck over anyone they have to in order to drive profit.
Then, if a partner dares to talk negatively about them, Nvidia will just discontinue their access to hardware.
Not saying AMD is a savior, but having only ONE option will lead to long term issues.
I do hope that AMD gets their shit in order because we need the competition to keep this space energized.
Yup - which is exactly what is going on in the cloud space right now.
Because AWS and GCP chose to innovate with their own accelerators, Nvidia heavily favoured Azure for a while. Recently, GCP seem to have capitulated somehow and so are back on the bandwagon. Oracle, of course, never had any hope of success in cloud without leaning on some form of non-technical manipulation, which is why they were the first on board with DGX Cloud.
Sadly I don't see AMD as the solution, since they too have associated themselves more with Azure than the other clouds.
AWS and GCP work on competitors to Nvidia's products, so Nvidia favors Azure, which is not doing that, and this is somehow Nvidia's fault, or even a problem?
Looks more like Nvidia was hedging its bets in case AWS or GCP succeeded at developing competitive AI chips and then transitioned completely away from Nvidia.
The dependencies are such a mess that even if you try to install only pytorch-cpu, at some point some random package will pull in pytorch-cuda and its ~10 GB of libraries.
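One way to avoid that (a sketch, assuming pip and the official PyTorch wheel index; the exact index URL is what PyTorch's own install selector generates, so worth double-checking against pytorch.org for your platform):

```
# requirements.txt — point pip at the CPU-only wheel index so that
# whatever package depends on "torch" resolves to the CPU build
# instead of the default CUDA-bundled one from PyPI.
--index-url https://download.pytorch.org/whl/cpu
torch
torchvision
```

This only helps when everything installs through the same requirements file; a stray `pip install somepackage` against the default index can still drag the CUDA build back in.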
Sure, after building, the binary is HUGE. But I only have to build it once and cache it so that all my workstations and training servers can use it.
I don't know, as I'm only just now building out my first AMD-based ML machine to run ROCm. All I can really say is that AMD really seems to be making a genuine effort to get ROCm to that level. See the two links I submitted yesterday[1][2] for more details.
The two things in particular that stand out to me from all this are:
1. They are at least publicly declaring their intention to make ROCm a player in AI/ML. Previously there was at least a perception (and quite possibly a reality) that ROCm was more focused on other HPC workloads and not really AI / ML. AMD seems committed to changing that.
2. It seems that they are finally serious about getting ROCm working on their consumer Radeon cards. Even though 5.6 didn't include the long hoped-for announcement of such support, the blog post they put out did at least officially declare their intent to do so in a release this fall. And maybe more to the point, the batch of changes in 5.6 did actually include some fixes for problems encountered running on Radeon cards, even though they aren't yet officially listed as supported.
If you're looking for something on AMD consumer cards... then you'll have to keep waiting.
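For anyone doing a similar first ROCm build, a quick sanity check once the card is in (a sketch; assumes the ROCm utilities and a ROCm build of PyTorch are already installed, and package names vary by distro):

```shell
# Does the ROCm runtime see the GPU at all? rocminfo lists agents
# and their gfx target (e.g. gfx1100 for RDNA3 cards).
rocminfo | grep -i gfx

# Basic health/telemetry: clocks, temperature, VRAM usage.
rocm-smi

# Does PyTorch pick it up? ROCm builds of PyTorch reuse the
# torch.cuda API, so is_available() returning True here means the
# HIP backend found the card; torch.version.hip is None on a
# non-ROCm build.
python3 -c "import torch; print(torch.version.hip, torch.cuda.is_available())"
```

On Radeon cards that aren't on the official support list, people commonly export `HSA_OVERRIDE_GFX_VERSION` to a nearby supported gfx target to get ROCm to load at all; that's an unsupported workaround, so results vary by card.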