Nvidia has shown time and time again that they will royally fuck over anyone they have to in order to drive profit.
Then, if a partner dares to talk negatively about them, Nvidia will just discontinue their access to hardware.
Not saying AMD is a savior, but having only ONE option will lead to long term issues.
I do hope that AMD gets their shit in order because we need the competition to keep this space energized.
Yup - which is exactly what is going on in the cloud space right now.
Because AWS and GCP chose to innovate with their own accelerators, Nvidia heavily favoured Azure for a while. Recently, GCP seem to have capitulated somehow and so are back on the bandwagon. Oracle, of course, never had any hope of success in cloud without leaning on some form of non-technical manipulation, which is why they were the first on board with DGX Cloud.
Sadly I don't see AMD as the solution, since they too have associated themselves more with Azure than the other clouds.
AWS and GCP work on competitors to Nvidia's products, so Nvidia favors Azure, which is not doing that, and this is somehow Nvidia's fault, or even a problem?
Looks more like Nvidia was hedging its bets in case AWS or GCP succeeded at developing competitive AI chips and then transitioned completely away from Nvidia.
The dependencies are such a mess that even if you try to install only pytorch-cpu, at some point some random package will pull in pytorch-cuda and its ~10 GB of libraries.
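One way to avoid that (a sketch, assuming pip and the official PyTorch wheel index; the exact index URL is what PyTorch's own install selector generates, so worth double-checking against pytorch.org for your platform):

```
# requirements.txt — point pip at the CPU-only wheel index so that
# whatever package depends on "torch" resolves to the CPU build
# instead of the default CUDA-bundled one from PyPI.
--index-url https://download.pytorch.org/whl/cpu
torch
torchvision
```

This only helps when everything installs through the same requirements file; a stray `pip install somepackage` against the default index can still drag the CUDA build back in.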
Sure, after building, the binary is HUGE. But I only have to build it once and cache it so that all my workstations and training servers can use it.
I don't know, as I'm only just now building out my first AMD-based ML machine to run ROCm. All I can really say is that AMD really seems to be making a genuine effort to get ROCm to that level. See the two links I submitted yesterday[1][2] for more details.
The two things in particular that stand out to me from all this are:
1. They are at least publicly declaring their intention to make ROCm a player in AI/ML. Previously there was at least a perception (and quite possibly a reality) that ROCm was more focused on other HPC workloads and not really AI / ML. AMD seems committed to changing that.
2. It seems that they are finally serious about getting ROCm working on their consumer Radeon cards. Even though 5.6 didn't include the long hoped-for announcement of such support, the blog post they put out did at least officially declare their intent to do so in a release this fall. And maybe more to the point, the batch of changes in 5.6 did actually include some fixes for problems encountered running on Radeon cards, even though they aren't yet officially listed as supported.
If you're looking for something on AMD consumer cards... then you'll have to keep waiting.
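For anyone doing a similar first ROCm build, a quick sanity check once the card is in (a sketch; assumes the ROCm utilities and a ROCm build of PyTorch are already installed, and package names vary by distro):

```shell
# Does the ROCm runtime see the GPU at all? rocminfo lists agents
# and their gfx target (e.g. gfx1100 for RDNA3 cards).
rocminfo | grep -i gfx

# Basic health/telemetry: clocks, temperature, VRAM usage.
rocm-smi

# Does PyTorch pick it up? ROCm builds of PyTorch reuse the
# torch.cuda API, so is_available() returning True here means the
# HIP backend found the card; torch.version.hip is None on a
# non-ROCm build.
python3 -c "import torch; print(torch.version.hip, torch.cuda.is_available())"
```

On Radeon cards that aren't on the official support list, people commonly export `HSA_OVERRIDE_GFX_VERSION` to a nearby supported gfx target to get ROCm to load at all; that's an unsupported workaround, so results vary by card.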