Not even from cloud providers which should be telling enough.
Rocm is sensitive to matching kernel version to driver version to userspace version. Staying very much on the kernel version from a official release and using the corresponding driver is drastically more robust than optimistically mixing different components. In particular, rocm is released and tested as one large blob, and running that large blob on a slightly different kernel version can go very badly. Mixing things from GitHub with things from your package manager is also optimistic.
Imagine it as huge ball of code where cross version compatibility of pieces is totally untested.
If you can point me to a repro I'll add it to my todo list. You can probably tag me in the github issue if that's where you reported it.
George has a nice explanation of doorbells:
AMD's driver is in your kernel, all the userspace is on GitHub. The ISA is documented. It's entirely possible to treat the ASICs as mass market subsidized floating point machines and run your own code on them.
Modulo firmware. I'm vaguely on the path to working out what's going on there. Changing that without talking to the hardware guys in real time might be rather difficult even with the code available though.
That said, if you're on the inside, like I am, and you talk to people at AMD (just got off two separate back to back calls with them), rest assured, they are dedicated to making this stuff work.
Part of that is to build a developer flywheel by making their top end hardware available to end users. That's where my company Hot Aisle comes into play. Something that wasn't available before outside of the HPC markets, is now going to be made available.
Even Nvidia GPU's are tricky to sandbox, and it sounds like the AMD cards are really easy for the tenant to break (or at least force a restart of the underlying host).
AWS does have a Gaudi instance which is interesting, but overall I don't see why Azure, AWS & Google would deploy AMD or Intel GPU's at scale vs their own chips.
They need some competitor to Nvidia to help negotiate, but if its going to be a painful software support story suited to only a few enterprise customers, why not do it with your own chip?