The thing that nobody talks about is the high failure rate on this high-end equipment. I've heard as high as 20% in the first month. I'm not even talking about AMD here.
If anyone thinks they can just buy some accelerators and throw them into a rack and expect them to work flawlessly... they've got some hard lessons to learn.
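To put that number in perspective, here is a back-of-the-envelope sketch. It assumes the 20% figure (which is hearsay, not measured data) applies to each unit independently, which is a simplification on my part since real failures tend to cluster by batch, firmware, or chassis. The function names and the node sizes are made up for illustration.

```python
# Rough arithmetic only: assumes each accelerator fails independently
# with the same probability during its first month (a simplification).

def expected_failures(n_units: int, p_fail: float) -> float:
    """Expected number of failed units out of n_units."""
    return n_units * p_fail


def prob_at_least_one_failure(n_units: int, p_fail: float) -> float:
    """Probability that at least one of n_units fails."""
    return 1.0 - (1.0 - p_fail) ** n_units


if __name__ == "__main__":
    p = 0.20  # the "as high as 20%" figure quoted above (hearsay)
    for n in (8, 64):  # hypothetical deployment sizes
        print(
            f"{n} units: expect {expected_failures(n, p):.1f} failures, "
            f"P(at least one) = {prob_at_least_one_failure(n, p):.1%}"
        )
```

Under those assumptions, a single 8-accelerator box has roughly an 83% chance of losing at least one unit in the first month, which is why spare stock matters so much for a small deployment.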
This will be less of an issue as we grow, since we'll have plenty of spare stock to pull from, but it is a real bummer while we are starting out with a proof of concept. We started working on this business last August, before anyone knew whether or not AMD would even change course on AI.
The good news is that we onboarded a customer the day we announced our availability, passed that PoC challenge with flying colors, and closed significant additional funding immediately after. Onwards and upwards; we just have to roll with the punches.
And I'm not even talking about 100/400G networking. Wonderful, wonderful hardware, but good luck debugging and getting all the RoCE/RDMA/GPUDirect/StorageDirect/NCCL pieces working (already a bit of a pain on NVIDIA, even with its large installed base...).
Either you want to learn all this stuff (for reasons) or you're dumping a lot of money on fast-evolving tech.