Brought some simulations down from about 30 minutes to under a second.
My point in the article was basically that the class was "indoctrinating" (too strong a word, but you get the idea) future ML researchers into the superiority of CUDA, with NVIDIA spending company resources to keep doing so in these classes, year after year.
Even if you could compile CUDA for Intel and AMD, it wouldn't perform well. When you program a GPU you aren't just writing task-specific code, you are also writing hardware-specific code. So having developer mindshare matters much more than having a nice programming language.
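To make "hardware specific" concrete, here's a rough sketch of my own (not from the thread): a sum reduction whose whole shape is dictated by NVIDIA details such as the 32-lane warp, `__shfl_down_sync`, and a block size picked for SM occupancy. On an AMD wavefront (64 lanes wide) or an Intel GPU, these choices stop making sense even if the syntax compiled.

```cuda
// Sketch only: the structure assumes NVIDIA hardware throughout.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void sum_kernel(const float* in, float* out, int n) {
    float v = 0.0f;
    // Grid-stride loop so one launch covers any n.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        v += in[i];

    // Warp-level tree reduction: hard-codes the NVIDIA warp size of 32.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);

    // One atomic add per warp; lane 0 holds the warp's partial sum.
    if ((threadIdx.x & 31) == 0)
        atomicAdd(out, v);
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    *out = 0.0f;

    // 256 threads per block is a typical occupancy-driven choice on NVIDIA SMs.
    sum_kernel<<<256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("sum = %f (expected %d)\n", *out, n);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```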
In ML many people write PyTorch and not CUDA. But even in ML the choice of precision is driven by the data types NVIDIA's hardware can deal with efficiently - that's a moat which has nothing to do with CUDA.
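As a hedged illustration of what I mean (my own toy example, not anyone's production code): the "mixed precision" recipe that frameworks lower to on NVIDIA GPUs - bf16 inputs with fp32 accumulation - exists because that is the pairing tensor cores run fast. The precision policy follows the silicon, not the language.

```cuda
// Tiny cublasGemmEx call with bf16 inputs and fp32 accumulate.
// Link with -lcublas; the bf16 tensor-core path needs an Ampere-or-newer GPU.
#include <cstdio>
#include <vector>
#include <cuda_bf16.h>
#include <cublas_v2.h>

int main() {
    const int n = 4;  // tiny n x n GEMM, just to show the data-type plumbing
    std::vector<__nv_bfloat16> hA(n * n, __float2bfloat16(1.0f));
    std::vector<__nv_bfloat16> hB(n * n, __float2bfloat16(2.0f));
    std::vector<float> hC(n * n, 0.0f);

    __nv_bfloat16 *dA, *dB;
    float *dC;
    cudaMalloc(&dA, n * n * sizeof(__nv_bfloat16));
    cudaMalloc(&dB, n * n * sizeof(__nv_bfloat16));
    cudaMalloc(&dC, n * n * sizeof(float));
    cudaMemcpy(dA, hA.data(), n * n * sizeof(__nv_bfloat16), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), n * n * sizeof(__nv_bfloat16), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // bf16 inputs, fp32 accumulation: the combination the tensor cores favor.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                 &alpha,
                 dA, CUDA_R_16BF, n,
                 dB, CUDA_R_16BF, n,
                 &beta,
                 dC, CUDA_R_32F, n,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);

    cudaMemcpy(hC.data(), dC, n * n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("C[0] = %f (expected %f)\n", hC[0], 2.0f * n);

    cublasDestroy(handle);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return 0;
}
```

You could express the same GEMM in any API; the point is that the type combination itself was picked because it is what the hardware rewards.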
The world is deeper than just assembly and BLAS tuning, and you can get extremely far in CUDA just by gluing together the primitives they give you. Python is popular in the AI/ML space, but far from the only way to do that.
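For instance (a minimal sketch of mine, with a made-up toy computation): no hand-written kernel, no BLAS tuning, just Thrust primitives that ship with the CUDA toolkit.

```cuda
// "Gluing together the primitives": fill, transform, and reduce on the GPU
// entirely with stock Thrust calls.
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>

int main() {
    using namespace thrust::placeholders;
    const int n = 1 << 10;
    thrust::device_vector<float> x(n), y(n);

    thrust::sequence(x.begin(), x.end());                       // x = 0, 1, 2, ...
    thrust::transform(x.begin(), x.end(), y.begin(), 2.0f * _1); // y = 2 * x
    float total = thrust::reduce(y.begin(), y.end(), 0.0f,
                                 thrust::plus<float>());         // sum on the GPU

    printf("sum = %f (expected %d)\n", total, n * (n - 1));
    return 0;
}
```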