The last part is particularly important. Most deep learning libraries rely on components like cuDNN, which provide hand-written, highly efficient implementations of basic operations.
AMD's ROCm initiative can already "transpile" most CUDA kernels into AMD-compatible code, but its performance does not yet seem comparable in real-world benchmarks.