To give a feel: while at Berkeley, we had an award-winning grad student working on autotuning CUDA kernels and empirically figuring out what does / doesn't work well on some GPUs. Nvidia engineers would come to him to learn how their hardware and code work together in surprisingly basic scenarios.
It's difficult to write great CUDA code because it needs to excel in multiple specializations at the same time:
* It's not just writing fast low-level code, but knowing which algorithms to use. So you or your code reviewer needs to be an expert at algorithms. Worse, those algorithms are high-level, unfamiliar to most programmers, and specific to hardware models: think scenarios like NUMA-aware data-parallel algorithms for irregular computations. The math is generally non-traditional too, e.g., esoteric matrix tricks to manage sparsity and numerical stability.
* Ideally you write for more than one generation of architectures. Each architecture changes all sorts of basic constants around memory/thread/etc counts at multiple layers of the hierarchy. If you're good, you also add autotuning & JIT layers to adjust for different generations, models, and inputs.
* This stuff needs to compose. Most folks are good at algorithms, software engineering, or performance... not all three at the same time. Doing this for parallel/concurrent code is one of the hardest areas of computer science. Ex: maintaining determinism, thinking through memory lifecycles, supporting both async and sync callers, handling multitenancy, ... . In practice, resiliency in CUDA land is ~non-existent. Overall, while there are cool projects, the Rust etc revolution hasn't happened here yet, so systems & software engineering still feels like early unix & c++ vs what we know is possible.
* AI has made this even more interesting. The types of processing on GPUs are richer now, multi- and many-GPU setups are much more of a thing, and so is disk IO. For big national lab and genAI foundation model level work, you also have to think about many racks of GPUs, not just a few nodes. While there's more tooling, the problem space is harder.
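To make the "esoteric matrix tricks to manage sparsity" point concrete, here's a toy CSR (compressed sparse row) matrix-vector product in plain Python. It's only a sketch of the data-layout reasoning involved, not GPU code: on a real GPU, how rows map to warps and whether column accesses coalesce dominates performance, and that's exactly the hardware-specific expertise the bullet is about.

```python
def csr_matvec(indptr, indices, data, x):
    # CSR stores only nonzeros: row i's entries live in data[indptr[i]:indptr[i+1]],
    # with their column ids in the parallel `indices` array.
    y = []
    for i in range(len(indptr) - 1):
        acc = 0.0
        for k in range(indptr[i], indptr[i + 1]):
            acc += data[k] * x[indices[k]]
        y.append(acc)
    return y

# The 2x3 matrix [[1, 0, 2], [0, 3, 0]] in CSR form:
indptr, indices, data = [0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0]
csr_matvec(indptr, indices, data, [1.0, 1.0, 1.0])  # [3.0, 3.0]
```

On a GPU, the irregular row lengths here are why "one thread per row" kernels lose to warp-per-row or merge-based schemes on some inputs and win on others.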
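The autotuning idea from the architectures bullet can be sketched in a few lines: sweep candidate configurations, benchmark each on the actual hardware and input, and keep the fastest. The function names and the toy "kernel" below are hypothetical stand-ins; a real CUDA autotuner sweeps block/tile sizes and caches the winner per GPU model and input shape.

```python
import time

def run_with_tile(n, tile):
    # Toy "kernel": sum 0..n-1 in chunks of `tile`.
    # Stand-in for launching a real kernel with a given tile/block size.
    total = 0
    for start in range(0, n, tile):
        total += sum(range(start, min(start + tile, n)))
    return total

def autotune(n, candidates):
    # Empirically pick the fastest config, since the right constant
    # changes per architecture generation, model, and input size.
    best, best_t = None, float("inf")
    for tile in candidates:
        t0 = time.perf_counter()
        run_with_tile(n, tile)
        dt = time.perf_counter() - t0
        if dt < best_t:
            best, best_t = tile, dt
    return best

best_tile = autotune(100_000, [256, 1024, 4096])
```

The expensive part in practice isn't this loop, it's building the search space of legal configs and deciding when retuning is worth the cost.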
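One small illustration of the determinism point: floating-point addition isn't associative, so a parallel reduction whose combine order depends on thread scheduling (e.g., atomics) can return different bits run to run. A fixed-shape pairwise tree reduction, sketched here in plain Python, makes the order a function of the input length only, which is one standard way CUDA reductions stay reproducible.

```python
def tree_reduce(xs):
    # Pairwise reduction with a shape fixed by len(xs): the combine
    # order never depends on scheduling, so results are bit-identical
    # across runs, unlike an atomics-based accumulation.
    xs = list(xs)
    while len(xs) > 1:
        xs = [xs[i] + xs[i + 1] if i + 1 < len(xs) else xs[i]
              for i in range(0, len(xs), 2)]
    return xs[0]

tree_reduce([1, 2, 3, 4, 5])  # 15
```

The same fixed-order idea is why "deterministic mode" in ML frameworks often costs performance: you give up the fastest scheduling-dependent reductions.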
This is very hard to build for. Our solution early on was figuring out how to raise the abstraction level so we didn't have to. In our case, we figured out how to write ~all our code as operations over dataframes that we compiled down to OpenCL/CUDA, and Nvidia thankfully picked that up with what became RAPIDS.AI. Maybe more familiar to the HN crowd: it's basically the precursor to, and the GPU / high-performance / energy-efficient / low-latency version of, what the duckdb folks recently began on the (easier) CPU side for columnar analytics.
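The abstraction-raising idea looks roughly like this: express the work as whole-column dataframe operations instead of per-element loops, so a backend is free to compile each operation to a tuned GPU kernel. Below is a toy columnar table and groupby-sum in plain Python purely to show the shape of the API contract; the function name is made up, but the pattern is what pandas-style libraries (and cudf on GPU) expose.

```python
# Columnar table: each column is a flat array. This layout is what
# RAPIDS on GPU and duckdb on CPU exploit: whole-column operations
# can be dispatched to one tuned kernel instead of a per-row loop.
table = {"key": ["a", "b", "a", "b", "a"],
         "val": [1, 2, 3, 4, 5]}

def groupby_sum(table, key_col, val_col):
    # The kind of declarative operation a backend can lower to a
    # GPU hash-aggregate without the user writing any CUDA.
    out = {}
    for k, v in zip(table[key_col], table[val_col]):
        out[k] = out.get(k, 0) + v
    return out

groupby_sum(table, "key", "val")  # {'a': 9, 'b': 6}
```

The user-facing win is exactly the one in the paragraph above: the team writes dataframe logic once, and the hardware-specific optimization lives in the backend.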
It's hard to do all that kind of optimization, so IMO it's a bad idea for most AI/ML/etc teams to attempt it. At this point, it takes a company at the scale of Nvidia to properly invest in optimizing this kind of stack, and software developers should use higher-level abstractions, whether pytorch, rapids, or something else. Having spent 15 years building & using these systems, and worked with most of the companies involved, I haven't put any of my investment dollars into AMD or Intel due to the revolving door of poor software culture.
Chip startups also have funny hubris here, where they know they need to try, but end up having hardware people run the show and fail at it. I think it's a bit different this time around because many can focus just on AI inferencing, and that doesn't need as much of what the above is about, at least for current generations.
Edit: If not obvious, much of the code that merits writing with CUDA in mind also merits reading research papers to understand the implications at these different levels. Imagine scheduling that into your agile sprint plan. How many people on your team regularly do that, and in multiple fields beyond whatever simple ICML pytorch layering remix happened last week?