The experience was incredibly simple: write C as usual, annotate a few functions with some extra keywords, and compile with a custom frontend/preprocessor/whatever-nvcc-was instead of gcc. (i was on Linux, and BTW i heavily contest the notion that Nvidia drivers on Linux were a "nightmare" - they always worked just fine for me, with performance and features comparable to their Windows counterparts, while ATi/AMD had buggy and broken drivers for years.) Again, the experience was very simple; i even copy/pasted a bunch of existing C code i had and it just worked.
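To make that concrete, here's a minimal sketch of what that workflow looked like (not the commenter's actual code - an illustrative SAXPY kernel; `__global__`, the built-in thread indices and the `<<<...>>>` launch syntax are the real CUDA extensions, everything else is plain C, built with `nvcc saxpy.cu`):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

/* __global__ marks a function that runs on the GPU but is called from the host */
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* per-thread index */
    if (i < n) y[i] = a * x[i] + y[i];
}

int main(void) {
    enum { N = 1024 };
    float hx[N], hy[N];
    for (int i = 0; i < N; i++) { hx[i] = 1.0f; hy[i] = 2.0f; }

    float *dx, *dy;                                 /* device-side buffers */
    cudaMalloc(&dx, sizeof hx);
    cudaMalloc(&dy, sizeof hy);
    cudaMemcpy(dx, hx, sizeof hx, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, sizeof hy, cudaMemcpyHostToDevice);

    /* launch looks almost like a C function call */
    saxpy<<<(N + 255) / 256, 256>>>(N, 2.0f, dx, dy);

    cudaMemcpy(hy, dy, sizeof hy, cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", hy[0]);                   /* 2*1 + 2 = 4 */
    cudaFree(dx); cudaFree(dy);
    return 0;
}
```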
Later i tried OpenCL, which was supposedly the open alternative. It felt way more primitive and low-level, like writing shaders without the shading bits.
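For contrast, just getting to the point of launching a kernel in OpenCL 1.x meant a pile of host-side ceremony. A rough sketch (error checks, buffer setup and teardown omitted; these are the real OpenCL 1.x host API calls, but this fragment is illustrative, not complete):

```c
#include <CL/cl.h>

/* the kernel itself ships as a string and is compiled at runtime */
const char *src = "__kernel void add_one(__global float *a) {"
                  "  a[get_global_id(0)] += 1.0f; }";

cl_platform_id platform; cl_device_id device;
clGetPlatformIDs(1, &platform, NULL);
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);
cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
cl_kernel k = clCreateKernel(prog, "add_one", NULL);
/* ...and only then clCreateBuffer, clSetKernelArg, clEnqueueNDRangeKernel,
   clEnqueueReadBuffer, plus release calls for every object above. */
```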
In a way, as you wrote, it was kinda like DirectX: CUDA was like using OpenGL 1.1 with its convenient, straightforward C API, while OpenCL was like using DirectX 3 with its COM-infested execute-buffer nonsense.
After that i never really used CUDA (or OpenCL, for that matter), but it left me with the impression that Nvidia put way more effort into developer experience.
Like, I started using CUDA (through frameworks) over ten years ago, and basically nobody has come up with anything competitive since then.
This is a significant understatement. For quite some time Jensen has been saying repeatedly that 30% of Nvidia's R&D spend goes to software. With the money-printing machine that is Nvidia, if that holds, they're going to keep rocketing ahead of competitors in terms of delivering actual solutions.
The "What are you talking about? AMD/Intel runs torch just fine!" crowd clearly hasn't seen things like Riva, DeepStream, NeMo, Triton Inference Server/NIM, etc. Meanwhile AMD (ROCm) still struggles with flash attention...
What these hardware-first (only?) companies like AMD don't seem to understand is that people buy solutions, not GPUs. It just so happens that GPUs are the best way to run these kinds of workloads, but if you don't have a holistic ecosystem around them you end up with single-digit market share versus Nvidia at ~90%.
Since CUDA 3.0, Nvidia has embraced a polyglot stack, with C, C++ and Fortran at the center, and PTX for everyone else.
This was followed by changing the CUDA memory model to match that of C++11.
Khronos never cared about Fortran, and only designed SPIR when it became obvious they were too late to the party.
So not only does CUDA have first-class tooling for C, C++ and Fortran (IDE integration in Visual Studio and Eclipse, plus a graphical GPU debugger with all the goodies of a modern debugger), it also welcomes any compiler toolchain that wants to target PTX.
Java, Haskell, .NET, Julia, Python JITs... there are plenty to choose from, without going through the "compile to OpenCL C99" route.
Finally, there's the myriad of libraries to choose from.
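As one example of a non-C toolchain targeting PTX, Numba lets ordinary Python functions be JIT-compiled into CUDA kernels (a sketch assuming the `numba` and `numpy` packages and an Nvidia GPU at runtime; the function and array names are illustrative):

```python
import numpy as np
from numba import cuda

@cuda.jit                      # JIT-compiles this Python function to PTX
def add_one(arr):
    i = cuda.grid(1)           # global thread index
    if i < arr.size:
        arr[i] += 1.0

data = np.zeros(256, dtype=np.float32)
add_one[1, 256](data)          # launch 1 block of 256 threads
```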
CUDA is not only for AI, by the way.
And because of that, Nvidia's OpenCL implementation also works better than the others'. So there's more tooling built on it, not just from Nvidia, because it. just. works.
Compare this with AMD, whose latest framework is a total mess of "will it work on this GPU?", sometimes needing custom wrangling just to enable, etc., and which is effectively supported only on the most expensive compute-only cards.