PTX is not general purpose. It was designed specifically for that architecture, and it incorporates these tradeoffs.
> far more advanced than any SIMD I've seen implemented on a CPU.
Less advanced than normal non-SIMD branching available to CPU cores. A lot of practical algorithms need both SIMD compute and scalar branching.
> That's what the CudaMalloc function does.
The cudaMalloc function can't be called from GPU code. You can't allocate RAM at all from inside your algorithm: there's no stack, no heap, nothing. You have to know in advance how much RAM you need. For some practical problems this is a showstopper, e.g. try implementing unzip in CUDA.
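When the output size can be computed cheaply up front, the usual workaround is two passes: size everything first, allocate once, then write. Here is a rough sketch in plain C++ on the CPU, using a toy run-length format. Run and decodeTwoPass are made-up names for illustration, and on a GPU each loop would be its own kernel. For something like DEFLATE the sizing pass is essentially the sequential decode itself, so this trick doesn't rescue the unzip case.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical toy input: (count, byte) pairs, i.e. a trivial run-length encoding.
struct Run {
    std::uint32_t count;
    char value;
};

std::string decodeTwoPass(const std::vector<Run>& runs) {
    // Pass 1: exclusive prefix sum of run lengths gives every run its output offset.
    std::vector<std::size_t> offsets(runs.size());
    std::size_t total = 0;
    for (std::size_t i = 0; i < runs.size(); ++i) {
        offsets[i] = total;
        total += runs[i].count;
    }

    // The only output allocation, sized exactly, made before any output is written.
    std::string out(total, '\0');

    // Pass 2: each run fills its own disjoint slice, so runs are independent of each other.
    for (std::size_t i = 0; i < runs.size(); ++i) {
        for (std::uint32_t j = 0; j < runs[i].count; ++j) {
            out[offsets[i] + j] = runs[i].value;
        }
    }
    return out;
}
```

The cost is reading the input twice, and it only works when the sizes are known without doing the real work.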
> What do you mean by "limited write access"?
You only have fixed-length arrays and textures. Global memory barriers are very slow. There are producer-consumer buffers, but practically speaking they're merely thread-safe cursors over a statically sized buffer.
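To make "thread-safe cursor over a statically sized buffer" concrete, here is a minimal CPU-side sketch in C++. AppendBuffer is a made-up name, not any GPU API. The point is only that the capacity is fixed up front and an atomic counter hands out slots.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Hypothetical fixed-capacity append buffer: a static array plus an atomic cursor.
template <typename T, std::size_t Capacity>
struct AppendBuffer {
    std::array<T, Capacity> items{};
    std::atomic<std::size_t> cursor{0};

    // Claims the next slot atomically. Returns false when the buffer is full,
    // because it cannot grow; the element is simply dropped.
    bool append(const T& value) {
        const std::size_t slot = cursor.fetch_add(1, std::memory_order_relaxed);
        if (slot >= Capacity) {
            return false;
        }
        items[slot] = value;
        return true;
    }

    // Number of elements actually stored; the cursor itself may overshoot on overflow.
    std::size_t size() const {
        const std::size_t n = cursor.load(std::memory_order_relaxed);
        return n < Capacity ? n : Capacity;
    }
};
```

If you guess the capacity too low you lose data, and if you guess too high you waste memory. Nothing in the buffer itself helps with that choice.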
GPUs can multiply dense matrices very fast, but many practical problems are different. For sparse matrices, for example, GPUs don't deliver much value performance-wise. SIMD on a CPU is often very useful for such problems, but yes, the programming model is different: lower level and more complex. No free lunches.
> Programmers don't want to "think" in SIMD.
I'm a programmer and I like SIMD.
> even though vpshufb is effectively a gather instruction over an AVX register
On a GPU it would be, because you don't care about latency: you just spawn more threads.
On a CPU it's not, because shuffle_epi8 is a 1-cycle-latency instruction and RAM access is much slower. If you think they're equivalent, you'll miss the performance difference.
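For reference, here is a scalar C++ model of what the 128-bit byte shuffle computes. shuffleBytesModel is a made-up name, and this is only a sketch of the semantics, not how you'd write it in practice. The real instruction does all 16 lookups at once, entirely inside registers.

```cpp
#include <array>
#include <cstdint>

using Bytes16 = std::array<std::uint8_t, 16>;

// Scalar reference model of the 128-bit byte shuffle (hypothetical helper).
Bytes16 shuffleBytesModel(const Bytes16& table, const Bytes16& indices) {
    Bytes16 result{};
    for (int i = 0; i < 16; ++i) {
        const std::uint8_t idx = indices[i];
        // High bit set zeroes the output byte; otherwise the low 4 bits pick a table byte.
        result[i] = (idx & 0x80) ? std::uint8_t{0} : table[idx & 0x0F];
    }
    return result;
}
```

So it does behave like a gather, but the "memory" being gathered from is a 16-byte table that already sits in a register, which is why the latency is nothing like a load from RAM.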
> CPUs are missing very, very few assembly instructions before they can run like a GPU
Even if you add these few instructions, GPUs will still be much faster, by orders of magnitude. The hardware is too different, and it's not about the instructions. A CPU spends transistors on minimizing latency (caches and their synchronization, branch prediction, speculative execution, etc.). GPUs don't care about latency.
It's not the instruction set that allowed a simple programming model on GPUs. It's the fundamentally different tradeoffs in hardware.