However, before dismissing this all as a bad mapping to an outdated 1970s model of computation, I'd like to see a good alternative. CUDA has clearly shown that there's a workable model for massively data-parallel workloads, but it doesn't handle branch-heavy code well at all. FPGAs take a different approach to a completely different kind of problem. But I don't know how you would expose what Apple, AMD, or Intel chips are doing under the hood and still have it be manageable for the programmer. How is someone supposed to indicate what should run next when a pipeline stalls waiting on the previous operation or a cache miss? Is the programmer going to toss micro-ops into separate execution units and wait for the results to come out the other side in arbitrary order? Is this an async/await model for every addition or memory fetch? I think it would be complete spaghetti to even try, but I'd love to be shown I'm wrong.
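To make that question concrete, here's a rough sketch of what "async/await for every addition or memory fetch" might look like, written in Python. The names `load`, `alu_add`, and `alu_mul` are made up for illustration; no real CPU exposes an interface like this, and asyncio here only simulates the scheduling, not actual hardware behavior.

```python
import asyncio

# Hypothetical stand-ins for execution units. These names are invented for
# this sketch; asyncio.sleep(0) just yields to the scheduler to mimic
# "waiting on the hardware".

async def load(memory, addr):
    await asyncio.sleep(0)  # pretend this could be a cache miss
    return memory[addr]

async def alu_add(a, b):
    await asyncio.sleep(0)  # pretend we are waiting on an adder
    return a + b

async def alu_mul(a, b):
    await asyncio.sleep(0)  # pretend we are waiting on a multiplier
    return a * b

async def compute(memory):
    # Plain version: (memory[0] + memory[1]) * (memory[2] + memory[3])
    # Issue all four loads at once; gather returns results in argument order.
    a, b, c, d = await asyncio.gather(*(load(memory, i) for i in range(4)))
    # The two additions are independent, so issue them together.
    left, right = await asyncio.gather(alu_add(a, b), alu_add(c, d))
    # The multiply depends on both sums, so it has to wait for them.
    return await alu_mul(left, right)

def run(memory):
    return asyncio.run(compute(memory))
```

Calling `run([1, 2, 3, 4])` evaluates (1 + 2) * (3 + 4). That's a one-line expression in C, and here the programmer is already doing dependency scheduling by hand, which is roughly the spaghetti I have in mind.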
People get all excited trash-talking Itanium, but I think the lesson is that if you try to expose any alternative to the 1970s model, people will just bitch that there are no sufficiently smart compilers. And of course it got scooped by AMD64, which kept pretending to execute one instruction after another.
And if there isn't a good alternative, I think C (or Rust, or WASM) is a pretty good fit for what you've actually got to work with at the low level.