I agree about the programming models not being parallel by default, and that's one of the things that I specifically rail against in most of my comments. MATLAB/Octave is a good introduction to what parallel programming could be. Also the endless doubling down on large caches, because the multicore design I have in mind would mostly eliminate cache and use that die area for cores and local memories.
I think we're slightly talking past each other here though. The CPU I want to build would have around 10-256 cores on 90s tech. So the same transistors holding 1 Pentium Pro would allow for 1-2 orders of magnitude more MIPS or RISC-V cores and local memories. The design is so simple that I think that's why it was missed by the big fabs.
Today there's little demand for 1000+ cores, but that's partly because nobody can see what they could do. But we can't design the thing, because the status quo has us all working pedal to the metal in first gear to make rent. It's a chicken and egg problem that has a lower likelihood of being solved as time goes on. Which is why I think we're on the wrong timeline, because if the system worked then actual innovation would become more accessible over time.