I’ve been writing parallel code at the largest scales for most of my career. The state-of-the-art architectures are all, effectively, single-threaded with latency-hiding. This model has a lot of mechanical sympathy with real silicon, which is why it is used. It is also pleasantly simple in practice.
It works pretty well for ordinary heavy crunch code too; I originally designed code like this on supercomputers. You do need a practical mechanism for efficiently decomposing workloads at a fine granularity, but you rarely see compute-bound applications that don’t have this property.
I've (badly) written code like you describe -- usually by abusing fork to do my initial data structure setup, then using C pipes to pass jobs around. I suspect there are much better ways of doing it, but that parallelised well enough for the stuff I was doing. I'd be interested to know if there are good libraries (or best practices) for doing this kind of parallelism.
Latency hiding is a technique that applies when 1) most workloads decompose trivially into independent components that can be executed separately, and 2) you can infer or directly determine when any particular operation will stall. There are many ways to execute these high-latency operations asynchronously and without blocking, such that you can immediately work on some other part of the workload. The “latency-hiding” part is that the CPU is rarely stalled: whenever possible it switches to a part of the workload that is immediately runnable, so it is always doing real, constructive work. Latency hiding optimizes for throughput, maximizing utilization of the CPU, but potentially increases the latency of specific sub-operations by virtue of reordering the execution schedule to “hide” the latency of operations that would stall the processor. For many workloads, the latency of the sub-operations doesn’t matter, only the throughput of the total operation. The real advantage of latency-hiding architectures is that you can approach the theoretical IPC of the silicon in real software.
There are exotic CPU architectures explicitly designed for latency hiding, mostly used in supercomputing. The Cray/Tera MTA architecture is probably the canonical example, along with the original Xeon Phi. As a class, latency-hiding CPU architectures are sometimes referred to as “barrel processors”. In the case of the Cray MTA, the CPU can track 128 separate threads of execution in hardware and automatically switches to a thread that is immediately runnable at each clock cycle; thread coordination is effectively “free”. In software, switching between logical threads of execution is driven much more by inference, but it often yields huge gains in throughput. The only caveat is that you can’t ignore tail latencies in the design: a theoretically optimal latency-hiding architecture may defer execution of an operation indefinitely.
I don’t know much about supercomputers, but what you described is precisely how all modern GPUs deal with VRAM latency. Each core runs multiple threads, and the count is bounded by the resources used: the more registers and group-shared memory a shader uses, the fewer threads of that shader can be scheduled on the same core. The GPU then switches threads instead of waiting out that latency.
That’s how GPUs can saturate their RAM bandwidth, which exceeds 500 GB/second in modern high-end GPUs.
"Threads" are nothing but kernel-mode coroutines with purpose-built schedulers in the kernel.
Redoing the same machinery except in usermode is not the way to get performance.
The problem is that scripting languages don't let you access the kernel APIs cleanly due to various braindead design decisions - global locks in the garbage collector, etc.
But the solution isn't to rewrite the kernel in every scripting language, the solution is to learn to make scripting languages that aren't braindead.