I think you're downplaying the importance of first-class parallelism in Python land. I'd agree if you asserted that "95% of cases where CPU parallelism is actually used" are implemented by "bypassing the interpreter and the GIL entirely in hotpaths", but that's largely out of limitations and not because there was no further need.
There is plenty of Python code written where the authors didn't realise it was going to be CPU-critical or in hotpaths until it was too late. It's a small fraction of these cases where the authors have the fortune/opportunity to subsequently rewrite their code "via celery or the stdlib's multiprocess module" — sharing data in-memory is simply a lot easier than sharing across processes, and avoiding mutation of data structures is easier than writing serialisation code for them.