Ok, I was able to figure out a few more things. First of all, the benchmark runner does not necessarily exploit the parallelism of GOMAXPROCS out of the box. Second, GOMAXPROCS seems to default to 1x logical cores; I last ran it on a machine where logical cores = 2x physical cores. I adjusted the benchmark code to use RunParallel and adjusted the parallelism of each run.
Testing on Celeron J3455 @ 1.5 GHz (4 physical and logical cores) gave me PCG at 1.2 cpb and ChaCha8 at 2.6 cpb with cpu=1, but PCG stayed relatively constant across cpu=1,2,4,8 (worst was 1.8 cpb) while ChaCha8 slowed to 6.5 cpb at cpu=2 and 7.5 cpb at cpu=4 and cpu=8.
Back on my M1 Mac (8 logical and physical? cores), both ChaCha8 and PCG generally got better with more cores. ChaCha8 got down to 0.76 cpb at cpu=4 (then regressed a bit at cpu=8) while PCG got down to 0.26 cpb at cpu=8.
I don't think any of these results rule ChaCha8 out completely, though again I'm looking from the perspective of video games, which generally monopolize a machine while running.