If you make N copies of a model, train them independently for a little while on N machines, and average their weights back together, it sort of works. But not if you train for very long between averages: the copies' internal structure diverges, and the averaged weights stop corresponding to any coherent model.
How many parallel nodes you can train on, and for how long before averaging them back together, becomes an empirical engineering question. It's an expensive question to answer, since you have to train many variations just to get the data.
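A minimal sketch of the scheme being described, in the style of local SGD / federated averaging. This is an illustration, not anyone's production setup: the workers, shard sizes, learning rate, and the two knobs `LOCAL_K` (how long each copy trains independently) and `ROUNDS` (how many times the copies are averaged back together) are all made up for the example. On a simple quadratic problem the averaging works fine; the divergence problem described above only bites for deep nonlinear models trained apart for long stretches.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0])  # ground-truth weights for a toy linear model

def make_shard(n):
    # Each "machine" gets its own shard of data from the same distribution.
    X = rng.normal(size=(n, 2))
    y = X @ true_w + 0.01 * rng.normal(size=n)
    return X, y

def local_steps(w, X, y, k, lr=0.05):
    # k plain gradient-descent steps on squared loss, using only this shard.
    for _ in range(k):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

N_WORKERS, LOCAL_K, ROUNDS = 4, 10, 5  # hypothetical settings
shards = [make_shard(64) for _ in range(N_WORKERS)]

w = np.zeros(2)
for _ in range(ROUNDS):
    # Each worker starts from the shared weights and trains independently...
    local_ws = [local_steps(w.copy(), X, y, LOCAL_K) for X, y in shards]
    # ...then the N copies are averaged back together.
    w = np.mean(local_ws, axis=0)

print(w)  # converges toward true_w on this convex toy problem
```

The expensive empirical question from the text is exactly the `LOCAL_K` / `ROUNDS` trade-off: larger `LOCAL_K` means less communication, but (for real models, unlike this convex toy) more divergence between copies before each average.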