I am not familiar with that work.
> where small enough subsets can be chosen and disparate workers broadcast the small changes at small enough intervals that the net gain in learnings is still larger than the loss in fit due to de-cohesion
I think this really depends on the terrain of your loss landscape. My intuition is that many are too spiky: if each worker takes a step or two on its own subset and you then average the resulting parameters, you can end up on a steep hill rather than in the valley between the two points. A toy sketch of that failure mode is below.
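Here is a minimal illustration of that intuition, assuming a made-up 1-D double-well loss `(w^2 - 1)^2` standing in for a spiky landscape; the two "workers" are just two parameter copies pulled toward different valleys, which is a stand-in for gradient steps on disjoint data subsets, not any particular distributed-training scheme:

```python
import numpy as np

# Toy non-convex loss with valleys at w = -1 and w = +1
# and a peak at w = 0.
def loss(w):
    return (w**2 - 1) ** 2

def grad(w):
    return 4 * w * (w**2 - 1)

lr = 0.1
# Two workers whose subsets happen to pull them toward
# different valleys (simulated here via different starts).
w_a, w_b = -0.5, 0.5

for _ in range(2):  # "a step or two" on each subset
    w_a -= lr * grad(w_a)
    w_b -= lr * grad(w_b)

w_avg = (w_a + w_b) / 2  # naive parameter averaging

print(f"worker A:  w={w_a:+.3f}, loss={loss(w_a):.3f}")
print(f"worker B:  w={w_b:+.3f}, loss={loss(w_b):.3f}")
print(f"averaged:  w={w_avg:+.3f}, loss={loss(w_avg):.3f}")
```

Both workers end up near a valley (loss ≈ 0.13 each), but their average sits exactly on the peak at w = 0 with loss 1.0, worse than either worker alone. Averaging more often, before the workers have diverged toward different basins, is one way schemes like this try to avoid that.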
But this is an active area of research for sure.