If you make N copies of a model, train them independently for a little while on N machines, and average their weights back together, it sort of works. But not if you train for very long between averages: the copies' internal structure diverges, and the averaged weights stop corresponding to any coherent model.
How many parallel nodes you can train on, and for how long before averaging them back together, becomes an empirical engineering question. It's an expensive question to answer, since you have to train many variations just to get the data.
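A minimal sketch of the scheme being described, in the style of local SGD / federated averaging. This is an illustration, not anyone's production setup: the workers, shard sizes, learning rate, and the two knobs `LOCAL_K` (how long each copy trains independently) and `ROUNDS` (how many times the copies are averaged back together) are all made up for the example. On a simple quadratic problem the averaging works fine; the divergence problem described above only bites for deep nonlinear models trained apart for long stretches.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0])  # ground-truth weights for a toy linear model

def make_shard(n):
    # Each "machine" gets its own shard of data from the same distribution.
    X = rng.normal(size=(n, 2))
    y = X @ true_w + 0.01 * rng.normal(size=n)
    return X, y

def local_steps(w, X, y, k, lr=0.05):
    # k plain gradient-descent steps on squared loss, using only this shard.
    for _ in range(k):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

N_WORKERS, LOCAL_K, ROUNDS = 4, 10, 5  # hypothetical settings
shards = [make_shard(64) for _ in range(N_WORKERS)]

w = np.zeros(2)
for _ in range(ROUNDS):
    # Each worker starts from the shared weights and trains independently...
    local_ws = [local_steps(w.copy(), X, y, LOCAL_K) for X, y in shards]
    # ...then the N copies are averaged back together.
    w = np.mean(local_ws, axis=0)

print(w)  # converges toward true_w on this convex toy problem
```

The expensive empirical question from the text is exactly the `LOCAL_K` / `ROUNDS` trade-off: larger `LOCAL_K` means less communication, but (for real models, unlike this convex toy) more divergence between copies before each average.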