I mentioned that the sizes of the models are relatively small (13B max). Is that an inherent limitation, or is training a bigger model possible and it just has not been done in this exercise?
1) Standard AI/ML scaling assumptions still apply on this hardware.
2) They have a starting point for hyper-parameter estimation and can get better results sooner.
The use of μ (mu) as a sort of… pun acronym thing is pretty clever, nice one.
I can only see Cerebras being an acquisition target, if they continue releasing their AI models out there. The value in Cerebras is their AI accelerator hardware and O̶p̶e̶n̶AI.com certainly needs that, since that is where the money is.
Is muP only used to get around choosing a learning rate for each size? I wonder how it compares to standard heuristics like the one in the OG scaling laws paper by OpenAI, and to tricks like rewinding a few steps after a loss explosion.
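For anyone unfamiliar with the rewinding trick, here is a minimal sketch of the idea, not what Cerebras or OpenAI actually run. Everything in it is an assumption for illustration: `step_fn` is a hypothetical training step that returns the updated state and a loss, and `spike_factor` and `checkpoint_every` are made-up knobs. On a loss spike it restores the last saved state and skips the offending batch.

```python
import copy
import math


def train_with_rewind(step_fn, state, num_batches,
                      spike_factor=2.0, checkpoint_every=5):
    """Run step_fn over batches, rewinding to the last checkpoint on a loss spike.

    step_fn(state, batch_idx) -> (new_state, loss) is a hypothetical training step.
    Returns the final state and the number of rewinds performed.
    """
    # Checkpoint holds the saved state and the loss observed when it was saved.
    checkpoint = (copy.deepcopy(state), None)
    last_good_loss = None
    good_steps = 0
    rewinds = 0

    for batch_idx in range(num_batches):
        new_state, loss = step_fn(state, batch_idx)

        # Treat NaN, or a jump well above the last accepted loss, as an explosion.
        spiked = math.isnan(loss) or (
            last_good_loss is not None and loss > spike_factor * last_good_loss
        )
        if spiked:
            # Restore the saved state and move on, skipping the offending batch.
            state = copy.deepcopy(checkpoint[0])
            last_good_loss = checkpoint[1]
            rewinds += 1
            continue

        state = new_state
        last_good_loss = loss
        good_steps += 1
        if good_steps % checkpoint_every == 0:
            checkpoint = (copy.deepcopy(state), loss)

    return state, rewinds
```

The design choice worth noting is that the weights are rewound but the data position is not, so the run does not deterministically hit the same bad batch again. Real setups often also lower the learning rate after a rewind, which this sketch leaves out.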
For some reason muP was not trusted for the largest training runs? Why is that?