Training is compute-bound, not memory-bandwidth-bound. That is how Cerebras is able to train with external DRAM that provides only 150 GB/s of memory bandwidth.
They are training models that need terabytes of RAM with only 150 GB/s of memory bandwidth. That is only possible because the workload is compute-bound. If you think it is memory-bandwidth-bound, please explain which algorithms are bandwidth-bound and why.
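To make the arithmetic concrete, here is a rough roofline-style sketch. It assumes a single dense layer whose weights are streamed in from external memory once per step and whose gradients are streamed back out, with activations staying on-chip. The 150 GB/s figure is the one quoted above; the peak FLOP/s value, the 2-byte parameter and gradient sizes, and the function names are hypothetical placeholders, not vendor specifications. It uses the usual estimate of about 6 FLOPs per weight per token for forward plus backward.

```python
def flops_per_weight_byte(tokens_per_step, bytes_per_param=2, grad_bytes_per_param=2):
    """FLOPs done per byte of external-memory traffic for one dense layer.

    Assumes roughly 6 FLOPs per weight per token (2 forward, 4 backward),
    and that each weight is read once and its gradient written once per step.
    """
    flops_per_param = 6 * tokens_per_step
    traffic_per_param = bytes_per_param + grad_bytes_per_param
    return flops_per_param / traffic_per_param


def min_tokens_for_compute_bound(peak_flops, bandwidth_bytes_per_s,
                                 bytes_per_param=2, grad_bytes_per_param=2):
    """Tokens per step above which the layer is compute-bound in this model."""
    ridge = peak_flops / bandwidth_bytes_per_s  # FLOPs per byte at the roofline ridge
    traffic_per_param = bytes_per_param + grad_bytes_per_param
    return ridge * traffic_per_param / 6


BANDWIDTH = 150e9   # bytes/s, the figure quoted in the comment
PEAK_FLOPS = 1e15   # FLOP/s, hypothetical accelerator peak (assumption)

for tokens in (1, 1024, 65536):
    intensity = flops_per_weight_byte(tokens)
    feedable = intensity * BANDWIDTH  # FLOP/s the memory system can keep fed
    bound = "compute" if feedable >= PEAK_FLOPS else "bandwidth"
    print(f"{tokens:>6} tokens/step: {intensity:>9.1f} FLOPs/byte -> {bound}-bound")

print(f"break-even: ~{min_tokens_for_compute_bound(PEAK_FLOPS, BANDWIDTH):.0f} tokens/step")
```

Under these assumptions the outcome depends on how many tokens reuse each weight per step: with a single token the layer is bandwidth-bound, while with tens of thousands of tokens per step the same 150 GB/s is enough to keep the compute busy.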