* No DRAM or caches: everything is in SRAM, and every local SRAM load takes 1 cycle.
* Model parallelism alone gets you full performance; no need for data parallelism if you size to fit.
* Defects are handled in hardware; any latency differences are hidden and not in the load path anyway.
* Fully asynchronous/dataflow by default; you only need minimal synchronization between the forward and backward passes.
I genuinely don't know how you'd build a simpler system than this.
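
For anyone who wants the dataflow point made concrete, here is a toy sketch in plain Python. It is not the hardware's actual programming model: threads stand in for cores, queues stand in for the on-chip fabric, and `stage`, `run_pipeline`, and the three-layer example are hypothetical names made up for illustration. Each stage fires as soon as an input arrives, and the only synchronization is the end-of-stream marker.

```python
import queue
import threading

_DONE = object()  # sentinel marking end of stream (illustrative, not a hardware concept)


def stage(fn, inbox, outbox):
    """Apply fn to each item as soon as it arrives; no global clock or barrier."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # propagate shutdown to the next stage
            return
        outbox.put(fn(item))


def run_pipeline(fns, inputs):
    """Chain stages with FIFO queues, one thread per stage (a stand-in for one core per layer)."""
    queues = [queue.Queue() for _ in range(len(fns) + 1)]
    threads = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(fns)
    ]
    for t in threads:
        t.start()

    # Feed the first stage; downstream stages start working immediately.
    for x in inputs:
        queues[0].put(x)
    queues[0].put(_DONE)

    # Drain the last queue until the sentinel comes through.
    results = []
    while True:
        out = queues[-1].get()
        if out is _DONE:
            break
        results.append(out)

    for t in threads:
        t.join()
    return results


# Three hypothetical "layers", each pinned to its own stage (model parallel only).
layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
outputs = run_pipeline(layers, range(4))
print(outputs)  # [-1, 1, 3, 5]
```

Each layer lives on its own stage and data just streams through, which is the model-parallel-only setup from the list; nothing here needs a barrier between stages.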