I could imagine FPGA designs might be competitive.
And dedicated ASICs would almost certainly beat both by a decent margin.
The main reason why we run this stuff on GPUs is their memory bandwidth, anyway.
Total ≈ 1085 gates for the 16-bit multiply. The reality is probably far more, because you're going to want carry-lookahead and pipelining.
Whereas 1-bit multiplies and adds into, say, a 16-bit accumulator use... 16 gates! (And probably half that, since you can likely use scheduling tricks to skip past the zeros, at the expense of variable latency.)
So if 1-bit math uses only ~1/100th of the silicon area of 16-bit math, and (according to this paper) gets the same results, the future is clearly silicon that can do 1-bit math.
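The reason the multiplier disappears: with weights restricted to {-1, 0, +1} (the ternary scheme these "1-bit" papers actually use), a multiply-accumulate collapses into add/subtract/skip. A toy sketch (names hypothetical, just to illustrate the point):

```python
def ternary_dot(activations, weights):
    """Dot product where every weight is -1, 0, or +1.

    No multiplier needed: +1 is an add, -1 is a subtract,
    and 0 is skipped entirely (the scheduling trick above,
    which trades silicon for variable latency).
    """
    acc = 0
    for a, w in zip(activations, weights):
        if w == 1:
            acc += a
        elif w == -1:
            acc -= a
        # w == 0: do nothing
    return acc

# Matches an ordinary multiply-accumulate:
a = [3, -2, 7, 1]
w = [1, 0, -1, 1]
assert ternary_dot(a, w) == sum(x * y for x, y in zip(a, w))  # -3
```

In hardware the accumulator is the only wide datapath left, which is where the ~16-gates-per-MAC figure comes from.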
- as most work is inference, we might not need as many GPUs
- consumer cards (24 GB) could possibly run the big models