The other clear benefit of transformers over an architecture like RNNs (and what has probably made more of a difference imo) is that they're properly parallelizable: during training, every position in a sequence can be processed at once instead of one step at a time, which means you can do huge training runs in a fraction of the time. RNNs might be able to reach a level of coherence that approaches GPT-3, but with current hardware the training time would be prohibitive.
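To make the parallelism point concrete, here's a rough toy sketch. It's pure Python with scalar "hidden states", and the names `rnn_states`, `attend`, and `attention_outputs` are made up for illustration, not taken from any real library. The RNN loop has to run in order because each step needs the previous hidden state. The attention output for each position depends only on the inputs, so the positions can be computed in any order (or all at once on a GPU).

```python
import math

def rnn_states(xs, w=0.5, u=1.0):
    """Toy scalar RNN: h_t = tanh(w * h_{t-1} + u * x_t)."""
    h, states = 0.0, []
    for x in xs:  # step t needs h from step t-1, so this loop cannot be reordered
        h = math.tanh(w * h + u * x)
        states.append(h)
    return states

def attend(xs, t):
    """Toy causal self-attention for position t, using x itself as query, key and value."""
    scores = [xs[t] * xs[s] for s in range(t + 1)]  # position t only looks at s <= t
    m = max(scores)                                 # subtract the max for a stable softmax
    weights = [math.exp(sc - m) for sc in scores]
    total = sum(weights)
    return sum(wt * xs[s] for s, wt in enumerate(weights)) / total

def attention_outputs(xs, order=None):
    """Compute every position's output, visiting positions in any order."""
    order = range(len(xs)) if order is None else order
    out = [None] * len(xs)
    for t in order:  # no iteration reads another iteration's result
        out[t] = attend(xs, t)
    return out

xs = [0.1, -0.4, 0.7, 0.2]
forward = attention_outputs(xs)
backward = attention_outputs(xs, order=reversed(range(len(xs))))
assert forward == backward  # same result regardless of the order positions are visited
```

Because no iteration of the attention loop depends on another, a real implementation can replace it with a few large matrix multiplications over the whole sequence, which is what makes training on long sequences fast on GPUs. This only covers training; generating text token by token is still sequential for both architectures.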