It's pretty much just scale, whether in dataset size or parameter count. Before GPT-4, the general SOTA model wasn't even from OpenAI; it was Flan-PaLM from Google.
The attention in GPT-4 is a little different (probably some kind of flash attention), so that memory requirements for longer contexts are no longer quadratic in sequence length. But there's nothing to suggest the intellectual gains from 4 aren't just from bigger scale.
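To put rough numbers on the quadratic part, here's a back-of-the-envelope sketch. It assumes fp16 scores, fp32 running statistics, and a single attention head; the function names are made up for illustration, and none of this says anything about how GPT-4 is actually implemented. It only compares materializing the full score matrix with the per-row running statistics (max and sum) that a FlashAttention-style kernel keeps instead.

```python
def full_scores_bytes(seq_len: int, bytes_per_elem: int = 2) -> int:
    """Memory to materialize the full seq_len x seq_len score matrix (one head)."""
    return seq_len * seq_len * bytes_per_elem


def row_stats_bytes(seq_len: int, bytes_per_elem: int = 4) -> int:
    """Memory for two running statistics per row (softmax max and sum)."""
    return 2 * seq_len * bytes_per_elem


for n in (2048, 8192, 32768):
    full_mib = full_scores_bytes(n) / 2**20
    stats_kib = row_stats_bytes(n) / 2**10
    print(f"seq_len={n}: full scores {full_mib:.0f} MiB, row stats {stats_kib:.0f} KiB")
```

Going from an 8k to a 32k context multiplies the first number by 16 and the second by only 4, which is the whole point.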
Google could have made a 4 equivalent, I'm sure. It's not like there wasn't a road to take. We already knew 3 was severely undertrained, even from a compute-optimal perspective. And then, of course, you can just train on even more tokens to make these models even better.
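For a sense of how undertrained, here's a quick sketch. It assumes the roughly 20 tokens per parameter rule of thumb usually quoted from the Chinchilla paper and the common C ≈ 6·N·D estimate for training FLOPs; both are approximations, not exact laws. The GPT-3 figures (175B parameters, about 300B training tokens) are the ones reported in its paper, and the function names are mine.

```python
import math

TOKENS_PER_PARAM = 20  # assumed rule of thumb, not an exact constant


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: C ~= 6 * N * D."""
    return 6 * n_params * n_tokens


def compute_optimal_split(flops: float) -> tuple[float, float]:
    """Given a FLOP budget, return (params, tokens) with tokens = 20 * params.

    Substituting D = 20 * N into C = 6 * N * D gives C = 120 * N**2.
    """
    n_params = math.sqrt(flops / (6 * TOKENS_PER_PARAM))
    return n_params, TOKENS_PER_PARAM * n_params


gpt3_params = 175e9
gpt3_tokens = 300e9

# Tokens a 175B model would want under the heuristic, vs. what it saw.
wanted_tokens = TOKENS_PER_PARAM * gpt3_params
print(f"wanted {wanted_tokens / 1e12:.1f}T tokens, got {gpt3_tokens / 1e12:.1f}T")

# How the same compute budget would be split under the heuristic.
budget = training_flops(gpt3_params, gpt3_tokens)
opt_params, opt_tokens = compute_optimal_split(budget)
print(f"same budget: ~{opt_params / 1e9:.0f}B params on ~{opt_tokens / 1e12:.2f}T tokens")
```

Under those assumptions, a 175B model would want around 3.5T tokens, more than ten times what GPT-3 saw, and the same compute would have been better spent on a model of roughly 51B parameters trained on about 1T tokens.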