2.7bn parameters (for the smaller model) means on the order of 2.7bn calculations for a single forward pass. You could fit the model in main memory, but how long will it take to run all those calculations on a CPU? And the full model has to run once per generated token, so many times over to output a single sentence.
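A back-of-envelope calculation makes the point concrete. The numbers here are assumptions, not measurements: roughly 2 FLOPs per parameter per token (a multiply and an add), and an illustrative sustained CPU throughput of 20 GFLOP/s.

```python
# Rough time per generated token on a CPU.
# Assumptions (illustrative, not measured): ~2 FLOPs per parameter per token,
# ~20 GFLOP/s of sustained CPU throughput, ~30 tokens in a sentence.
params = 2.7e9
flops_per_token = 2 * params              # ~5.4 GFLOP per token
cpu_flops_per_sec = 20e9                  # assumed sustained rate
seconds_per_token = flops_per_token / cpu_flops_per_sec
tokens_per_sentence = 30
print(f"{seconds_per_token:.2f}s per token, "
      f"~{seconds_per_token * tokens_per_sentence:.0f}s per sentence")
```

Even under these generous assumptions you're waiting several seconds for one sentence, which is why this class of model effectively demands a GPU.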
The researchers here appear to have placed particular emphasis on cleaning up what the model is spitting out, but I think it's lipstick on a pig. The area begging for more research is parsing out the meaning of anything but the simplest sentences.
This is not that much different from what you do.
What criteria would you use to determine whether something understands the meaning of a word/phrase/concept, other than a string of definitions and metaphors? And what level of it would be sufficient?
Attempting to prove that something "understands the meaning" is a fruitless task with no quantifiable criteria - much like proving something is "conscious."
So meaning in this sense is very much quantifiable, yet how far along are we in parsing out even the most basic meanings? Can we build something that discriminates between "I'm moving in to <address>" and "I'm moving in on <date>" using the latest and greatest word embeddings? Not without extra layers of external rules imposed on top. So the model does not 'understand', even in this limited sense of understanding.
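To make the embeddings point concrete, here's a toy sketch. The hashed pseudo-random vectors stand in for a trained embedding table, and the similarity numbers are purely illustrative, but the geometry is the same: the two "moving in" sentences share most of their surface tokens, so order-free vector averaging makes them look far more alike than either is to an unrelated sentence, while the one preposition carrying the where-vs-when distinction is just one of five equally weighted vectors.

```python
import hashlib
import numpy as np

def word_vec(word, dim=300):
    # Deterministic pseudo-random vector per word: a toy stand-in for a real
    # embedding table, just to make the geometry of averaging concrete.
    seed = int(hashlib.md5(word.lower().encode()).hexdigest(), 16) % (2**32)
    return np.random.default_rng(seed).standard_normal(dim)

def sentence_vec(sentence):
    # Bag-of-words average: word order and attachment structure are thrown away.
    return np.mean([word_vec(w) for w in sentence.split()], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

moving_to = sentence_vec("I'm moving in to <address>")   # a WHERE relation
moving_on = sentence_vec("I'm moving in on <date>")      # a WHEN relation
unrelated = sentence_vec("the stock market closed slightly higher today")

sim_pair = cosine(moving_to, moving_on)   # high: 3 of 5 surface tokens shared
sim_far = cosine(moving_to, unrelated)    # low: no tokens shared
print(f"to/on pair: {sim_pair:.2f}  vs unrelated: {sim_far:.2f}")
```

With real pre-trained vectors the effect is arguably worse, since common prepositions like "to" and "on" tend to land close together in embedding space. Picking out the relation needs structure layered on top - a parse, or hand-written rules - which is exactly the point.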
Don't be fooled by the sentence recycling, is all.