It pains me to say this, as I'm a researcher from an institution without the huge resources of the big tech companies, so I can't compete in the pretrained model arms race (and also, it has made the field more boring, as creative solutions to problems become outperformed by approaches that just pile up more millions of parameters). But it's the truth. Although I think it will only be a stage of things: at some point, performance will plateau and we will need to put our minds to work again, rather than our GPUs.
A picture may be worth a thousand words, but you can send an entire book in the same amount of space as a single holiday snap at low resolution.
Historically, computationally expensive methods eventually become cheap. In the 1980s, researchers had access to Crays to develop physics models, graphics, etc., all requiring lots of floating-point math and memory. Meanwhile, on home computers, game programmers had to implement all their math in fixed point. Nowadays, game engines run the same algorithms that used to run on the Crays.
Same with learning. It's great to use tricks to make models fit on phones. Even better: use tricks to make training new models fit within the budget of a small academic research lab. That doesn't mean we should invalidate all the work that requires a huge cluster.
But are they? The example in the article describes an incremental improvement on a benchmark in exchange for a massive increase in training time.
Deep learning has achieved success on a number of tasks that computers had previously been unable to do. Since that initial period of success, it has been an area of debate whether deep learning has expanded its basic area of applicability or whether it has only improved incrementally on its initial achievements.
And if it is true that deep learning is stuck on just expanding what it's already doing, the next fundamental advance might come from one person with one machine rather than a massive team with a massive machine. Consider that neural nets as a theory had been around since the 1990s, if not the 1960s, but the fundamental advance of DL came when grad students could use GPUs in the 2010s, not when massively parallel machines came into existence (quite a bit earlier).
Here, the further wrinkle is that Moore's law is gradually ending. We won't have access to that much more computing power twenty years hence, so making less do more does make sense.
One thing that I can't help wondering, however sci-fi it sounds, is whether model simplifications like the ones in this post might lead to models humans can fully understand, which then might lead to new styles of traditional programming, opening up whole new ways of doing things.
> Given the power requirements per card, a back of the envelope estimate put the amount of energy used to train this model at over 3X the yearly energy consumption of the average American.
So what? Training the model is the hardest part; after that you just reuse the results.
> First, it hinders democratization. If we believe in a world where millions of engineers are going to use deep learning to make every application and device better, we won’t get there with massive models that take large amounts of time and money to train.
So what? I can't run a weather simulation on my laptop.
I doubt anyone is going to want to run a 33GB model on their phone.
> So what? I can't run a weather simulation on my laptop.
You only need to run the weather simulation once and then broadcast your forecast to everyone’s devices. You can’t do that with NLP. In order to be useful, NLP models need to run on different input data for every user. With a giant 33GB model, that means round-tripping to the data centre.
If you have to run everything in the cloud, your applications are limited. The cost is also very high, given that there are way more user devices than servers in the world. That means you need to build more data centres if you plan to run these giant models for every application you want to offer your users.
Why not? Many modern phones have upwards of 512GB of storage. 33 GB for a useful model seems entirely reasonable to me.
The training cost is more important than you think as well. Training a model normally requires tens or hundreds of experiments, meaning that we consume the carbon of 30 to 3k people, and the application of the model is typically narrow, so we end up doing 4 or 5 projects per year per group... meaning that a team could spend tens of thousands of people's worth of carbon to produce benefit in the tens of millions of dollars. I wonder if we can justify this at all?
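To make that back-of-the-envelope arithmetic concrete, here is a rough sketch. It assumes every experiment costs about as much as the single training run in the quoted estimate (roughly 3x the yearly energy use of an average American), which is probably an overestimate for small ablation runs. The function name and constants are mine and purely illustrative.

```python
# Assumption: one full training run ~ 3 "person-years" of energy,
# taken from the quoted estimate ("over 3X the yearly energy consumption").
PERSON_YEARS_PER_RUN = 3


def person_years(experiments, projects_per_year=1):
    """Person-years of energy for a number of full-cost experiments.

    Assumes (hypothetically) that every experiment costs as much as one
    full training run, and that every project needs the same number of
    experiments.
    """
    return PERSON_YEARS_PER_RUN * experiments * projects_per_year


if __name__ == "__main__":
    for experiments in (10, 100, 1000):
        per_project = person_years(experiments)
        low = person_years(experiments, projects_per_year=4)
        high = person_years(experiments, projects_per_year=5)
        print(
            f"{experiments:>5} experiments: {per_project:>5} person-years per project, "
            f"{low}-{high} per team per year"
        )
```

Under these assumptions, the 30 to 3k range corresponds to roughly 10 to 1000 experiments per project, and the tens-of-thousands figure per team only shows up at the high end with 4 or 5 projects a year.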
A great article on it: https://www.johndcook.com/blog/2011/06/21/how-to-fit-an-elep...
Sounds like particle colliders and Big Science in general.