Deep learning has a size problem

(heartbeat.fritz.ai)

131 points · jamesonthecrow · 6y ago · 45 comments

36 comments · 13 top-level

RocketSyntax6y ago· 6 in thread

Deep learning doesn't parallelize well. Would be cool if you could loan CPU cycles on your phone or home computers while at work.

pheug6y ago

Actually it parallelizes extremely well, so that large companies are able to create monster models like mentioned in the article in the first place by just throwing money at the problem with TPUs and similar highly parallelized accelerators. It just doesn't lend itself well to distributed computing due to e.g. throughput requirements.
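The distinction drawn above (easy to parallelize, hard to distribute over slow links) can be illustrated with a minimal sketch of data parallelism. The toy model `y = w * x`, the data, and the function names are assumptions made for illustration only, not anything from the article: each worker computes a gradient on its own shard of the batch, and the shard gradients are averaged. With equal-sized shards the average matches the full-batch gradient.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error for the 1-D model y = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def data_parallel_grad(w, xs, ys, num_workers):
    """Split the batch into equal shards, compute each shard's gradient
    independently, then average the results (the all-reduce step)."""
    shard = len(xs) // num_workers  # assumes len(xs) is divisible by num_workers
    grads = [
        grad_mse(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(num_workers)
    ]
    return sum(grads) / num_workers


# Hypothetical toy data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 16.2]

full = grad_mse(0.5, xs, ys)
sharded = data_parallel_grad(0.5, xs, ys, num_workers=4)
print(abs(full - sharded) < 1e-9)
```

The averaging step is where the throughput requirement comes from: every training step exchanges one gradient value per parameter between workers, which is cheap inside an accelerator pod and expensive over a home internet connection.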

RocketSyntax6y ago

That's just vertical scale. Distributed is what I was referring to. See comment below.

question_away6y ago

In what way does it not parallelize well? There are mounds of research in federated learning.

kyle_grove6y ago

In fact, one of the chief advantages of the BERT/Transformer architecture over ELMO/LSTM is the ability to parallelize.

RocketSyntax6y ago

I've read that you can't split up large layers to be trained on separate processors either horizontally (one layer per processor) or vertically (parts of many layers).
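For what it's worth, the arithmetic of a single dense layer does allow a split across workers. The sketch below is a toy illustration under assumed names and values (nothing here comes from the article or from any particular framework): the weight matrix is cut into column blocks, each hypothetical worker multiplies the same input by its own block, and the partial outputs are concatenated.

```python
def matmul(x, w):
    """Multiply a row vector x (length n) by a matrix w (n rows, m columns)."""
    cols = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(cols)]


def split_columns(w, parts):
    """Cut w into `parts` column blocks of equal width, one per worker."""
    width = len(w[0]) // parts  # assumes the column count is divisible by parts
    return [[row[p * width:(p + 1) * width] for row in w] for p in range(parts)]


def sharded_layer(x, w, parts):
    """Each worker multiplies x by its own column block; outputs are concatenated."""
    out = []
    for block in split_columns(w, parts):
        out.extend(matmul(x, block))
    return out


# Hypothetical 3-input, 4-output layer.
x = [1.0, 2.0, 3.0]
w = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 3.0],
     [2.0, 1.0, 0.0, 1.0]]

print(matmul(x, w))            # [7.0, 5.0, 4.0, 10.0]
print(sharded_layer(x, w, 2))  # [7.0, 5.0, 4.0, 10.0]
```

This only shows that the math splits cleanly. It says nothing about the communication cost of moving inputs and partial outputs between processors, which is the practical obstacle raised elsewhere in the thread.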

bitL6y ago

RNNs (LSTM/GRU) tend to have issues with scaling. Attention-based models like Transformer on the other hand scale extremely well.

buboard6y ago· 5 in thread

The article starts with NLP models and then mentions the successes of increasingly smaller vision models. NLP seems to be an outlier in increasingly becoming a pissing contest. The models are too big and not particularly useful. OpenAI spread FUD about their model, but after its release it's rather underwhelming. Yeah, you can output some text that's readable and paraphrases Reddit, but what about understanding, intention, doing actual useful stuff with text? Hallucinating text in itself isn't interesting. It seems this line of NLP with transformers has hit some kind of dead end and they are trying to brute-force the next breakthrough; doubtful that this will happen, though. And then we have bizarre decisions like Microsoft releasing DialoGPT yesterday without including a generation script because "it might be racist". This whole thing seems more like marketing than research.

Al-Khwarizmi6y ago

Large transformer-based models like BERT and its ilk are not only useful to hallucinate text. They have achieved measurable improvements in various (although not all) classic NLP tasks, such as parsing, entailment recognition or question answering. Google has reportedly used BERT to improve their search algorithm, so indeed it's being used to do "actual useful stuff with text".

It pains me to say this, as I'm a researcher from an institution without the huge resources of the big tech companies, so I can't compete in the pretrained model arms race (and also, it has made the field more boring, as creative solutions to problems become outperformed by approaches that just pile up more millions of parameters). But it's the truth. Although I think it will only be a stage of things: at some point, performance will plateau and we will need to put our minds to work again, rather than our GPUs.

buboard6y ago

Google seemed to make a genuine effort with BERT to build a model that is useful rather than record-breaking. But I think it's wrong to consider it the "final" model upon which everything else will be built.

ivalm6y ago

As someone who was able to generate a model for production based on BERT that outperformed all our previous attempts, I have to say transformers really are a game changer. They are not the be-all and end-all, but they are really, really good as the basis for many different classification tasks.

hnaccy6y ago

Any tips in terms of taking a BERT-style model to production?

joshvm6y ago

There is at least one simple reason for obsessing over efficiency for computer vision models. It takes a lot of bandwidth to transmit an image (even a small one) over the air, whereas text is cheap.

A picture may be worth a thousand words, but you can send an entire book in the same amount of space as a single holiday snap at low resolution.

blt6y ago· 5 in thread

IMO this is not a problem. The people building insanely huge models are expanding the set of tasks that can be done by a computer. Who cares how much memory it takes?

Historically, computationally expensive methods eventually become cheap. In the 1980s, researchers had access to Crays to develop physics models, graphics, etc., requiring lots of floating point math and memory. Meanwhile, for home computers, game programmers had to implement all their math in fixed point. Nowadays, game engines run the same algorithms that were running on the Crays before.

Same with learning. It's great to use tricks to make models fit on phones. Even better: use tricks to make training new models within the budget of a small academic research lab. That doesn't mean we should invalidate all the work that requires a huge cluster.

joe_the_user6y ago

> IMO this is not a problem. The people building insanely huge models are expanding the set of tasks that can be done by a computer. Who cares how much memory it takes?

But are they? The example in the article describes an incremental improvement on a benchmark in exchange for a massive increase in training time.

Deep learning has achieved success on a number of tasks that computers had previously been unable to do. Since that initial period of success, it has been an area of debate whether deep learning has expanded its basic area of applicability or has only improved incrementally on its initial achievements.

And if it is true that deep learning is stuck on just expanding what it's already doing, it might be that the next fundamental advance comes from one person with one machine rather than a massive team with a massive machine. Consider that neural nets as a theory had been around since the 1990s if not the 1960s, but the fundamental advantage of DL came when grad students could use GPUs in the 2010s, not when massively parallel machines came into existence (quite a bit earlier).

Here, the further wrinkle is that Moore's law is gradually ending. We won't have access to that much more computing power twenty years hence, so making less do more does make sense.

Izkata6y ago

> And if it is true that deep learning is stuck on just expanding what it's already doing, it might be that the next fundamental advance comes from one person with one machine rather than a massive team with a massive machine. Consider that neural nets as a theory had been around since the 1990s if not the 1960s, but the fundamental advantage of DL came when grad students could use GPUs in the 2010s, not when massively parallel machines came into existence (quite a bit earlier).

One thing that I can't help wondering, however sci-fi it sounds, is if model simplifications like in this post might lead to models humans can fully understand, which then might lead to new styles of traditional programing - opening up whole new ways of doing things.

acollins13316y ago

I disagree. There are lots of advancements that DL has yet to fully realize with even the current technology. You're focused on commercial applications but applying neural network models, especially CV models to many types of scientific research has yet to be explored due to lack of funding.

KON_Air6y ago

I find this weird too; the question of "miniaturization" should come after the theoretical stage is satisfied. Is this coming from a line of thinking where capitalistic sense avoids high costs, or from a strict design sensibility where optimization is a primary concern? The nuance is tiny but very important.

ladberg6y ago

I agree, but the main reason why "miniaturization" exists is that it can be done in parallel with theoretical developments and allows you to make money off the results (therefore funding more R&D).

galkk6y ago· 3 in thread

I never understand such remarks:

> Given the power requirements per card, a back of the envelope estimate put the amount of energy used to train this model at over 3X the yearly energy consumption of the average American.

So what? Training model is the hardest part, then you just reuse results

> First, it hinders democratization. If we believe in a world where millions of engineers are going to use deep learning to make every application and device better, we won’t get there with massive models that take large amounts of time and money to train.

So what? I can't run weather simulation on my laptop.

chongli6y ago

> So what? Training model is the hardest part, then you just reuse results

I doubt anyone is going to want to run a 33GB model on their phone.

> So what? I can't run weather simulation on my laptop.

You only need to run the weather simulation once and then broadcast your forecast to everyone’s devices. You can’t do that with NLP. In order to be useful, NLP models need to run on different input data for every user. With a giant 33GB model, that means round-tripping to the data centre.

If you have to run everything in the cloud, your applications are limited. The cost is also very high, given that there are way more user devices than servers in the world. That means you need to build more data centres if you plan to run these giant models for every application you want to offer your users.

phoboslab6y ago

> I doubt anyone is going to want to run a 33GB model on their phone.

Why not? Many modern phones have upwards of 512GB of storage. 33 GB for a useful model seems entirely reasonable to me.

sgt1016y ago

What are the applications of deep learning that look like weather simulations (as in one run -> results to 10m people?) In my experience deep learning systems are aimed at applications that are single use 1 run -> 1 person.

The training cost is more important than you think as well. To train a model normally requires 10's or 100's of experiments, meaning that we are consuming 30 -> 3k people's carbon, and the application of the model is typically narrow, so we end up doing 4 or 5 projects per year per group... meaning that we could spend 10's of k carbon per team to produce $10m's benefit. I wonder if we can justify this at all?
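The arithmetic behind that range can be sketched as follows. It assumes the article's "over 3X the yearly energy consumption of the average American" is taken as exactly 3 person-years per training run, and that every experiment costs as much as a full run (likely pessimistic, since many experiments are smaller). Both are simplifying assumptions, and the function name is hypothetical.

```python
# Assumption: one full training run costs 3 person-years of energy,
# rounding the article's "over 3X" figure down to exactly 3.
PERSON_YEARS_PER_RUN = 3


def person_years(num_runs, per_run=PERSON_YEARS_PER_RUN):
    """Total energy, in person-years, for num_runs full-cost training runs."""
    return num_runs * per_run


for runs in (10, 100, 1000):
    print(runs, "runs ->", person_years(runs), "person-years")
```

Under these assumptions, tens to hundreds of runs give 30 to 300 person-years; the 3k upper figure corresponds to roughly a thousand runs.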

phkahler6y ago· 2 in thread

Seems to focus on reducing the size of existing models through optimization. Better would be to find ways to train smaller models to start with. Still interesting.

rytill6y ago

why would that be better?

gumby6y ago

Compressing it means it may take less storage, but not having to look at it in the first place is the win. It simply takes time to process all the data. Less data: faster computation.

pixelpoet6y ago· 1 in thread

How did they use an elephant as cover image without mentioning von Neumann's famous and relevant quote: "With four parameters I can fit an elephant, and with five I can make him wiggle his trunk."

A great article on it: https://www.johndcook.com/blog/2011/06/21/how-to-fit-an-elep...

RocketSyntax6y ago

That came up in The Dream Machine! Reading it now.

bitL6y ago· 1 in thread

We are already past the point of no return. RTX 8000 is now an entry-level GPU that allows training some of the latest NLP models. Attention is spreading over to computer vision models as well, so one could expect memory bloat coming there quickly. Only large companies that can deploy thousands of GPUs in parallel will be able to compete.

latchkey6y ago

I am working on it... (well, the company I work for)... except instead of thousands... it is hundreds of thousands.

visarga6y ago

It's only a problem if you want to train a large model (NLP or CV) from scratch, not if you want to fine-tune it for a related task, so one trained model can be reused many times. In general, training data is scarce; it is abundant in only a few situations.

gok6y ago

The MegatronLM example is a weird one. Neural network language models are replacing n-gram language models that grow to several terabytes for SotA results; 8 billion parameters is tiny by comparison.
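As a rough check on the sizes being compared, the storage for a dense model's weights is just the parameter count times the bytes per parameter. The sketch below uses the 8 billion figure from the comment; the function name and the choice of precisions are illustrative assumptions, and it ignores optimizer state, activations, and file-format overhead.

```python
def model_size_gb(num_params, bytes_per_param):
    """Size of the raw weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


params = 8_000_000_000  # "8 billion parameters" from the comment above

fp32 = model_size_gb(params, 4)  # 32-bit floats
fp16 = model_size_gb(params, 2)  # 16-bit floats
int8 = model_size_gb(params, 1)  # 8-bit quantized weights

print(fp32, fp16, int8)  # 32.0 16.0 8.0
```

At 32-bit precision this lands close to the 33GB figure quoted elsewhere in the thread, and it is about two orders of magnitude below the "several terabytes" cited for n-gram models.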

boyadjian6y ago

Size matters. If you want an intelligent neural network, you need some watts. There is nothing astonishing in that. It is also because of constant progress in hardware performance that deep learning has become what it is.

cellular6y ago

I just hope Hinton finishes his Hinton Network idea that is supposed to replace these NNs.

jgalt2126y ago

> I don’t mean to single out this particular project. There are many examples of massive models being trained to achieve ever-so-slightly higher accuracy on various benchmarks.

Sounds like particle colliders and Big Science in general.

RocketSyntax6y ago

It doesn't parallelize well. Would be cool if you could loan CPU cycles on your phone or home computers while at work.
