The compute and data moats are dead

(smerity.com)

82 points · Smerity · 7y ago · 23 comments

19 comments · 7 top-level

korethr · 7y ago · 6 in thread

Okay, I have a question about one of his assertions here:

> What may take a cluster to compute one year takes a consumer machine the next.

Is that not partly because the hardware is ever improving? I realize this is a bit of exaggeration, but does not yesterday's cluster end up fitting onto the die of tomorrow's GPU? And then since it's all on a single die, is not the overhead of the interconnect drastically reduced? It takes less time to push information to the next core over when the interconnect is a couple micrometers of silicon instead of the couple meters of silicon, copper, and fiber needed when the next core is in the next rack over.

Certainly improving the model will help; who hasn't marvelled at how much better his code ran when he fixed that O(n^2) hot spot? But I can't help but think improving hardware plays a role too.

Am I off base here?
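For readers who have not met the "O(n^2) hot spot" experience mentioned above, here is a minimal sketch of that kind of algorithmic fix. The duplicate-detection task and both function names are hypothetical, chosen only for illustration; they do not come from the article or the thread.

```python
def has_duplicates_quadratic(items):
    """Compare every pair of items: O(n^2) comparisons in the worst case."""
    n = len(items)
    for i in range(n):
        for j in range(i + 1, n):
            if items[i] == items[j]:
                return True
    return False


def has_duplicates_linear(items):
    """Remember seen values in a set: O(n) expected time for hashable items."""
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False
```

Both functions return the same answers; only the amount of work differs, and on the same hardware that gap grows with the input size.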

yetihehe · 7y ago

Not off base, but most of this "one year effect" is really improved algorithms, not better hardware. Hardware doesn't improve that fast (1000x in one year).

heavenlyblue · 7y ago

One year effect? What exactly are they speaking about?

The only reason deep learning exists is because by now we've finally learned how to build GPUs fast enough to run the algorithm invented in 1983.

And let's be honest - most of the current, state-of-the-art algorithms only work today because they've got access to scaled up massive databases of data. You don't really need to be as smart here any more.

MR4D · 7y ago

The author specifically states a hardware example:

Neural Architecture Search: "32,400-43,200 GPU hours" Just over a year later: "single Nvidia GTX 1080Ti GPU, the search for architectures takes less than 16 hours" (1000x less) (paper)
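As a quick check on the quoted figures, the sketch below computes the raw ratio of GPU hours. It assumes the two numbers are directly comparable, which is a simplification: the quotes do not say the same GPU model was used in both cases, and "less than 16 hours" is an upper bound, so the ratios are lower bounds. The variable and function names are illustrative only.

```python
# Figures quoted in the comment above.
ORIGINAL_GPU_HOURS = (32_400, 43_200)  # reported range for the original search
LATER_GPU_HOURS = 16                   # "less than 16 hours" on a single GPU


def reduction_factor(before_hours, after_hours):
    """How many times fewer GPU hours the later search needs."""
    return before_hours / after_hours


low = reduction_factor(ORIGINAL_GPU_HOURS[0], LATER_GPU_HOURS)
high = reduction_factor(ORIGINAL_GPU_HOURS[1], LATER_GPU_HOURS)
print(f"{low:.0f}x to {high:.0f}x fewer GPU hours")
```

Taken at face value, the quoted numbers give a reduction of roughly 2000x to 2700x in GPU hours, so "1000x" is best read as an order-of-magnitude figure.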

jfoutz · 7y ago

I think there was a paper about that in the 90s. If your program requires more than X years of supercomputer time, you will get your answer faster by waiting n years before you start executing.

If I remember correctly it was X=5 and n=2.

Moore’s law was magical
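The waiting argument can be sketched with a simple model. This is not the model from the paper the comment recalls (which is not identified here); it assumes hardware speed doubles every 18 months, and `doubling_years` and both function names are hypothetical.

```python
import math


def time_to_answer(work_years, wait_years, doubling_years=1.5):
    """Elapsed years if you wait, then run on hardware that got faster meanwhile.

    Assumes speed doubles every `doubling_years` (an assumption, not a measurement).
    """
    speedup = 2 ** (wait_years / doubling_years)
    return wait_years + work_years / speedup


def best_wait(work_years, doubling_years=1.5):
    """Wait (in years) that minimizes time_to_answer; 0.0 if starting now is best.

    Found by setting the derivative of time_to_answer with respect to
    wait_years to zero.
    """
    ratio = work_years * math.log(2) / doubling_years
    return max(0.0, doubling_years * math.log2(ratio))


print(time_to_answer(5, 0))  # start today
print(time_to_answer(5, 2))  # wait two years first
print(best_wait(5))          # optimal wait under this model
```

Under these assumptions a five-year job finishes sooner if you wait two years before starting, and waiting only pays off for jobs longer than about `doubling_years / ln 2` years.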

heavenlyblue · 7y ago

Yes, you are.

Most of the hard problems in computer science are problems of highly branched connectivity (both in terms of the number of connections that are made from the compute node and in terms of the branching required in the algorithms).

E.g. 1 (branching): if it takes you 100 ms to send 1Gb of data over the network, but you don't have to wait for its result, then latency does not matter at all.

E.g. 2 (number of connections to the node): if you have a matrix value that depends on a full row in a matrix, this matrix value somehow needs to be computationally connected to that full row. Cluster or GPU, these physical connections need to be made either through Ethernet or by reading values from RAM. Again, the latency does not matter as long as you can pipeline these.

So frankly, no. It's not the micrometers on the silicon that matter; it's just that our hardware has been scaled so much that a 50-year-old compute cluster can literally be scaled down to a single die. It's _not_ more efficient, though.
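The pipelining claim above can be illustrated with a toy timing model. The numbers and function names are made up for illustration, and the model assumes the messages are independent of each other; when each step needs the previous result, the serial formula applies and link latency matters again.

```python
def serial_time(num_messages, latency_s, transfer_s):
    """Each message waits for the previous one to arrive before being sent."""
    return num_messages * (latency_s + transfer_s)


def pipelined_time(num_messages, latency_s, transfer_s):
    """Messages are sent back to back, so the link latency is paid only once."""
    return latency_s + num_messages * transfer_s


# Hypothetical link: 100 ms latency, 10 ms of transfer time per message.
print(serial_time(100, 0.1, 0.01))
print(pipelined_time(100, 0.1, 0.01))
```

With these made-up numbers, pipelining cuts the total from about 11 seconds to about 1.1 seconds without touching the latency itself.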

twtw · 7y ago

> be made either through Ethernet or by reading values from RAM

Ok? This is not a rebuttal of the parent comment at all. You could also read and write the data on notecards and compute with a mechanical calculator, but that's totally irrelevant to the parent's point. Hardware matters a lot. Other things certainly matter as well, but saying the structure of the computation is fixed doesn't mean hardware can't change the performance (in absolute perf, perf/W, and perf/$) by many orders of magnitude.

ThePhysicist · 7y ago · 4 in thread

At the PyData DE I just saw an excellent talk about GANs and data augmentation in image recognition:

https://www.slideshare.net/FlorianWilhelm2/performance-evalu...

The authors were able to outperform Google ML by a large margin on a vision task that involved recognizing numbers from car registration documents. With just 160 manually collected training samples they were able to train a neural net that could recognize characters with 99.7% accuracy. Google ML performed very poorly in comparison, which I found very surprising because it didn't seem to be such a hard recognition task (clean, machine-written characters on a structured, green background).

QML · 7y ago

Isn’t this just the no free lunch theorem? Should you be expecting a more general framework to beat a specially trained algorithm?

Another concern is generality: just because it performs well on this dataset does not mean it will perform well on another.

ThePhysicist · 7y ago

It really doesn't matter that the model will probably not perform well on another dataset, as it was built for a specific task.

It's also about flexibility: if Google ML doesn't provide you with a way to train their algorithms specifically for your use case, it won't help you that they work well for generic text recognition tasks.

deelowe · 7y ago

Isn't that a bit like a synthetic benchmark though? This reads to me like those MongoDB vs MySQL comparisons that were made 7 years ago where they compared object store efficiencies for the two.

ThePhysicist · 7y ago

I think it shows nicely that you can beat a "professional" ML toolkit for a specific problem.

And most people don't want to solve many different ML problems but just a single one, so I think the result is quite interesting.

stcredzero · 7y ago · 1 in thread

Moats and walls are pretty good analogies here. The magnitude of the barrier provided by a moat was considerable in the 1300s. By the late 20th century, it was far less.

https://www.youtube.com/watch?v=bWMrY49qqDw

In the 11th century, a wooden palisade or an earthen berm fortification could be held for something like a half year. By the end of WWII, it constituted a delaying tactic.

https://en.wikipedia.org/wiki/Rhino_tank

A phase change happened with military tactics in the lead-up to the 1st half of the 20th century, where the power of mobile mechanized armor and air support greatly reduced the value of fortifications.

That said, I don't think moats are dead. It's just that the time-scales have changed.

blihp · 7y ago

The timescales of the moats provided by the actual technology in the digital realm have always been short. That's why you see so much time and money spent on broadening and extending copyright and patents... that's the long-term moat.

aub3bhat · 7y ago · 1 in thread

I think you are overgeneralizing the applicability of Neural Architecture Search etc. and cherry-picking individual examples. There is an enormous gap between what gets published in academia and what’s actually useful.

E.g., compute wars have only intensified with TPUs and FPGAs. Sure, for training you might be okay with a few 1080 Tis, but good luck building any reliable, cheap, and low-latency service that uses DNNs. Similarly, big data for academia is a few terabytes, but real big data is petabytes of street-level imagery, video/audio, etc.

gjstein · 7y ago

Your last comment reminded me of this article [1] on "Google Maps's Moat", which discusses the vast resources that Google has poured into collecting data at a global scale to make Google Maps what it is.

[1] https://www.justinobeirne.com/google-maps-moat/

QML · 7y ago

Two notes:

1. This article is not talking about how, for neural networks, you can just have pretrained networks (where the cost of compute and data is incurred there) and then use them to classify images or whatnot on your decades-old computer. Correct?

2. Oftentimes, problems are “solved” in the sense that they become irrelevant. Is that also the case here? It seems compute and data were constraints, but technology (algorithms) just got more efficient. Should we not reframe this and say that algorithms are the constraint, then, and that’s what we should aspire to improve? Usually throwing compute and data at a problem only marginally improves results anyhow...

PaulHoule · 7y ago

The real "moat" is more and better training data for commercially useful tasks.

You can write a lot of papers about Penn Treebank data but I can't imagine anything you do with Penn Treebank will be commercially useful.

fizx · 7y ago

I feel like we're getting these huge gains on tasks that can be made faster via better architectures, regularization, normalization, data augmentation, etc., such that he's right.

I just wonder if it will ever feel this way for reinforcement learning.
