The Bitter Lesson (2019) (opens in new tab)

(incompleteideas.net)

98 pointsradkapital5y ago85 comments

85 comments

71 comments · 27 top-level

aszen5y ago· 10 in thread

Interesting, I wonder what happens now that Moore's law is considered dead and we can't rely on computation power increasing year over year. To make further progess with general purpose search and learning methods we will need lots more computational power which may not be cheaply available. Then do we focus our efforts on developing more efficient learning strategies like the one we have in our minds ?

I do agree with the part about not embedding human knowledge into our computer models, any knowledge worth learning about any domain the computer should be able learn on its own to make true progress in AI.

noanabeshima5y ago

The amount of compute used in the largest AI training runs has been exponentially growing:

https://openai.com/blog/ai-and-compute/

The amount of compute required for Imagenet classification has been exponentially decreasing:

https://openai.com/blog/ai-and-efficiency/

aglionby5y ago

My background is in NLP - I suspect we'll see similar in language processing models as we've seen in vision models. Consider this[1] article ("NLP's ImageNet moment has arrived"), comparing AlexNet in 2012 to the first GPT model 6 years later: we're just a few years behind.

True, GPT-2 and -3, RoBERTa, T5 etc. are all increasingly data- and compute-hungry. That's the 'tick' your second article mentions.

We simultaneously have people doing research in the 'tock' - reducing the compute needed. ICLR 2020 was full of alternative training schema that required less compute for similar performance (e.g. ELECTRA[2]). Model distillation is another interesting idea that reduces the amount of inference-time compute needed.

[1] https://thegradient.pub/nlp-imagenet/

[2] https://openreview.net/pdf?id=r1xMH1BtvB

aszen5y ago

Very interesting links, thanks for sharing.

So the trend isn't changing we still need bigger models to make progress in NLP and CV, while the algorithmic effeciencies are promising but they aren't giving anywhere near the same improvements as larger models.

I'm curious how long this trend will continue and if there's anything promising that can reverse this trend

1 more reply

abetusk5y ago

Moore's law might be dead but the deeper law is still alive.

Moore's law is technically "the number of transistors per unit area doubles every 24 months" [1]. The more important law is that the cost of transistors halves every 18-24 months.

That is, Moore's law talks about how many transistors we can pack into a unit area. The deeper issue is how much it costs. If we can only pack in a certain amount transistors per area but the cost drops exponentially, we still see massive gains.

There's also Wright's law that comes into play [3] that talks about dropping exponential costs just from institutional knowledge (2x in production leads to (.75-.9)x in cost).

[1] https://en.wikipedia.org/wiki/Moore%27s_law

[2] https://www.youtube.com/watch?v=Nb2tebYAaOA

[3] https://en.wikipedia.org/wiki/Experience_curve_effects

aszen5y ago

Agreed the cost aspect of Moore's law may continue to remain true, especially with chiplets with varying fab nodes and 3d architectures. Wright's law will also bring down costs as lower nm nodes mature.

But as mentioned in the comments below ai model training is increasing exponentially (compute required to train models has been doubling every 3.6 months) so it still far outstrips the cost savings.

edjrage5y ago

It really irks me that these things are called "laws". A law is something we expect to hold true forever, by means of the hypothetico-deductive scientific method.

They're phenomena. They're patterns we observe, and that's it. The pattern may change anytime, and that's something that should be expected. The causes may be known or unknown, but to call it a law may even make it hold true for longer, for "psychological" reasons. The law of gravity isn't influenced by what SpaceX investors think about it.

PeterisP5y ago

Can you elaborate why you think that Moore's law is considered dead? For me it seems that the general progress for the computing hardware in question (GPUs and specialized ASICs, not consumer CPUs) we're still seeing steady improvements in transistors/$ and flops/$ and expect it to still continue for some time at least.

aszen5y ago

Yes specialized hardware for AI are seeing steady improvements, I'm curious if these improvements rely on the particulars of the algorithms running on these machines. As an example several of the AI chips use lower precision floating point numbers than general CPUs since the algorithms in use for training nns don't need the higher precision.

I actually wonder if having specialized AI hardware isn't the same problem as having specialized AI models, that is in the short term it will improve efficiency but in the long run prevent discovery of newer general learning strategies because they won't run faster in existing specialized hardware.

hpoe5y ago

So I know Moore's law is "dead" (dead as in Cobol or dead as in Elvis?) and progress is definitely slower than it has been historically however we have only began to really start leveraging parallelization at scale from a software perspective, so I think we have some runway in that direction, and of course the looming elephant on the horizon, Quantum computing.

Sure it is in it's infancy but assuming that the research continues to prove that quantum computing is viable I expect it to be an even bigger deal than the move from vacuum tubes to transistors. At that point we'll be dealing with an entirely different world in computing.

nessunodoro5y ago

it's kind of poetic that the chief bottleneck of advancement in the field is now the physical universe -

YeGoblynQueenne5y ago· 9 in thread

>> In computer chess, the methods that defeated the world champion, Kasparov, in 1997, were based on massive, deep search.

"Massive, deep search" that started from a book of opening moves and the combined expert knowledge of several chess Grandmasters. And that was an instance of the minimax algorithm with alpha-beta cutoff, i.e. a search algorithm specifically designed for two-player, deterministic games like chess. And with a hand-crafted evaluation function, whose parameters were filled-in by self-play. But still, an evaluation function; because the minimax algorithm requires one and blind search alone did not, could not, come up with minimax, or with the concept of an evaluation function in a million years. Essentially, human expertise about what matters in the game was baked-in to Deep Blue's design from the very beginning and permeated every aspect of its design.

Of course, ultimately, search was what allowed Deep Blue to beat Kasparov (3½–2½; Kasparov won two games and drew another). That, in the sense that the alpha-beta minimax algorithm itself is a search algorithm and it goes without saying that a longer, deeper, better search will inevitably eventually outperform whatever a human player is doing, which clearly is not search.

But, rather than an irrelevant "bitter" lesson about how big machines can perfom more computations than a human, a really useful lesson -and one that we haven't yet learned, as a field- is why humans can do so well without search. It is clear to anyone who has played any board game that humans can't search ahead more than a scant few ply, even for the simplest games. And yet, it took 30 years (counting from the Dartmouth workshop) for a computer chess player to beat an expert human player. And almost 60 to beat one in Go.

No, no. The biggest question in the field is not one that is answered by "a deeper search". The biggest question is "how can we do that without a search"?

Also see Rodney Brook's "better lesson" [2] addressing the other successes of big search discussed in the article.

_____________

[1] https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)#Des...

[2] https://rodneybrooks.com/a-better-lesson/

mtgp10005y ago

>But, rather than an irrelevant "bitter" lesson about how big machines can perfom more computations than a human, a really useful lesson -and one that we haven't yet learned, as a field- is why humans can do so well without search

I think the answer is heuristics based on priors(e.g. board state), which we've demonstrated (with alphago and derivatives, especially alphago zero) that neural networks are readily able to learn.

This is why I get the impression that modern neural networks are quickly approaching humanlike reasoning - once you figure out how to

(1) encode (or train) heuristics and

(2) encode relationships between concepts in a manner which preserves a sort of topology (think for example of a graph where nodes represent generic ideas)

You're well on your way to artificial general reasoning - the only remaining question becomes one of hardware (compute, memory, and/or efficiency of architecture).

dreamcompiler5y ago

Are we certain that well-trained human players are not doing search? It's possible that a search subnetwork gets "compiled without debugger symbols" and the owner of the brain is simply unaware that it's happening.

YeGoblynQueenne5y ago

>> Are we certain that well-trained human players are not doing search?

Yes- because human players can only search a tiny portion of a game tree and a minimax search of the same extent is not even sufficient to beat a dedicated human in tic-tac-to, leta lone chess. That is, unless one wishes to countenance the possibility of an "unconscious search" which of course might as well be "the grace of God" or any such hand-wavy non-explanation.

>> It's possible that a search subnetwork gets "compiled without debugger symbols" and the owner of the brain is simply unaware that it's happening.

Sorry, I don't understand what you mean.

2 more replies

gwern5y ago

I'm not sure why YeGoblynQueenne thinks this is such a mystery. (This is not the first time I've been puzzled by their pessimism on HN.) There is no mystery here: AlphaZero shows that you can get superhuman performance by searching only a few ply by sufficiently good pattern recognition in a highly parameterized and well-trained value function, and MuZero makes this point even more emphatically by doing away with the formal search entirely in favor of an more abstract recurrent pondering. What more is there to say?

2 more replies

bsder5y ago

> Are we certain that well-trained human players are not doing search?

Some, but there's a LOT more context pruning the search space.

Watch some of the chess grandmasters play and miss obvious winning moves. Why? "Well, I didn't bother looking at that because <insert famous grandmaster> doesn't just hang a rook randomly."

new26285y ago

At least in chess, if it is not the search, then it is probably the evaluation function.

Expert players have likely a very well-tuned evaluation function of how strong a board "feels". Some of it is explainable easily: center domination, diagonal bishop, connected pawn structure, rook supporting pawn from behind, others are more elaborate, come with experience and harder to verbalize.

When expert players play against computers, the limitation of their evaluation function becomes visible. Some board may feel strong, but you are missing some corner case that the minmax search observes and exploits.

YeGoblynQueenne5y ago

I like to caution against taking concepts from computer science and AI and applying them directly to the way the human mind works. Unless we know that a player is applying a specific evaluation function (e.g. because they tell us, or because they vocalise their thought process etc) then even suggesting that "players have an evaluation function" is extrapolating far from what it is safe. For one thing- what does a "function" look like in the human mind?

Whatever human minds do, computing is only a very general metaphor for it and it's very risky to assume we understand anything about our mind just because we understand our computers.

burntoutfire5y ago

> No, no. The biggest question in the field is not one that is answered by "a deeper search". The biggest question is "how can we do that without a search"?

My guess is that we're doing pattern recognision, where we recognize taht a current game state is similar to a situation that we've been in before (in some previous game), and recall the strategy we took and the outcomes it had lead to. With large enough body of experience, you can to remember lots of past attempted strategies for every kind of game state (of course, within some similarity distance).

blt5y ago

This insight is the essence of the AlphaZero architecture. Whereas a pure Monte Carlo Tree Search (MCTS) starts each node in the search tree with a uniform distribution over actions, AlphaZero trains a neural network to observe the game state and output a distribution over actions. This distribution is optimized to be as similar as possible to the distribution obtained from running MCTS from that state in the past. It's very similar to the way humans play games.

JoeAltmaier5y ago· 5 in thread

Got to believe, this is like heroin. Its a win until it isn't. Then where will AI researchers be? No progress for 20 (50?) years because the temptation to not understand but to just build performant engineering solutions, was so strong.

In fact, is the researcher supposed to be building the most performant solution? This article seems alarmingly misinformed. To understand 'artificial intelligence' isn't a race to VC money.

otoburb5y ago

>>This article seems alarmingly misinformed.

I hate appeals to authority as much as anybody else on HN, but I'm not sure that we could say Rich Sutton[1] is "misinformed". He's an established expert in the field, and if we discount his academic credentials then at least consider he's understandably biased towards this line of thinking as one of the early pioneers of reinforcement learning techniques[2] and currently a research scientist at DeepMind leading their office in Alberta, Canada.

[1] https://en.wikipedia.org/wiki/Richard_S._Sutton

[2] http://incompleteideas.net/papers/sutton-88-with-erratum.pdf

JoeAltmaier5y ago

He's writing that article for a reason, to be sure. Its just not the one that the article says its about, I'm thinking.

visarga5y ago

AI as a field relied mostly on 'understanding' based approaches for 50 years without much success. These approaches were too brittle and ungrounded. Why return to something that doesn't work?

DNNs today can generate images that are hard to distinguish from real photos, super natural voices and surprisingly good text. They can beat us at all board games and most video games. They can write music and poetry better than the average human. Probably also drive better than an average human. Why worry about 'no progress for 50 years' at this point?

JoeAltmaier5y ago

Because, they can't invent a new game. Unless of course they were only designed to invent games, and by trial and error and statistical correlation to existing games, thus producing a generic thing that relates to everything but invents nothing.

I'm not an idiot. I understand that we won't have general purpose thinking machines any time soon. But to give up entirely looking into that kind of thing, seems to me to be a mistake. To rebrand the entire field as calculating results to given problems and behaviors using existing mathematical tools, seems to do a disservice to the entire concept and future of artificial intelligence.

Imagine if the field of mathematics were stumped for a while, so investigators decided to just add up things faster and faster, and call that Mathematics.

1 more reply

YeGoblynQueenne5y ago

>> AI as a field relied mostly on 'understanding' based approaches for 50 years without much success. These approaches were too brittle and ungrounded. Why return to something that doesn't work?

To begin with, because they do work and much better than the new approaches in a range of domains. For example, classical planners, automated theorem provers and SAT solvers are still state-of-the-art for their respective problem domains. Statistical techniques can not do any of those things very well, if at all.

Further, because the newer techniques have proven to also be brittle in their own way. Older techniques were "brittle in the sense that they didn't deal with uncertainty very well. Modern techniques are "brittle" because they are incapable of extrapolating from their training data. For example see the "elephant in the room" paper [1] or anything about adversarial examples regarding the brittleness of computer vision (probably the biggest success in modern statistical machine learning).

Finally, AI as a field did not rely on "understanding based approaches for 50 years"; there is no formal definition of "understanding" in the context of AI. A large part of Good, Old-Fashioned AI studied reasoning, which is to say, inference over rules expressed in a logic language, e.g. this was the approach exemplified by expert systems. Another large avenue of research was that on knowledge representation. And of course, machine learning itself was part of the field from its very early days, having been named by Arthur Samuel in 1959. Neural networks themselves are positively ancient: the "artifical neuron" was first described in 1938, by Pitts & McCulloch, many years before "artificial intelligence" was even coined by John McCarthy (and at the time it was a propositional-logic based circuit and nothing to do with gradient optimisation).

In general, all those obsolete dinosaurs of GOFAI could do things that modern systems cannot - for instance, deep neural nets are unrivalled classifiers but cannot do reasoning. Conversely, logic-based AI of the '70s and '80s excelled in formal reasoning. It seems that we have "progressed" by throwing out all the progress of earlier times.

____________

[1] https://arxiv.org/abs/1808.03305

P.S. Image, speech and text generation are cute, but a very poor measure for the progress of the field. There are not even good metrics for them so even saying that deep neural nets can "generate surprisingly good text" doesn't really say anything. What is "surprisingly good text"? Surprising, for whom? Good, according to what? etc. GOFAI folk were often accused of wastig time with "toy" problems, but what exactly is text generation if not a "toy problem" and a total waste of time?

vlmutolo5y ago· 4 in thread

It’s funny when you’ve been thinking for months about how speech recognition could really benefit from integrating models of the human vocal tract…

and then you read this

sqrt175y ago

Here's a thing: incorrect assumptions that are built into a model are more harmful than a model that assumes too little structure. If you model the vocal tract and the actual exciting things are the transient noises that occur when we produce consonants, at best there's lots of work with not much to show and at worst you're limiting your model in a negative way. That's the basis for the "every time we fired a linguist, recognition rates improved" from 90s speech recognition.

On the other end of the spectrum, data and compute ARE limited and for some tasks we're at a point where the model eats up all the humanity's written works and a couple million dollars in compute and further progress has to come from elsewhere because even large companies won't spend billions of dollars in compute and humanity will not suddenly write ten times more blog articles.

visarga5y ago

I think we're far from having used all the media on the internet to train a model. GPT-3 used about 570GB of text (about 50M articles). ImageNet is just 1.5M photos. It's still expensive to ingest the whole YouTube, Google Search and Google Photos in a single model.

And the nice thing about these large models is that you can reuse them with little fine-tuning for all sorts of other tasks. So the industry and any hacker can benefit from these uber-models without having to retrain from scratch. Of course, if they even fit the hardware available, otherwise they have to make due with a slightly lower performance.

1 more reply

PeterisP5y ago

I think that your particular example is very relevant.

Of course a good speech recognition system needs to model all the relevant characteristics of the human vocal tract as such, and of the many different vocal tracts of individual humans!

But this is substantially different from the notion of integrating a human-made model of the human vocal tract.

In this case the bitter lesson (which, as far as I understand, does apply to vocal tract modeling - I don't personally work on speech recognition but colleagues a few doors down do) is that if you start with some data about human voice and biology; you develop some explicit model M, and then integrate it into your system, then it does not work as well if you properly design a system that will learn speech recognition on the whole, learning an implicit model M' of the relevant properties of the vocal tract (and the distribution of these properties in different vocal tracts) as a byproduct of that, given sufficient data.

A hypothesis (which does need more research to be demonstrated, though, but we have some empirical evidence for similar things in most aspects of NLP) on the reason for this is that the human-made model M can't be as good as the learned model because it's restricted by the need to be understandable by humans. It's simplified and regularized and limited in size so that it can be reasonably developed, described, analyzed and discussed by humans - but there's no reason to suppose that the ideal model that would perfectly match reality is simple enough for that; it may well be reducible to a parameteric function that simply has too many parameters to be neatly summarizable to a human-understandable size without simplifying in ways that cost accuracy.

gwern5y ago

"Every time I fire an anatomist and hire a TPU pod, my WER halves."

ksdale5y ago· 2 in thread

I think it's plausible that many technological advances follow a similar. Something like the steam engine is a step-improvement, but many of the subsequent improvements are basically the obvious next step, implemented once steel is strong enough, or machining precise enough, or fuel is refined enough. How many times has the world changed qualitatively, simply in the pursuit of making things quantitatively bigger or faster or stronger?

I can certainly see how it could be considered disappointing that pure intellect and creativity doesn't always win out, but I, personally, don't think it's bitter.

I also have a pet theory that the first AGI will actually be 10,000 very simple algorithms/sensors/APIs duct-taped together running on ridiculously powerful equipment rather than any sort of elegant Theory of Everything, and this wild conjecture may make me less likely to think this a bitter lesson...

StevenWaterman5y ago

I agree, the first AGI probably will be bodget together with loads of expert input. However, that's not evidence against the bitter lesson.

The first of anything is usually made with the help of experts, but they're quickly overtaken by general methods that lever additional computation

ksdale5y ago

Sorry, I didn't mean to suggest that the bitter lesson is wrong, just that it's not bitter, it's actually how a whole bunch of stuff progresses.

1 more reply

fxtentacle5y ago· 2 in thread

The current top contender on AI optical flow uses LESS CPU and LESS RAM than last year's leader. As such, I strongly disagree with the article.

Yes, many AI fields have become better from improved computational power. But this additional computational power has unlocked architectural choices which were previously impossible to execute in a timely manner.

So the conclusion may equally well be that a good network architecture results in a good result. And if you cannot use the right architecture due to RAM or CPU constraints, then you will get bad results.

And while taking an old AI algorithm and re-training it with 2x the original parameters and 2x the data does work and does improve results, I would argue that that's kind of low-level copycat "research" and not advancing the field. Yes, there's a lot of people doing it, but no, it's not significantly advancing the field. It's tiny incremental baby steps.

In the area of optical flow, this year's new top contenders introduce many completely novel approaches, such as new normalization methods, new data representations, new nonlinearities and a full bag of "never used before" augmentation methods. All of these are handcrafted elements that someone built by observing what "bug" needs fixing. And that easily halved the loss rate, compared to last year's architectures, while using LESS CPU and RAM. So to me, that is clear proof of a superior network architecture, not of additional computing power.

jmole5y ago

Yup - and this year's top CPUs have almost 10x the performance per watt of CPUs from even 2-3 years ago [0]

Raw computation is only half the story. The other half is: what the hell do we do with all these extra transistors? [1]

0 - https://www.cpubenchmark.net/power_performance.html

1 - https://youtu.be/Nb2tebYAaOA?t=2167

fxtentacle5y ago

Any day now people will start compiling old programs to web assembly so that you can wrap them with election, instead of compiling them to machine code. Once that happens, we have generated another 3 years of demand for Moore's law X_X

maest5y ago· 2 in thread

For contrast, take this Hofstadter quote:

> This, then, is the trillion-dollar question: Will the approach undergirding AI today—an approach that borrows little from the mind, that’s grounded instead in big data and big engineering—get us to where we want to go? How do you make a search engine that understands if you don’t know how you understand? Perhaps, as Russell and Norvig politely acknowledge in the last chapter of their textbook, in taking its practical turn, AI has become too much like the man who tries to get to the moon by climbing a tree: “One can report steady progress, all the way to the top of the tree.”

My take is that there is something intelectually unsatisfying about solving a problem by simply throwing more computational power at it, instead of trying to understand it better.

Imagine in a parallel universe where computational power is extremely cheap. In this universe, people solve integrals exclusively by numerical integrations so there is no incentive to develop any of the Analysis theory we currently have. I would expect that to be a net negative in the long run as theories like Gen Relativity would be almost impossible to develop without the current mathematical apparatus.

YeGoblynQueenne5y ago

Where is this quote from, please?

To play devil's advocate, I think retort to your comment about "intellectually satisfying" methods is "yeah, but, they work". And in any case, "intellectually satisfying" doesn't have a formal definition in computer science or AI so it can't very well be a goal, as such.

My own concern is exactly what Russel & Norvig seem to say in Hofstadter's comment: by spending all our resources on clmbing the tallest trees to get to the moon, we're falling behind from our goal, of ever getting to the moon. That's even more so if the goal is to use AI to understand our own mind, rather than to beat a bunch of benchmarks.

self5y ago

The quote is from this article:

https://www.theatlantic.com/magazine/archive/2013/11/the-man...

1 more reply

cgearhart5y ago· 2 in thread

I have read this before and broadly agree with the point—it’s no use trying to curate expertise into AI. But I don’t think modeling p(y|x) or it’s friend p(y, x) is the end we’re looking for either. But, it’s unreasonably effective, so we keep doing it. (I don’t have an answer or an alternative; causality appeals to my intuition, but it’s really clunky and has seemingly not paid off.)

sgt1015y ago

Actually I feel like causalities time has come. The framework that has convinced me is just the simple approach of doing controlled experiments over observational data to establish causal links via DAGs no need for any drama!

cgearhart5y ago

It seems to be just shuffling around the hard part of the problem. Causality still depends on some unstructured optimization problem of generating and evaluating causal diagram candidates. I haven’t really seen it applied where the set of potential causal relationships is huge.

francoisp5y ago· 2 in thread

building a model for and with domain knowledge == premature optimization? In the end a win on kaggle or a published paper seems to depend on tweaking hyperparameters based on even more pointed DK: data set knowledge...

I wonder what would be required to build a model that explores the search space of compilable programs in say python that sorts in correct order. Applying this idea of using ML techniques to finding better "thinking" blocks for silicon seems promising.

YeGoblynQueenne5y ago

>> I wonder what would be required to build a model that explores the search space of compilable programs in say python that sorts in correct order.

Oh, not that much. You could do that easily with a small computer and an infinite amount of time.

francoisp5y ago

With Deep Thought and an infinite amount of time, you can get the answer to life the universe and everything...

kdoherty5y ago· 1 in thread

Potentially also of interest is Rod Brooks' response "A Better Lesson" (2019): https://rodneybrooks.com/a-better-lesson/

bnjmn5y ago

"Potentially" is an understatement! A much better take, IMO.

auggierose5y ago· 1 in thread

I guess it depends on what you trying to do. I had a computer vision problem where I was like, hell yeah, let's machine learn the hell out of this. 2 months later, and the results were just not precise enough. It took me 2 more months, and now I am solving the task easily on an iPhone via Apple Metal in milliseconds with a hand-crafted optimisation approach ...

jefft2555y ago

His advice really concerns more scientific research and its long-term progress, and not really immediate applications. I think that injecting human knowledge can lead to faster, more immediate progress, and he seems to believe that too. The "bitter lesson" is that general, data-driven approaches will always win out eventually.

sytse5y ago· 1 in thread

The article says we should focus on increasing the compute we use in AI instead of embedding domain specific knowledge. OpenAI seems to have taken this lesson to heart. They are training a generic model using more compute than anything else.

Many researchers predict a plateau for AI because it is missing the domain specific knowledge but this article and the benefits of more compute that OpenAI is demonstrating beg to differ.

throwaway72815y ago

Model compression is an active research field and will probably be quite lucrative, as you will literally able to save millions.

glitchc5y ago· 1 in thread

When it comes to games, exploitation (of tendencies, weaknesses), misdirection, subterfuge and yomi play a far bigger role in winning than actual skill. Humans are much better than computers at all of those. Perhaps a dubious honour, but an advantage nonetheless. We're only really in trouble when the machine learns to reliably replicate the same tactics.

elcomet5y ago

I think that computers managed to beat humans at poker already. (Online poker, which is different from physical games, where of course AI cannot compete)

KKKKkkkk15y ago· 1 in thread

Today Elon Musk announced that Tesla is going to reach level-5 autonomy by the end of the year. Specifically

There are no fundamental challenges remaining for level-5 autonomy. There are many small problems. And then there's the challenge of solving all those small problems and then putting the whole system together. [0]

I feel like this year is going to be another year in which the proponents of brute-force AI like Elon and Sutton will learn a bitter lesson.

[0] https://twitter.com/yicaichina/status/1281149226659901441

typon5y ago

Elon Musk announcing something doesn't make it true

lambdatronics5y ago· 1 in thread

TL;DR: AI needs a hand up, not a handout. "We want AI agents that can discover like we can, not which contain what we have discovered." I was internally protesting all the way through the note, until I got to that penultimate sentence.

rbecker5y ago

Yeah, it takes a careful, charitable reading to not interpret it as "don't bother with understanding or finding new methods, just throw more FLOPS at it".

astrophysician5y ago

I think what he's basically saying is that priors (i.e. domain knowledge + custom, domain-inspired models) help when you're data limited or when your data is very biased, but once that's not the case (e.g. we have an infinite supply of voice samples), model capacity is usually all that matters.

dyukqu5y ago

Previous discussion: https://news.ycombinator.com/item?id=19393432

koeng5y ago

This lesson can be applied to synthetic biology right now, though it is still in its infant stages.

At least a few of the original synthetic biologists are a bit disappointed in the rise of high-throughput testing for everything, instead of "robust engineering". Perhaps what allows us to understand life isn't just more science, but more "biotech computation".

ruuda5y ago

A slightly more recent post, that really opened my eyes to this insight (and references The Bitter Lesson) is this piece by Gwern on the scaling hypothesis: https://www.gwern.net/newsletter/2020/05#gpt-3

throwaway72815y ago

This reminds me of the Banko and Brill paper "Scaling to very very large corpora for natural language disambiguation" - https://dl.acm.org/doi/10.3115/1073012.1073017.

It is exactly the point and it is something not a lot of researchers really grok. As a researcher you are so smart, why can't you discover whatever you are seeking? I think in this decade, we see a couple more scientific discoveries by brute force which will hopefully will make the scientific type a bit more humble an honest.

coldtea5y ago

>At the time, this was looked upon with dismay by the majority of computer-chess researchers who had pursued methods that leveraged human understanding of the special structure of chess.

This seems problematic as a concept in itself.

Sure human players have a "human understanding of the special structure of chess". But what makes them play could be an equally "deep search" and fuzzy computations done in the brain that and not some conscious step by step reasoning. Or rather, their "conscious step by step reasoning" to my opinion probably relies on tops on subconscious deep search in the brain for pruning the possible moves, etc.

I don't think anybody plays chess at any great level merely by making conscious step by step decisions.

Similar to how when we want to catch a ball thrown at us, we do some thinking like "they threw it to our right, so we better move right" but we also have tons of subconscious calculations of the trajectory (nobody sits and explicitly calculates the parabolic formula when they're thrown a baseball).

overhyp5y ago

I would like to offer what I believe is a counterpoint, but I am not a trained ML researcher so I am not sure if it is even a counter-point. Maybe it is just an observation.

I recently participated in the following Kaggle competition:

https://www.kaggle.com/allen-institute-for-ai/CORD-19-resear...

Now, you can see the kinds of questions the contest expects the ML to answer, just to take an example:

"Effectiveness of movement control strategies to prevent secondary transmission in health care and community settings"

All I can say is that the contest results, on the whole, were completely underwhelming. You can check out the Contributions page to verify this for yourself. If the consequences of the failure weren't so potentially catastrophic, some might even call it a little comical. I mean, its not as if a pandemic comes around every few months, so we can all just wait for the computational power to catch up to solve these problems like the author suggests.

Also, I couldn't help but feel that nearly all participants were more interested in applying the latest and greatest ML advancement (Bert QA!), often with no regard to the problem which was being solved.

I wish I could tell you I have some special insight into a better way to solve it, given that there is a friggin pandemic going on, and we could all very well do with some real friggin answers! I don't have any such special insight at all. All I found out was that everyone was so obsessed with using the latest and greatest ML techniques, that there was practically no first principles thinking. At the end, everyone just sort of got too drained and gave up, which is reflected by a single participant winning pretty much the entire second round of 7-8 task prizes by the virtue of being the last man standing :-)

I have realized two things.

1) ML, at least when it comes to understanding text, is really overhyped

2) Nearly everyone who works in ML research is probably overpaid by a factor of 100 (just pulling some number out of my you know what), given that the results they have actually produced have fallen so short precisely when they were so desperately needed

sidpatil5y ago

http://norvig.com/chomsky.html

avmich5y ago

> When a simpler, search-based approach with special hardware and software proved vastly more effective, these human-knowledge-based chess researchers were not good losers.

It's like calling Russia a loser in Cold War. Technically the effect is reached; practically the side which "lost" gained possibly largest benefits.

annoyingnoob5y ago

That is a wall of words, I can't even read it in that format.

totally_a_human5y ago

This page seems to be down. Is there a mirror?

mtgp10005y ago

>We want AI agents that can discover like we can, not which contain what we have discovered. Building in our discoveries only makes it harder to see how the discovering process can be done.

I think these lessons are less appropriate as our hardware and our understanding of neural networks improve. An agent which is able to [self] learn complex probabilistic relationships between inputs and outputs (i.e. heuristics) requires a minimum complexity/performance, both in hardware and neural network design, before any sort of useful[self] learning is possible. We've only recently crossed that threshold (5-10 years ago)

>The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin

Admittedly, I'm not quite sure of the author's point. They seem to indicate that there is a trade-off between spending time optimizing the architecture and baking in human knowledge.

If that's the case, I would argue that there is an impending perspective shift in the field of ML, wherein "human knowledge" is not something to hardcode explicitly, but instead is implicitly delivered through a combination of appropriate data curation and design of neural networks which are primed to learn certain relationships.

That's the future and we're just collectively starting down that path - it will take some time for the relevant human knowledge to accumulate.

j / k navigate · click thread line to collapse

85 comments

71 comments · 27 top-level

aszen5y ago· 10 in thread

noanabeshima5y ago

The amount of compute used in the largest AI training runs has been exponentially growing:

https://openai.com/blog/ai-and-compute/

The amount of compute required for Imagenet classification has been exponentially decreasing:

https://openai.com/blog/ai-and-efficiency/

aglionby5y ago

True, GPT-2 and -3, RoBERTa, T5 etc. are all increasingly data- and compute-hungry. That's the 'tick' your second article mentions.

[1] https://thegradient.pub/nlp-imagenet/

[2] https://openreview.net/pdf?id=r1xMH1BtvB

aszen5y ago

Very interesting links, thanks for sharing.

I'm curious how long this trend will continue and if there's anything promising that can reverse this trend

1 more reply

abetusk5y ago

Moore's law might be dead but the deeper law is still alive.

Moore's law is technically "the number of transistors per unit area doubles every 24 months" [1]. The more important law is that the cost of transistors halves every 18-24 months.

There's also Wright's law that comes into play [3] that talks about dropping exponential costs just from institutional knowledge (2x in production leads to (.75-.9)x in cost).

[1] https://en.wikipedia.org/wiki/Moore%27s_law

[2] https://www.youtube.com/watch?v=Nb2tebYAaOA

[3] https://en.wikipedia.org/wiki/Experience_curve_effects

aszen5y ago

But as mentioned in the comments below ai model training is increasing exponentially (compute required to train models has been doubling every 3.6 months) so it still far outstrips the cost savings.

edjrage5y ago

It really irks me that these things are called "laws". A law is something we expect to hold true forever, by means of the hypothetico-deductive scientific method.

PeterisP5y ago

aszen5y ago

hpoe5y ago

nessunodoro5y ago

it's kind of poetic that the chief bottleneck of advancement in the field is now the physical universe -

YeGoblynQueenne5y ago· 9 in thread

>> In computer chess, the methods that defeated the world champion, Kasparov, in 1997, were based on massive, deep search.

No, no. The biggest question in the field is not one that is answered by "a deeper search". The biggest question is "how can we do that without a search"?

Also see Rodney Brook's "better lesson" [2] addressing the other successes of big search discussed in the article.

_____________

[1] https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)#Des...

[2] https://rodneybrooks.com/a-better-lesson/

mtgp10005y ago

I think the answer is heuristics based on priors(e.g. board state), which we've demonstrated (with alphago and derivatives, especially alphago zero) that neural networks are readily able to learn.

This is why I get the impression that modern neural networks are quickly approaching humanlike reasoning - once you figure out how to

(1) encode (or train) heuristics and

(2) encode relationships between concepts in a manner which preserves a sort of topology (think for example of a graph where nodes represent generic ideas)

You're well on your way to artificial general reasoning - the only remaining question becomes one of hardware (compute, memory, and/or efficiency of architecture).

dreamcompiler5y ago

YeGoblynQueenne5y ago

>> Are we certain that well-trained human players are not doing search?

>> It's possible that a search subnetwork gets "compiled without debugger symbols" and the owner of the brain is simply unaware that it's happening.

Sorry, I don't understand what you mean.

2 more replies

gwern5y ago

2 more replies

bsder5y ago

> Are we certain that well-trained human players are not doing search?

Some, but there's a LOT more context pruning the search space.

Watch some of the chess grandmasters play and miss obvious winning moves. Why? "Well, I didn't bother looking at that because <insert famous grandmaster> doesn't just hang a rook randomly."

new26285y ago

At least in chess, if it is not the search, then it is probably the evaluation function.

YeGoblynQueenne5y ago

Whatever human minds do, computing is only a very general metaphor for it and it's very risky to assume we understand anything about our mind just because we understand our computers.

burntoutfire5y ago

> No, no. The biggest question in the field is not one that is answered by "a deeper search". The biggest question is "how can we do that without a search"?

blt5y ago

JoeAltmaier5y ago· 5 in thread

In fact, is the researcher supposed to be building the most performant solution? This article seems alarmingly misinformed. To understand 'artificial intelligence' isn't a race to VC money.

otoburb5y ago

>>This article seems alarmingly misinformed.

[1] https://en.wikipedia.org/wiki/Richard_S._Sutton

[2] http://incompleteideas.net/papers/sutton-88-with-erratum.pdf

JoeAltmaier5y ago

He's writing that article for a reason, to be sure. Its just not the one that the article says its about, I'm thinking.

visarga5y ago

AI as a field relied mostly on 'understanding' based approaches for 50 years without much success. These approaches were too brittle and ungrounded. Why return to something that doesn't work?

JoeAltmaier5y ago

Imagine if the field of mathematics were stumped for a while, so investigators decided to just add up things faster and faster, and call that Mathematics.

1 more reply

YeGoblynQueenne5y ago

>> AI as a field relied mostly on 'understanding' based approaches for 50 years without much success. These approaches were too brittle and ungrounded. Why return to something that doesn't work?

____________

[1] https://arxiv.org/abs/1808.03305

vlmutolo5y ago· 4 in thread

It’s funny when you’ve been thinking for months about how speech recognition could really benefit from integrating models of the human vocal tract…

and then you read this

sqrt175y ago

visarga5y ago

1 more reply

PeterisP5y ago

I think that your particular example is very relevant.

Of course a good speech recognition system needs to model all the relevant characteristics of the human vocal tract as such, and of the many different vocal tracts of individual humans!

But this is substantially different from the notion of integrating a human-made model of the human vocal tract.

gwern5y ago

"Every time I fire an anatomist and hire a TPU pod, my WER halves."

ksdale5y ago· 2 in thread

I can certainly see how it could be considered disappointing that pure intellect and creativity doesn't always win out, but I, personally, don't think it's bitter.

StevenWaterman5y ago

I agree, the first AGI probably will be bodget together with loads of expert input. However, that's not evidence against the bitter lesson.

The first of anything is usually made with the help of experts, but they're quickly overtaken by general methods that lever additional computation

ksdale5y ago

Sorry, I didn't mean to suggest that the bitter lesson is wrong, just that it's not bitter, it's actually how a whole bunch of stuff progresses.

1 more reply

fxtentacle5y ago· 2 in thread

The current top contender on AI optical flow uses LESS CPU and LESS RAM than last year's leader. As such, I strongly disagree with the article.

jmole5y ago

Yup - and this year's top CPUs have almost 10x the performance per watt of CPUs from even 2-3 years ago [0]

Raw computation is only half the story. The other half is: what the hell do we do with all these extra transistors? [1]

0 - https://www.cpubenchmark.net/power_performance.html

1 - https://youtu.be/Nb2tebYAaOA?t=2167

fxtentacle5y ago

maest5y ago· 2 in thread

For contrast, take this Hofstadter quote:

My take is that there is something intelectually unsatisfying about solving a problem by simply throwing more computational power at it, instead of trying to understand it better.

YeGoblynQueenne5y ago

Where is this quote from, please?

self5y ago

The quote is from this article:

https://www.theatlantic.com/magazine/archive/2013/11/the-man...

1 more reply

cgearhart5y ago· 2 in thread

sgt1015y ago

cgearhart5y ago

francoisp5y ago· 2 in thread

YeGoblynQueenne5y ago

>> I wonder what would be required to build a model that explores the search space of compilable programs in say python that sorts in correct order.

Oh, not that much. You could do that easily with a small computer and an infinite amount of time.

francoisp5y ago

With Deep Thought and an infinite amount of time, you can get the answer to life the universe and everything...

kdoherty5y ago· 1 in thread

Potentially also of interest is Rod Brooks' response "A Better Lesson" (2019): https://rodneybrooks.com/a-better-lesson/

bnjmn5y ago

"Potentially" is an understatement! A much better take, IMO.

auggierose5y ago· 1 in thread

jefft2555y ago

sytse5y ago· 1 in thread

Many researchers predict a plateau for AI because it is missing the domain specific knowledge but this article and the benefits of more compute that OpenAI is demonstrating beg to differ.

throwaway72815y ago

Model compression is an active research field and will probably be quite lucrative, as you will literally able to save millions.

glitchc5y ago· 1 in thread

elcomet5y ago

I think that computers managed to beat humans at poker already. (Online poker, which is different from physical games, where of course AI cannot compete)

KKKKkkkk15y ago· 1 in thread

Today Elon Musk announced that Tesla is going to reach level-5 autonomy by the end of the year. Specifically

I feel like this year is going to be another year in which the proponents of brute-force AI like Elon and Sutton will learn a bitter lesson.

[0] https://twitter.com/yicaichina/status/1281149226659901441

typon5y ago

Elon Musk announcing something doesn't make it true

lambdatronics5y ago· 1 in thread

rbecker5y ago

Yeah, it takes a careful, charitable reading to not interpret it as "don't bother with understanding or finding new methods, just throw more FLOPS at it".

astrophysician5y ago

dyukqu5y ago

Previous discussion: https://news.ycombinator.com/item?id=19393432

koeng5y ago

This lesson can be applied to synthetic biology right now, though it is still in its infant stages.

ruuda5y ago

throwaway72815y ago

This reminds me of the Banko and Brill paper "Scaling to very very large corpora for natural language disambiguation" - https://dl.acm.org/doi/10.3115/1073012.1073017.

coldtea5y ago

>At the time, this was looked upon with dismay by the majority of computer-chess researchers who had pursued methods that leveraged human understanding of the special structure of chess.

This seems problematic as a concept in itself.

I don't think anybody plays chess at any great level merely by making conscious step by step decisions.

overhyp5y ago

I would like to offer what I believe is a counterpoint, but I am not a trained ML researcher so I am not sure if it is even a counter-point. Maybe it is just an observation.

I recently participated in the following Kaggle competition:

https://www.kaggle.com/allen-institute-for-ai/CORD-19-resear...

Now, you can see the kinds of questions the contest expects the ML to answer, just to take an example:

"Effectiveness of movement control strategies to prevent secondary transmission in health care and community settings"

I have realized two things.

1) ML, at least when it comes to understanding text, is really overhyped

sidpatil5y ago

http://norvig.com/chomsky.html

avmich5y ago

> When a simpler, search-based approach with special hardware and software proved vastly more effective, these human-knowledge-based chess researchers were not good losers.

It's like calling Russia a loser in Cold War. Technically the effect is reached; practically the side which "lost" gained possibly largest benefits.

annoyingnoob5y ago

That is a wall of words, I can't even read it in that format.

totally_a_human5y ago

This page seems to be down. Is there a mirror?

mtgp10005y ago

>We want AI agents that can discover like we can, not which contain what we have discovered. Building in our discoveries only makes it harder to see how the discovering process can be done.

>The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin

Admittedly, I'm not quite sure of the author's point. They seem to indicate that there is a trade-off between spending time optimizing the architecture and baking in human knowledge.

That's the future and we're just collectively starting down that path - it will take some time for the relevant human knowledge to accumulate.

j / k navigate · click thread line to collapse