undefined | Better HN

0 pointsadw2y ago0 comments

Yeah, all of those architectures are _themselves_ hacks to get around having insufficient compute! They absolutely were encoding inductive biases into the network to get around not being able to train enough, and transformers (handwaving hard enough to levitate, the currently-trainable model family with the least inductive bias) have eaten the world in all domains.

This is evidence _for_ the Bitter Lesson, not against it.

0 comments

4 comments · 1 top-level

YeGoblynQueenne2y ago· 3 in thread

They haven't (eaten the world etc). They just happen to be the models that trend hard right now. I bet if you could compare like for like you'd be able to see some improvement in performance from Transformers, but that 'd be extremely hard to separate from the expected improvement from the constantly increasing amounts of data and compute. For example, you could, today, train a much bigger and deeper Multi-Layered Perceptron than you could thirty years ago, but nodoy is trying because that's so 1990's, and in any case they have the data and compute to train much bigger, much more inefficient (contrary to what you say if I got that right) architectures.

Wait a few years and the Next Big Thing in AI will come along, hot on the heels of the next generation of GPUs, or tensor units or whatever the hardware industry can cook up to sell shovels for the gold rush. By then, Transfomers will have hit the plateau of diminishing returns, there'll be gold in them there other hills and nobody would talk of LLMs anymore because that's so 2020s. We've been there so many times before.

adwOP2y ago

> much more inefficient

The tricky part here is that "efficiency" is not a single dimension! Transformers are much more "efficient" in one sense, in that they appear to be able to absorb much more data before they saturate; they're in general less computationally efficient in that you can't exploit symmetries as hard, for example, at implementation time.

Let's talk about that in terms of a concrete example: the big inductive bias of CNNs for vision problems is that CNNs essentially presuppose that the model should be translation-invariant. This works great — speeds up training and makes it more stable – until it doesn't and that inductive bias starts limiting your performance, which is in the large-data limit.

Fully-connected NNs are more general than transformers, but they have _so many_ degrees of freedom that the numerical optimization problem is impractical. If someone figures out how to stabilize that training and make these implementable on current or future hardware, you're absolutely right that you'll see people use them. I don't think transformers are magic; you're entirely correct in saying that they're the current knee on the implementability/trainability curve, and that can easily shift given different unit economics.

I think one of the fundamental disconnects here is that people who come at AI from the perspective of logic down think of things very differently to people like me who come at it from thermodynamics _up_.

Modern machine learning is just "applications of maximum entropy", and to someone with a thermodynamics background, that's intuitively obvious (not necessarily correct! just obvious) –in a meaningful sense the _universe_ is a process of gradient descent, so "of course" the answer for some local domain models is maximum-entropy too. In that world view, the higher-order structure is _entirely emergent_. I'm, by training, a crystallographer, so the idea that you can get highly regular structure emerging from merciless application of a single principle is just baked into my worldview very deeply.

Someone who comes at things from the perspective of mathematical logic is going to find that worldview very weird, I suspect.

YeGoblynQueenne2y ago

>> Let's talk about that in terms of a concrete example: the big inductive bias of CNNs for vision problems is that CNNs essentially presuppose that the model should be translation-invariant. This works great — speeds up training and makes it more stable – until it doesn't and that inductive bias starts limiting your performance, which is in the large-data limit.

I don't know about that, I'll be honest. Do you have a reference? I suspect it won't disagree with what I'm saying, that neural nets just can't use strong enough bias to avoid overfitting. I didn't say that in so many words, above, but that's the point of having a good inductive bias, that you're not left, as a learner, to the mercy of the data.

>> Someone who comes at things from the perspective of mathematical logic is going to find that worldview very weird, I suspect.

No that's absolutely a standard assumption in logic :) Think of grammars; like Chomsky likes to say, human language "makes infinite use of finite means" (quoting Wilhelm von Humboldt). Chomsky of course believes that human language is the result of a simple set of rules, very much like logical theories. Personally, I have no idea, but Chomsky consistently and even today pisses off all the linguists and all the machine learning people, so he must be doing something right.

Btw, I'm not coming from the perspective of mathematical logic, only. It's complicated, but, e.g. my MSc was in data science and my PhD in a symbolic form of machine learning. See, learning and logic, or learning and reasoning, are not incompatible, they're fundamentally the same.

adwOP2y ago

> They haven't (eaten the world etc).

To clarify what I mean on this specific bit: the SOTA results in 2D and 3D vision, audio, translation, NLP, etc are all transformers. Past results do not necessarily predict future performance, and it would be absurd to claim that an immutable state of affairs, but it's certainly interesting that all of the domain-specific architectures have been flattened in a very short period of time.

1 more reply

j / k navigate · click thread line to collapse

0 comments

4 comments · 1 top-level

YeGoblynQueenne2y ago· 3 in thread

adwOP2y ago

> much more inefficient

Someone who comes at things from the perspective of mathematical logic is going to find that worldview very weird, I suspect.

YeGoblynQueenne2y ago

>> Someone who comes at things from the perspective of mathematical logic is going to find that worldview very weird, I suspect.

adwOP2y ago

> They haven't (eaten the world etc).

1 more reply

j / k navigate · click thread line to collapse