That attention heads are mandatory for transformers is a tautology (without them a transformer is just an MLP…), so of course that statement is going to be correct, by definition.
But when you move the goalposts to land on a tautology, you've surrendered your ability to argue anything and you're just ridiculing yourself. Take this question of yours, for instance:
> If you know of any models that have had success (even at the GPT-2 level) without Self-Attention, I'd be interested to know what they are, because I don't know of any.
That is a legitimate, non-ridiculous question.
But if you replace it with your later, much weaker argument:
> > If you know of any MLPs that have had success (even at the GPT-2 level), I'd be interested to know what they are, because I don't know of any.
Then it becomes a dumb question, given that MLPs have no mechanism for encoding context and can't process sequences of words in the first place.
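To make that point concrete, here's a minimal sketch (toy dimensions, random weights, PyTorch purely for illustration, nothing here is anyone's actual model): a per-token MLP's output for token i depends only on token i's embedding, while self-attention lets every output depend on the whole sequence.

```python
import torch

torch.manual_seed(0)
seq_len, d_model = 4, 8
x = torch.randn(seq_len, d_model)  # one toy "sentence" of 4 token embeddings

# Per-token MLP: applied row by row, no communication between positions.
mlp = torch.nn.Sequential(
    torch.nn.Linear(d_model, d_model),
    torch.nn.ReLU(),
    torch.nn.Linear(d_model, d_model),
)

# Single-head self-attention: every output row is a weighted mix of all rows.
Wq, Wk, Wv = (torch.randn(d_model, d_model) for _ in range(3))

def self_attention(x):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = torch.softmax(q @ k.T / d_model**0.5, dim=-1)
    return scores @ v

# Perturb token 0 only, then check whether the other tokens' outputs move.
x2 = x.clone()
x2[0] += 1.0

print("MLP outputs for tokens 1..3 changed?",
      not torch.allclose(mlp(x)[1:], mlp(x2)[1:]))        # False: no context flows
print("Attention outputs for tokens 1..3 changed?",
      not torch.allclose(self_attention(x)[1:], self_attention(x2)[1:]))  # True
```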
So when you claim that this was your argument all along, it's particularly embarrassing: you're effectively saying your earlier arguments were just as dumb, when they weren't.
That's why I said you're disrespecting your own earlier arguments by retreating to your later tautology.