Transformers know more than they can tell: Learning the Collatz sequence

(arxiv.org)

129 points · Xcelerate · 5mo ago · 45 comments

45 comments

jebarker5mo ago

This is an interesting paper and I like this kind of mechanistic interpretability work, but I cannot figure out how the paper title "Transformers know more than they can tell" relates to the actual content. In this case, what is it that they know and can't tell?

godelski5mo ago

I believe it's a reference to the paper "Language Models (Mostly) Know What They Know".

There's definitely some link, but I'd need to give this paper a good read and refresh on the other to see how strong it is. But I think your final sentence strengthens my suspicion.

https://arxiv.org/abs/2207.05221

rikimaru03455mo ago

Ok, I've read the paper and now I wonder, why did they stop at the most interesting part?

They did all that work to figure out that learning "base conversion" is the difficult thing for transformers. Great! But then why not take that last remaining step to investigate why that specifically is hard for transformers? And how to modify the transformer architecture so that this becomes less hard / more natural / "intuitive" for the network to learn?

embedding-shape5mo ago

Why release one paper when you can release two? Easier to get citations if you spread your efforts, and if you're lucky, someone needs to reference both of them.

A more serious answer might be that it was simply out of scope of what they set out to do, and they didn't want to fall for scope-creep, which is easier said than done.

Y_Y5mo ago

For interest, this popular pastime goes by several delicious names: https://en.wikipedia.org/wiki/Least_publishable_unit

kkylin5mo ago

:-)

I don't question that this decision is sometimes (often) driven by the need to increase publication count. (Which, in turn, happens because people find it easier to count papers than read them.) But there is a counterpoint here, which is that if you write, say, a 50-pager (not super common but also not unusual in my area, applied math) and spread several interesting results throughout, odds are good many things in the middle will never see the light of day. Of course one can organize the paper in a way to try to mitigate the effects of this, but sometimes it is better and cleaner to break a long paper into shorter pieces that people can actually digest.

senkora5mo ago

Relevant SMBC: https://www.smbc-comics.com/index.php?db=comics&id=1624

fcharton5mo ago

Author here. The paper is about the Collatz sequence, how experiments with a transformer can point at interesting facts about a complex mathematical phenomenon, and how, in supervised math transformers, model predictions and errors can be explained (this part is a follow-up to a similar paper about GCD). From an ML research perspective, the interesting (and surprising) takeaway is the particular way the long Collatz function is learned: "one loop at a time".

To me, the base conversion is a side quest. We just wanted to rule out this explanation for the model behavior. It may be worth further investigation, but it won't be by us. Another (less important) reason is paper length: if you want to submit to peer-reviewed outlets, you need to keep the page count under a certain limit.

godelski5mo ago

I'm curious about 2 things.

1) Why did you not test the standard Collatz sequence? I would think that including that, as well as testing on Z+, Z+\2Z, and 2Z+, would be a bit more informative (in addition to what you've already done). Even though there's the trivial step it could inform how much memorization the network is doing. You do notice the model learns some shortcuts so I think these could help confirm that and diagnose some of the issues.

2) Is there a specific reason for the cross attention?

Regardless, I think it is an interesting paper (these wouldn't be criteria for rejection were I reviewing your paper btw lol. I'm just curious about your thoughts here and trying to understand better)

FWIW I think the side quest is actually pretty informative here, though I agree it isn't the main point.

observationist5mo ago

It might be a side quest, or it could be an elegant way to frame a category of problems that are resistant to the ways in which transformers can learn; in turn, by solving that structural deficiency in order to enable a model to effectively learn that category of problems, you might empower a new leap in capabilities and power.

We're a handful of breakthroughs before models reach superhuman levels across any and all domains of cognition. It's clear that current architectures aren't going to be the end-all solution, but all we need might simply be a handful of well-posed categorical deficiencies that allow a smooth transition past the current jagged frontiers.

jacquesm5mo ago

> We're a handful of breakthroughs before models reach superhuman levels across any and all domains of cognition.

That's a pretty bold claim to make.

fiveMoreCents5mo ago

cuz you don't sell nonsense in one piece. it used to be "repeat a lie often enough" ... now lies are split into pieces ...

you'll see more of all that in the next few years.

but if you wanna stay in awe, at your age and further down the road, don't ask questions like you just asked.

be patient and lean into the split.

brains/minds have been FUBARed. all that remains is buying into the fake, all the way down to faking it when your own children get swooped into it all.

"transformers" "know" and "tell" ... and people's favorite cartoon characters will soon run hedge funds but the rest of the world won't get their piece ... this has all gone too far and to shit for no reason.

Onavo5mo ago

Interesting, what about the old proof that neural networks can't model arbitrary length sine waves?

ChadNauseam5mo ago

I don't know that computers can model arbitrary length sine waves either. At least not in the sense of me being able to input any `x` and get `sin(x)` back out. All computers have finite memory, meaning they can only represent a finite number of numbers, so there is some number `x` above which they can't represent any number.

Neural networks are more limited of course, because there's no way to expand their equivalent of memory, while it's easy to expand a computer's memory.
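
A minimal sketch of the finite-representation point, assuming the "computer" in question uses IEEE 754 double-precision floats (as CPython's `float` does) and Python 3.9+ for `math.ulp`. It only illustrates the argument above and is not taken from the linked paper:

```python
import math

# Assumption: floats are IEEE 754 doubles with a 53-bit significand.
big = 2.0 ** 53

# Above 2**53 not every integer is representable: adding 1 rounds
# straight back to the same float.
print(big + 1 == big)

# Near 1e17 the gap between neighbouring floats exceeds one full
# period of the sine wave, so the input itself cannot pin down the phase.
gap = math.ulp(1e17)
print(gap, gap > 2 * math.pi)
```

Arbitrary-precision libraries push this limit out, but only as far as available memory allows, which is the point being made.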

Onavo5mo ago

Here's the paper for your interest

https://arxiv.org/abs/2006.08195

kirubakaran5mo ago

That proof only applies to fixed architecture feed forward multilayer perceptrons with no recurrence, iirc. Transformers are not that.

niek_pas5mo ago

Can someone ELI5 this for a non-mathematician?

robot-wrangler5mo ago

I'll take a shot at it. Using Collatz as the specific target for investigating the underlying concepts here seems like a big red herring that's going to generate lots of confused takes. (I guess it was done partly to have access to tons of precomputed training data and partly to generate buzz. The title also seems kind of poorly chosen and/or misleading.)

Really the paper is about mechanistic interpretation and a few results that are maybe surprising. First, the input representation details (base) matter a lot. This is perhaps very disappointing if you liked the idea of "let the models work out the details, they see through the surface features to the very core of things". Second, learning was bursty, with discrete steps, not smooth improvement. This may or may not be surprising or disappointing; it depends how well you think you can predict the stepping.
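
To make "input representation (base)" concrete, here is a small sketch. It assumes the model sees each integer as a sequence of digit tokens in some chosen base; `to_base` is a hypothetical helper written for illustration, not code from the paper:

```python
def to_base(n: int, base: int) -> list[int]:
    """Digits of a non-negative integer in the given base, most significant first."""
    if n < 0 or base < 2:
        raise ValueError("need n >= 0 and base >= 2")
    if n == 0:
        return [0]
    digits = []
    while n > 0:
        n, remainder = divmod(n, base)
        digits.append(remainder)
    return digits[::-1]


# The same number becomes a very different token sequence in each base.
for base in (2, 3, 24):
    print(base, to_base(27, base))
```

The same integer yields sequences of different length and digit alphabet, so the surface pattern the model has to pick up on changes with the base even though the underlying function does not.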

esafak5mo ago

The model partially solves the problem but fails to learn the correct loop length:

> An investigation of model errors (Section 5) reveals that, whereas large language models commonly “hallucinate” random solutions, our models fail in principled ways. In almost all cases, the models perform the correct calculations for the long Collatz step, but use the wrong loop lengths, by setting them to the longest loop lengths they have learned so far.

The article is saying the model struggles to learn a particular integer function. https://en.wikipedia.org/wiki/Collatz_conjecture
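
For reference, a sketch of the function under discussion. `collatz_step` is the standard map from the Wikipedia article. `long_step` is an assumption about what the "long Collatz step" means here (an odd-to-odd shortcut that also reports its two loop lengths); the thread does not define it, so treat the names and the exact formulation as illustrative:

```python
def collatz_step(n: int) -> int:
    """Standard Collatz map: halve even numbers, send odd n to 3n + 1."""
    return n // 2 if n % 2 == 0 else 3 * n + 1


def long_step(n: int) -> tuple[int, int, int]:
    """Assumed 'long' step for odd n.

    Applies (3n + 1) / 2 while the value stays odd (k times), then halves
    until the value is odd again (k2 times). Returns (result, k, k2).
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    k = 0
    while n % 2 == 1:
        n = (3 * n + 1) // 2
        k += 1
    k2 = 0
    while n % 2 == 0:
        n //= 2
        k2 += 1
    return n, k, k2


print(long_step(7))   # (13, 3, 1)
print(long_step(27))  # (31, 2, 1)
```

Under this reading, "the wrong loop lengths" in the quoted passage would correspond to getting `k` or `k2` wrong while doing the arithmetic for the assumed lengths correctly.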

spuz5mo ago

That's a bit of an uncharitable summary. In bases 8, 12, 16, 24 and 32 their model achieved 99.7% accuracy. They would never expect it to achieve 100% accuracy. It would be like if you trained a model to predict whether or not a given number is prime. A model that was 100% accurate would defy mathematical knowledge, but a model that was 99.7% accurate would certainly be impressive.

In this case, they prove that the model works by categorising inputs into a number of binary classes which just happen to be very good predictors for this otherwise random seeming sequence. I don't know whether or not some of these binary classes are new to mathematics but either way, their technique does show that transformer models can be helpful in uncovering mathematical patterns even in functions that are not continuous.

jacquesm5mo ago

A pocket calculator that would give the right numbers 99.7% of the time would be fairly useless. The lack of determinism is a problem and there is nothing 'uncharitable' about that interpretation. It is definitely impressive, but it is fundamentally broken, because when you start making chains of things that are 99.7% correct you end up with garbage after very few iterations. That's precisely why digital computers won out over analog ones, the fact that they are deterministic.
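
The compounding argument can be checked with a few lines of arithmetic. This sketch assumes each step is right with probability 0.997 independently of the others, which is a simplification (the paper's errors are reported as structured, not random):

```python
import math

p = 0.997  # per-step accuracy quoted in the thread

# Probability that a whole chain of independent steps is correct.
for steps in (10, 100, 1000):
    print(steps, round(p ** steps, 3))

# Smallest chain length at which the chain is more likely wrong than right.
half_life = math.ceil(math.log(0.5) / math.log(p))
print(half_life)
```

Under that independence assumption a chain of 100 steps is fully correct roughly 74% of the time, a chain of 1000 steps only about 5% of the time, and the break-even point is a little over 230 steps.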
