Self-reasoning tokens: teaching models to think ahead

(reasoning-tokens.ghost.io)

153 points · fesens · 2y ago · 30 comments

jacobsimon · 2y ago · 5 in thread

I’ve tried similar experiments before by asking the LLM to generate “internal” and “external” dialog, which I think is sort of the same idea at a higher level, and might be preferable because it would allow for easy introspection vs. a new set of tokens? I’m not enough of an expert to understand whether this proposal is intended more for training or inference.

sdenton4 · 2y ago

This method is for training. They are using a stop-gradient to 'shield' some tokens from contributing to prediction of the immediate next token, and thus producing a stream of tokens that are only used for longer term prediction.

This is a bit more low-level than the usual prompt engineering approaches, and to my mind, a bit more promising. There are more easily measurable results, and I've seen other contexts where a well-placed stop-gradient does wonders...

lucidrains · 2y ago

yes, it is a stop gradient mask on the attention matrix, iiuc. worth trying

lucidrains · 2y ago

could even try it with a fraction of the attention heads, instead of introducing new tokens

fesens (OP) · 2y ago

The main advantage of using a new and constant token for reasoning is that, while we would pay the full price during training, in the inference phase we could do most, if not all, of the "reasoning" in one shot, without having to feed one generated token at a time.

jacobsimon · 2y ago

Cool!

XenophileJKO · 2y ago · 4 in thread

I have definately and frustratingly seen GPT3.5-Turbo do a bunch of anticipation in the outputs.

Basically it will create pre-conditions so that the final output aligns to some bias. In my specific case it was the bias to provide an answer to a question. This is sometimes noticeable in chain-of-thought intermediate outputs. I ended up having to create some space between the entangled decisions in the chain-of-thought output.

PeterisP · 2y ago

> In my specific case it was the bias to provide an answer to a question

That seems to be a reasonably expected result of the "instruction post-training" finetuning with RLHF or otherwise. If for some reason you don't want this behavior, you can avoid this by using a model version that just has the core language modeling without that finetuning, e.g. the llama models have such a version available.

XenophileJKO · 2y ago

Well, in this specific case, the logic I was asking the model to follow was (highly paraphrased):

1. Inventory the retrieved items.

2. Determine their relevance.

3. Pick the most relevant or if none of the retrieved items is relevant return an alternative message.

What the model will do is add new items into (1) if none of the retrieved items are relevant. If you add some steps between 1 and 2.. it stops doing that.

alt0_ · 2y ago

> definately

relevant xkcd: https://xkcd.com/2871/

XenophileJKO · 2y ago

If I wanted it spelled correctly I would have run it through the LLM.

wrsh07 · 2y ago · 3 in thread

Ok so my understanding: you can have the network generate a token that can be used as input to future token generation along with each output token it generates

These are called reasoning tokens

Initial results with gpt2 are promising

You can generalize this to let the network decide when to generate reasoning tokens (I'm unclear on how). There were also multiple lines in the loss graph with reasoning tokens that I don't quite understand (what's reasoning 1 vs 3? Is it the ratio of reasoning tokens? Something else?)

fesens (OP) · 2y ago

Reasoning 1 vs. 3 is the number of reasoning tokens between each "text" token. The 1 reasoning token is exactly what you see in the picture explanation in the article.
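To make the "Reasoning 1 vs. 3" layout concrete, here is a minimal sketch that places a fixed number of reasoning tokens between consecutive text tokens. The `<reasoning>` placeholder string, the function name `interleave_reasoning`, and the boolean mask marking text positions are all assumptions for illustration; none of them come from the article's code.

```python
REASONING = "<reasoning>"  # hypothetical placeholder for the constant reasoning token


def interleave_reasoning(text_tokens, n_reasoning):
    """Place n_reasoning reasoning tokens between consecutive text tokens.

    Returns (sequence, is_text), where is_text[i] is True for original text
    tokens and False for inserted reasoning tokens.
    """
    sequence, is_text = [], []
    last = len(text_tokens) - 1
    for i, tok in enumerate(text_tokens):
        sequence.append(tok)
        is_text.append(True)
        if i < last:  # only *between* text tokens, not after the final one
            sequence.extend([REASONING] * n_reasoning)
            is_text.extend([False] * n_reasoning)
    return sequence, is_text


seq, mask = interleave_reasoning(["3", "+", "2"], n_reasoning=1)
print(seq)   # ['3', '<reasoning>', '+', '<reasoning>', '2']
print(mask)  # [True, False, True, False, True]
```

A mask like `is_text` is one plausible way to tell a training loop which positions are ordinary text, but how the article actually routes gradients is not shown here.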

The generalization comes from making the network predict a <"start reasoning token"> and end the sequence only when it predicts a <"end reasoning token">. The training dataset for the upcoming experiment contains examples like: """ Q: What is 3+2? A: 3 + 2 is equal to <start reasoning> <reasoning> ... <reasoning> <end reasoning> 5 """
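For the delimited format in the example above, a rough sketch of how one might post-process a generated sequence: drop everything between the start and end markers and record how long each reasoning span was. The delimiter strings are copied from the example; the function `split_reasoning` and the list-of-strings token representation are assumptions, not the author's pipeline.

```python
START, END = "<start reasoning>", "<end reasoning>"  # delimiters as written in the example


def split_reasoning(tokens):
    """Separate visible tokens from delimited reasoning spans.

    Returns (visible_tokens, span_lengths). An unclosed span is dropped and
    not counted; an END marker with no matching START is kept as visible text.
    """
    visible, span_lengths = [], []
    inside, current = False, 0
    for tok in tokens:
        if tok == START:
            inside, current = True, 0
        elif tok == END and inside:
            span_lengths.append(current)
            inside = False
        elif inside:
            current += 1
        else:
            visible.append(tok)
    return visible, span_lengths


generated = ["3", "+", "2", "is", "equal", "to",
             START, "<reasoning>", "<reasoning>", END, "5"]
visible, spans = split_reasoning(generated)
print(" ".join(visible))  # 3 + 2 is equal to 5
print(spans)              # [2]
```

Collecting `span_lengths` over many outputs would be one way to look at where the model chooses to spend reasoning tokens.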

wrsh07 · 2y ago

Wasting two tokens on start/end reasoning seems expensive to me (a priori)

I am curious what that would yield though - in some ways that would be the most fun to analyze (when does it think a lot??)

I would also be curious to see at what point you see diminishing returns from reasoning tokens (eg a 1:10 ratio? More?)

pizza · 2y ago

I'm just speculating here since I don't know what or where the code is, but since inference is still autoregressive:

given [a b c] sample [d]

distribution of [d] could be over [reasoning token] | [vocab token]

then at next step you have

[a b c d] and each has an embedding vector associated

so when you go to sample [e] it's a function of [a b c d]
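The speculation above can be sketched as an ordinary autoregressive loop in which the reasoning token is just one more entry in the sampled distribution. Everything here is a toy: `toy_next_token_probs` is a hypothetical stand-in for a model, and the `<reasoning>` string is a placeholder, so this only shows the control flow, not the article's implementation.

```python
import random

REASONING = "<reasoning>"  # hypothetical placeholder token


def toy_next_token_probs(context):
    """Hypothetical stand-in for a model: returns {token: probability}."""
    # Toy rule: after an ordinary token, a reasoning token is very likely;
    # after a reasoning token (or an empty context), emit ordinary text.
    if context and context[-1] != REASONING:
        return {REASONING: 0.9, "x": 0.1}
    return {"x": 0.8, "y": 0.2}


def generate(prompt, steps, next_token_probs=toy_next_token_probs, seed=0):
    """Sample `steps` tokens, each conditioned on the whole context so far."""
    rng = random.Random(seed)
    context = list(prompt)
    for _ in range(steps):
        probs = next_token_probs(context)  # distribution over vocab + reasoning token
        tokens, weights = zip(*probs.items())
        context.append(rng.choices(tokens, weights=weights, k=1)[0])
    return context


print(generate(["a", "b", "c"], steps=5))
```

Because each sampled token is appended to `context`, a reasoning token sampled as [d] is part of what [e] is conditioned on, exactly as in the [a b c d] description.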

earslap · 2y ago · 3 in thread

For the existing models, are beam-search-like methods hopeless due to combinatorial explosion? Are there no smart ways to improve them? Evaluating multiple futures will be slow, but if it means that the model can give vastly better output, it might be a worthwhile trade-off in some cases. I feel like our standard way of sampling the output of LLMs is a bit too simplistic, and my hunch is that it should be possible to get a lot more out of them even if it means losing speed.

HarHarVeryFunny · 2y ago

People are considering that sort of beam-search approach - this is what they call "tree of thoughts" - generate a branching tree of alternate continuations, then pick the best one based on some criteria.

This doesn't seem an ideal approach though, since it amounts to generating a bunch of shallow responses and picking the best, rather than the preferred thinking more deeply before generating. It's not the same as a computer chess program considering N-moves ahead where you are guaranteed that one of those move sequences really is the best one (as long as you don't accidentally prune it out). In contrast, if you generate all possible "shallow" N-token responses (bunch of monkeys gibbering), there is no guarantee any of those will be the high quality response you are hoping for.

Really planning ahead - reasoning deeply before speaking - would seem harder to implement though, since it'd involve applying a variable number of reasoning steps (maybe looping), then determining when to stop. This also seems different from the proposed insertion of "reasoning tokens" since those are shallow reasoning steps (normal single pass through transformer's layers), when it seems what is really needed is more depth of reasoning ("more layers"), perhaps coupled with some working memory/tokens. Both schemes (more tokens vs more depth) are also related to the wish to use a variable amount of compute for different tasks/inputs - less compute for simple tasks, more for hard ones.
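For readers who want the beam-search idea from this exchange in concrete form, here is a minimal token-level sketch. It is plain beam search over a hand-written toy distribution, not tree of thoughts and not anything from the article; `toy_next_log_probs` and its table are invented so that a width-1 beam (greedy) prunes the better continuation while a width-2 beam keeps it.

```python
import math


def beam_search(start, next_log_probs, beam_width, depth):
    """Keep the beam_width highest-scoring partial sequences at each step.

    next_log_probs(seq) must return {token: log_probability}.
    Returns a list of (score, sequence) pairs, best first.
    """
    beams = [(0.0, list(start))]
    for _ in range(depth):
        candidates = []
        for score, seq in beams:
            for tok, lp in next_log_probs(seq).items():
                candidates.append((score + lp, seq + [tok]))
        if not candidates:  # nothing left to expand
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]  # everything else is pruned
    return beams


# Toy distribution (invented): "a" looks best first, but "b" leads to a better pair.
TABLE = {
    None: {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "w": 0.1},
}


def toy_next_log_probs(seq):
    probs = TABLE.get(seq[-1] if seq else None, {})
    return {tok: math.log(p) for tok, p in probs.items()}


print(beam_search([], toy_next_log_probs, beam_width=1, depth=2)[0][1])  # starts with 'a'
print(beam_search([], toy_next_log_probs, beam_width=2, depth=2)[0][1])  # ['b', 'z']
```

The width-1 run is the "accidentally prune it out" case mentioned above: the best two-token sequence is discarded after the first step because its prefix scored lower.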

earslap · 2y ago

Ah yes, I totally agree. I was considering the method as a stopgap solution (especially because it does not require retraining or any other special tricks) until researchers figure out "planning" in a broader sense. It is very inefficient otherwise, but in the meantime, is simple sampling from the output softmax, with a couple of parameters to tune, the best we can do? Is there no low-hanging fruit there?

HarHarVeryFunny · 2y ago

I suppose the closest alternative to planning ahead (considering alternatives before taking any action - in this case generating tokens) is getting it right the first time, which is only really possible in cases of highly constrained circumstances (prompts) where the model saw enough similar examples to predict the same correct/preferred response. So, to that extent, I suppose better prediction - bigger model, more/better training, etc, reduces the need for planning a bit. Architectural changes, such as adding working memory, that boost predictive power, would also help.

But, yeah, hard to see too many alternatives.

1) Get it right first time (not always possible)

2) Don't plan, but at least consider a bunch of poor alternatives - tree of thoughts

3) Actually implement planning

wantsanagent · 2y ago · 2 in thread

"The second token, however, duplicates the input of the first one and does not receive a gradient "answer" from the very next token, only from future tokens; ..."

This formulation doesn't make a lot of sense to me.

I get the motivation here but what you're trying to implement is a working memory.

Because transformers have perfect retrospective memory within their context window, any generation which can be done directly from input tokens will be.

At any given point a model might want to write to a working memory, but that does not imply that the next non-working-memory-step will supply useful information to better write to working memory in the future. The model also has to be able to decide when to compare the work done in working memory to the next token.

By allowing the model to both exempt output from gradient updates and opt back in to gradient updates, you create a meta-learning loop that could be quite flexible.

sdwr · 2y ago

As I understand it, this isn't trying to implement actual memory in the form of a cache, but instead some kind of wishy-washy memory-lite.

I'm talking out of my ass here, but I feel like real memory shouldn't be that hard to implement on top of ChatGPT. Just run it twice per query, the first time as an internal query that fetches from a memory store.

The budgeting part would be interesting. How many tokens of the main query do you want to fill with memories? And it wouldn't be able to meta-learn how to use the system better; you'd have to update the prompt.
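A rough sketch of the two-pass idea with a token budget, under loud assumptions: the first pass here is a simple keyword-overlap lookup (`keyword_retrieve`) standing in for the internal model query, tokens are counted by whitespace splitting, and the prompt template is invented. No real ChatGPT API is called.

```python
def keyword_retrieve(query, memory_store):
    """Hypothetical first pass: rank stored memories by word overlap with the query."""
    words = set(query.lower().split())
    scored = [(len(words & set(m.lower().split())), m) for m in memory_store]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [m for score, m in ranked if score > 0]


def build_prompt(query, memory_store, token_budget, retrieve=keyword_retrieve):
    """Second pass input: the query plus as many memories as the budget allows."""
    picked, used = [], 0
    for memory in retrieve(query, memory_store):
        cost = len(memory.split())  # crude whitespace token count (assumption)
        if used + cost > token_budget:
            continue  # too long for the remaining budget; a shorter one may still fit
        picked.append(memory)
        used += cost
    context = "\n".join(f"- {m}" for m in picked)
    return f"Relevant memories:\n{context}\n\nQuestion: {query}"


store = [
    "the cat is named Tom",
    "the dog is named Rex and likes long walks",
    "paris is in france",
]
print(build_prompt("what is the cat named", store, token_budget=5))
```

The `token_budget` argument is the knob the comment asks about; nothing in this sketch learns how to set it, which matches the point that the prompt would have to be updated by hand.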

refulgentis · 2y ago

I'll call it "not even wrong" :P here, they're putting it in the model, you're describing a common bit of working with LLMs across memory / RAG / etc.

benreesman · 2y ago · 2 in thread

With a little engineering rigor we could do a push-down automata with semantics Girards-Reynolds constrained around polymorphism.

rullelito · 2y ago

Utilizing Girard-Reynolds constraints on a polymorphic push-down automata fundamentally misconstrues both computational topology and dynamic system semantics..

benreesman · 2y ago

The ability to checkpoint the precise state of an agent interaction, bound it above by that context, evaluate within the context is trivially useful. It’s a bit of poetic license maybe to call that a push down automata. What I mean by the analogy is that systems ranging from BERTopic to the Humanify JS de-minifier employ large, positive-temp LLM-style models in bounded ways, for better outcomes, in conjunction with deterministic techniques. In fact, in the case of Humanify, push-down automata are trivially involved via the PLT in the conventional de-compilation.

Girard-Reynolds is again a bit of an analogy though not without plausible concrete application: even post-softmax in a GPT there are useful types one can imagine as being amenable to parametric polymorphism and therefore dictating their implementations.

If your comment is roughly: “that’s not literally sound as stated”, point conceded, it’s a one-sentence allusion to real rigor, not real rigor itself.

Do you find any of that controversial?

exploringBytes · 2y ago · 1 in thread

First association was to extend the modality of text tokens to concept tokens which could be (logical) relationships. Are you aware of similar works?

pizza · 2y ago

Not exactly the same game but you might be interested in Mathematical Structure of Syntactic Merge, Marcolli, Chomsky, Berwick (2023).

When we speak we give a string. When we think we don't have to use a string. But we do have to have a functionality to map something that has no single ordering to something that has an ordering (externalization) - a sentence. And vice versa we have a functionality to turn strings into things without a specific ordering (internalization) - thoughts.

radarsat1 · 2y ago

Seems very similar to the Think Before You Speak paper.

https://arxiv.org/abs/2310.02226
