This is a bit more low-level than the usual prompt engineering approaches and, to my mind, a bit more promising. The results are more easily measurable, and I've seen other contexts where a well-placed stop-gradient does wonders...
Basically it will create pre-conditions so that the final output aligns to some bias. In my specific case it was the bias to provide an answer to a question. This is sometimes noticeable in chain-of-thought intermediate outputs. I ended up having to create some space between the entangled decisions in the chain-of-thought output.
That seems to be a reasonably expected result of the "instruction post-training" finetuning with RLHF or otherwise. If for some reason you don't want this behavior, you can avoid this by using a model version that just has the core language modeling without that finetuning, e.g. the llama models have such a version available.
1. Inventory the retrieved items.
2. Determine their relevance.
3. Pick the most relevant, or if none of the retrieved items is relevant, return an alternative message.
What the model will do is add new items into (1) if none of the retrieved items are relevant. If you add some steps between 1 and 2, it stops doing that.
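For what it's worth, here is a rough sketch of how the "space between steps" idea could be wired into a prompt builder. The function name `build_prompt`, the `spacer_steps` parameter, and the wording of the spacer instructions are all made up for illustration; the thread doesn't say which intermediate steps were actually used.

```python
def build_prompt(question, retrieved, spacer_steps=()):
    """Assemble a numbered chain-of-thought prompt.

    spacer_steps are hypothetical extra instructions placed between the
    inventory step and the relevance step, to separate the two decisions.
    """
    steps = ["Inventory the retrieved items."]
    steps.extend(spacer_steps)
    steps.append("Determine their relevance.")
    steps.append(
        "Pick the most relevant, or if none of the retrieved items "
        "is relevant, return an alternative message."
    )
    lines = [f"Question: {question}", "Retrieved items:"]
    lines += [f"- {item}" for item in retrieved]
    lines.append("Steps:")
    lines += [f"{i}. {step}" for i, step in enumerate(steps, start=1)]
    return "\n".join(lines)


prompt = build_prompt(
    "How do I reset my password?",
    ["Billing FAQ", "Password reset guide"],
    spacer_steps=[
        # Illustrative wording only; not taken from the thread.
        "Restate the question in your own words.",
        "Confirm the inventory contains only the items listed above.",
    ],
)
print(prompt)
```

Whether any particular spacer wording helps is something you'd have to measure on your own retrieval set.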
relevant xkcd: https://xkcd.com/2871/
These are called reasoning tokens.
Initial results with GPT-2 are promising.
You can generalize this to let the network decide when to generate reasoning tokens (I'm unclear on how). There were also multiple lines in the loss graph with reasoning tokens that I don't quite understand (what's reasoning 1 vs 3? Is it the ratio of reasoning tokens? Something else?)
The generalization comes from making the network predict a <"start reasoning token"> and end the sequence only when it predicts an <"end reasoning token">. The training dataset for the upcoming experiment contains examples like:
"""
Q: What is 3+2?
A: 3 + 2 is equal to <start reasoning> <reasoning> ... <reasoning> <end reasoning> 5
"""
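A minimal sketch of how examples in that shape could be built, using whitespace "tokens" for readability. The helper names and the `n_reasoning` parameter are hypothetical. The loss mask is purely my assumption about one way to do it (filler reasoning tokens ignored as prediction targets, start/end markers and ordinary tokens still trained); the thread doesn't confirm that this is how the experiment handles the loss.

```python
START, THINK, END = "<start reasoning>", "<reasoning>", "<end reasoning>"


def make_example(prefix_tokens, answer_tokens, n_reasoning):
    """Insert a block of reasoning tokens between the prefix and the answer."""
    return (
        list(prefix_tokens)
        + [START]
        + [THINK] * n_reasoning
        + [END]
        + list(answer_tokens)
    )


def loss_mask(tokens):
    """ASSUMPTION: 1 = token counts as a training target, 0 = ignored.

    Only the filler <reasoning> tokens are masked out here; the start/end
    markers stay in the loss so the network can learn when to emit them.
    """
    return [0 if tok == THINK else 1 for tok in tokens]


prefix = "Q: What is 3+2? A: 3 + 2 is equal to".split()
example = make_example(prefix, ["5"], n_reasoning=3)
mask = loss_mask(example)
print(example)
print(mask)
```

Varying `n_reasoning` is also the obvious knob for a ratio experiment, since the ratio of reasoning tokens to ordinary tokens falls straight out of the example length.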
I am curious what that would yield though - in some ways that would be the most fun to analyze (when does it think a lot??)
I would also be curious to see at what point you see diminishing returns from reasoning tokens (e.g., a 1:10 ratio? More?)
given [a b c] sample [d]
distribution of [d] could be over [reasoning token] | [vocab token]
then at next step you have
[a b c d] and each has an embedding vector associated
so when you go to sample [e] it's a function of [a b c d]
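Spelled out as a toy loop, in case it helps. There is no real model here: `sample_next` is a hypothetical stand-in that picks a reasoning token with a fixed probability and otherwise a uniform vocab token, where a real network would produce the distribution from the full context. The point is only the control flow: every sampled token, reasoning or not, is appended to the context the next step conditions on, and the reasoning tokens are filtered out of what the user sees.

```python
import random

REASONING = "<reasoning>"


def sample_next(context, vocab, p_reasoning, rng):
    """Toy stand-in for a model's next-token distribution.

    A real model would compute probabilities from `context`; this
    placeholder ignores it and uses a fixed p_reasoning instead.
    """
    if rng.random() < p_reasoning:
        return REASONING
    return rng.choice(vocab)


def generate(context, vocab, steps, p_reasoning=0.3, seed=0):
    rng = random.Random(seed)
    seq = list(context)
    for _ in range(steps):
        # The new token joins the context, so later samples "see" it.
        seq.append(sample_next(seq, vocab, p_reasoning, rng))
    # Reasoning tokens stay in the context but are hidden from the output.
    visible = [tok for tok in seq if tok != REASONING]
    return seq, visible


full, shown = generate(["a", "b", "c"], ["d", "e"], steps=5)
print(full)
print(shown)
```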
This doesn't seem an ideal approach though, since it amounts to generating a bunch of shallow responses and picking the best, rather than the preferred approach of thinking more deeply before generating. It's not the same as a computer chess program considering N moves ahead, where you are guaranteed that one of those move sequences really is the best one (as long as you don't accidentally prune it out). In contrast, if you generate all possible "shallow" N-token responses (bunch of monkeys gibbering), there is no guarantee any of those will be the high quality response you are hoping for.
Really planning ahead - reasoning deeply before speaking - would seem harder to implement though, since it'd involve applying a variable number of reasoning steps (maybe looping), then determining when to stop. This also seems different from the proposed insertion of "reasoning tokens" since those are shallow reasoning steps (normal single pass through transformer's layers), when it seems what is really needed is more depth of reasoning ("more layers"), perhaps coupled with some working memory/tokens. Both schemes (more tokens vs more depth) are also related to the wish to use a variable amount of compute for different tasks/inputs - less compute for simple tasks, more for hard ones.
But, yeah, hard to see too many alternatives.
1) Get it right first time (not always possible)
2) Don't plan, but at least consider a bunch of poor alternatives - tree of thoughts
3) Actually implement planning
This formulation doesn't make a lot of sense to me.
I get the motivation here but what you're trying to implement is a working memory.
Because transformers have perfect retrospective memory within their context window, any generation which can be done directly from input tokens will be.
At any given point a model might want to write to a working memory, but that does not imply that the next non-working-memory-step will supply useful information to better write to working memory in the future. The model also has to be able to decide when to compare the work done in working memory to the next token.
By allowing the model to both exempt output from gradient updates and opt back in to gradient updates, you create a meta-learning loop that could be quite flexible.
I'm talking out of my ass here, but I feel like real memory shouldn't be that hard to implement on top of ChatGPT. Just run it twice per query, the first time as an internal query that fetches from a memory store.
The budgeting part would be interesting. How many tokens of the main query do you want to fill with memories? And it wouldn't be able to meta-learn how to use the system better; you'd have to update the prompt.
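Very roughly, the budgeting could look like the sketch below. Everything here is hypothetical: the memory store is a plain list of strings, the "first pass" is swapped for simple word overlap instead of an actual model call, and tokens are counted as whitespace-separated words. It only shows the shape of the idea, which is to fetch, trim to a budget, then prepend to the main query.

```python
def fetch_memories(query, store, budget_tokens):
    """Stand-in for the internal first pass: rank notes by word overlap
    with the query and keep those that fit within the token budget."""
    q_words = set(query.lower().split())

    def overlap(note):
        return len(q_words & set(note.lower().split()))

    picked, used = [], 0
    for note in sorted(store, key=overlap, reverse=True):
        cost = len(note.split())  # crude token count: whitespace words
        if overlap(note) > 0 and used + cost <= budget_tokens:
            picked.append(note)
            used += cost
    return picked


def build_main_prompt(query, store, budget_tokens=20):
    """Second pass input: the selected memories followed by the user query."""
    memories = fetch_memories(query, store, budget_tokens)
    if not memories:
        return f"User: {query}"
    header = "\n".join(f"Memory: {m}" for m in memories)
    return f"{header}\n\nUser: {query}"


store = [
    "User's dog is named Rex",
    "User prefers metric units",
    "User lives in Oslo",
]
print(build_main_prompt("what is my dog called", store, budget_tokens=6))
```

`budget_tokens` is the knob the question above is about: set it too low and nothing relevant fits, too high and the memories crowd out the actual query.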
Girard-Reynolds is again a bit of an analogy though not without plausible concrete application: even post-softmax in a GPT there are useful types one can imagine as being amenable to parametric polymorphism and therefore dictating their implementations.
If your comment is roughly: “that’s not literally sound as stated”, point conceded, it’s a one-sentence allusion to real rigor, not real rigor itself.
Do you find any of that controversial?
When we speak we give a string. When we think we don't have to use a string. But we do have to have a functionality to map something that has no single ordering to something that has an ordering (externalization) - a sentence. And vice versa we have a functionality to turn strings into things without a specific ordering (internalization) - thoughts.