https://www.anthropic.com/research/tracing-thoughts-language...
> Instead, we found that Claude plans ahead. Before starting the second line, it began "thinking" of potential on-topic words that would rhyme with "grab it". Then, with these plans in mind, it writes a line to end with the planned word.
At least in my view it's still inherently a next-token predictor, just with a really good grasp of conditional probabilities.
It shows that we computer scientists think of ourselves as experts on anything, even though biological machines are well outside our expertise.
We should stop repeating things we don't understand.
All that means is that treating something as a black box doesn't tell you anything about what's inside the box.
Are we just now rediscovering hundred-year-old philosophy in CS?
My guess is that they have Claude generate a set of candidate outputs and then Claude chooses the "best" candidate and returns that. I agree this improves the usefulness of the output but I don't think this is a fundamentally different thing from "guessing the next token".
UPDATE: I read the paper and I was being overly generous. It's still just guessing the next token as it always has. This "multi-hop reasoning" is really just another way of talking about the relationships between tokens.
Interpreting the relationship between words as "multi-hop reasoning" is more about changing the words we use to talk about things and less about fundamental changes in the way LLMs work. It's still doing the same thing it did two years ago (although much faster and better). It's guessing the next token.
For a very vacuous sense of "plan ahead", sure.
By that logic, a basic Markov chain with beam search plans ahead too.
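To make the Markov chain comparison concrete, here is a minimal sketch of beam search over a first-order chain. The transition table, the words in it, and the function name `beam_search` are all made up for illustration; nothing here comes from the Anthropic write-up. The only point is that keeping several partial sequences alive lets the search end on a sequence that a greedy next-word pick would have thrown away at the first step.

```python
import math

# Hypothetical toy transition table: P(next word | current word).
TRANSITIONS = {
    "<s>": {"he": 0.6, "she": 0.4},
    "he": {"saw": 0.55, "ate": 0.45},
    "she": {"saw": 0.9, "ate": 0.1},
    "saw": {"it": 1.0},
    "ate": {"it": 1.0},
}


def beam_search(transitions, start, steps, beam_width):
    """Return up to beam_width (log_prob, sequence) pairs, best first."""
    beams = [(0.0, [start])]
    for _ in range(steps):
        candidates = []
        for log_prob, seq in beams:
            next_words = transitions.get(seq[-1], {})
            if not next_words:
                # Dead end: carry the sequence forward unchanged.
                candidates.append((log_prob, seq))
                continue
            for word, p in next_words.items():
                candidates.append((log_prob + math.log(p), seq + [word]))
        # Keep only the highest-scoring partial sequences.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams


if __name__ == "__main__":
    for width in (1, 2):
        log_prob, seq = beam_search(TRANSITIONS, "<s>", steps=3, beam_width=width)[0]
        print(width, " ".join(seq[1:]), round(math.exp(log_prob), 3))
```

With `beam_width=1` this is plain greedy decoding and commits to "he" because it is the likelier first word; with `beam_width=2` the "she" branch survives long enough to win overall. Whether that counts as "planning" is exactly what's being argued about here.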