
thomashop 11mo ago

I have the impression that the thinking helps even if the actual content of the thinking output is nonsense. It gives the model more compute cycles to work on the problem.


6 comments · 1 top-level

wat10000 11mo ago

That would be strange. There's no hidden memory or data channel; the "thinking" output is all the model receives afterwards. If it's all nonsense, then nonsense is all it gets. I wouldn't be completely surprised if a context with a bunch of apparent nonsense still helped somehow (LLMs are weird), but it would be odd.

barrkel 11mo ago

This isn't quite right. Even when an LLM generates meaningless tokens, its internal state continues to evolve. Each new token triggers a fresh pass through the network, with attention over the KV cache, allowing the model to refine its contextual representation. The specific tokens may be gibberish, but the underlying computation can still reflect ongoing "thinking".

yorwba 11mo ago

Attention operates entirely on hidden memory, in the sense that it usually isn't exposed to the end user. An attention head on one thinking token can attend to one thing and the same attention head on the next thinking token can attend to something entirely different, and the next layer can combine the two values, maybe on the second thinking token, maybe much later. So even nonsense filler can create space for intermediate computation to happen.
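
To make that concrete, here is a toy sketch in plain Python. Everything in it is an assumption made for illustration: the three-dimensional embeddings in `EMBED`, the identity query/key/value projections, and the single causal layer are hypothetical and not taken from any real model. It only shows that the hidden state computed at a filler position depends on what came before it, even though the filler token itself is the same in both prompts.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query position."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d)]

def causal_layer(states):
    """Each position attends to itself and earlier positions, then adds a residual.

    Assumption: Q, K and V projections are the identity, to keep the toy small.
    """
    out = []
    for i, x in enumerate(states):
        ctx = attend(x, states[: i + 1], states[: i + 1])
        out.append([a + b for a, b in zip(x, ctx)])
    return out

# Hypothetical embeddings: two "content" tokens and one filler token.
EMBED = {"A": [1.0, 0.0, 0.0], "B": [0.0, 1.0, 0.0], ".": [0.0, 0.0, 1.0]}

def layer1_states(tokens):
    """Hidden states after one causal attention layer."""
    return causal_layer([EMBED[t] for t in tokens])

# Same filler token in the last position, different prompt before it.
filler_after_a = layer1_states(["A", ".", "."])[-1]
filler_after_b = layer1_states(["B", ".", "."])[-1]

print(filler_after_a)
print(filler_after_b)
```

The two printed vectors differ, so in a deeper model the next layer would see different keys and values at the filler positions depending on the prompt. Whether a trained model actually uses those positions for anything useful is a separate, empirical question.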

Wowfunhappy 11mo ago

Wasn't there some study showing that just telling the LLM to write a bunch of periods first improves responses?

krackers 11mo ago

There are several such papers; off the top of my head, one is https://arxiv.org/abs/2404.15758

It's a bit more subtle though: if I understand correctly, this only works for parallelizable problems. That makes intuitive sense, since the model cannot pass information along with each dot. In that sense, COT can be seen as some form of sampling, which also tracks with findings that COT doesn't boost the "raw intelligence" but rather uncovers latent intelligence, converting pass@k to maj@k. Antirez touches upon this in [1].

On the other hand, I think problems with serial dependencies require "real" COT, since the model needs to track the results of subproblems. There are also some studies that show a meta-structure to the COT itself, though. For example, if you look at DeepSeek, there are clear patterns of backtracking and such that are slightly more advanced than naive repeated sampling. https://arxiv.org/abs/2506.19143

[1] https://news.ycombinator.com/item?id=44288049
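
For anyone unfamiliar with the two metrics, here is a rough simulation sketch. The numbers are made up for illustration (a hypothetical model that gives the right answer 40% of the time and otherwise picks one of five wrong answers uniformly), and the function names are mine, not from any paper. pass@k counts a problem as solved if any of k samples is correct; maj@k counts it as solved only if the correct answer wins a strict plurality vote.

```python
import random
from collections import Counter

def sample_answer(rng, p_correct, n_wrong):
    """Return 0 for the correct answer, else one of n_wrong wrong answers (1..n_wrong)."""
    if rng.random() < p_correct:
        return 0
    return rng.randint(1, n_wrong)

def pass_at_k(samples):
    """Solved if at least one sample is correct."""
    return 0 in samples

def maj_at_k(samples):
    """Solved only if the correct answer is the unique most common one (ties fail)."""
    counts = Counter(samples)
    top = max(counts.values())
    winners = [answer for answer, count in counts.items() if count == top]
    return winners == [0]

def estimate(p_correct=0.4, n_wrong=5, k=15, trials=2000, seed=0):
    """Monte Carlo estimate of (pass@k, maj@k) under the assumed answer distribution."""
    rng = random.Random(seed)
    passed = voted = 0
    for _ in range(trials):
        samples = [sample_answer(rng, p_correct, n_wrong) for _ in range(k)]
        passed += pass_at_k(samples)
        voted += maj_at_k(samples)
    return passed / trials, voted / trials

pass_rate, maj_rate = estimate()
print(f"pass@15 ~ {pass_rate:.3f}, maj@15 ~ {maj_rate:.3f}")
```

Under these assumptions, maj@k lands well above the 40% single-sample accuracy but never above pass@k, since a majority vote can only succeed when at least one sample is correct. That gap is the thing the "converting pass@k to maj@k" framing is about.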


mathiaspoint 11mo ago

Eh. The embeddings themselves could act like hidden layer activations and encode some useful information.
