Does a motor vehicle get "sleep" when it is serviced? When I reboot a computer, is that equivalent to a nap?
> In animals, the transfer from short-term memory to long-term memory is thought to be supported by hippocampal replay [33], especially during sleep [41]; in this phase, short-term hippocampal memories are reactivated and consolidated into cortical synaptic weights. Sleep makes animals unable to respond to external stimuli, suggesting that it must provide enough cognitive benefit to justify this cost [41]. Inspired by these biological processes, we propose a method for transferring context-window memory into persistent weights. When the model’s context window becomes full during inference, the model enters a “sleep” in which it performs multiple forward passes over the accumulated context and recursively updates its fast weights via a learned local rule. As in animal sleep, the model receives no external input tokens during this phase. After consolidation, the context window is cleared, and the model resumes operation with updated fast weights. During training, the model is optimized end-to-end by backpropagating through the entire process to maximize task performance after sleep.
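For anyone who wants the quoted mechanism in concrete terms, here is a toy sketch of the inference-time loop only: fill the window, replay it a few times with no new input, update fast weights with a local rule, clear the window. Everything in it is an assumption for illustration. `TinyModel`, `sleep`, and `matvec` are made-up names, the update is a plain delta rule rather than the paper's learned rule, and the end-to-end training by backpropagation is not modeled at all.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def sleep(fast_w: Matrix, context: List[Vector],
          passes: int = 3, lr: float = 0.5) -> Matrix:
    """Replay the buffered context several times with no new input.

    Assumption: a delta rule (W += lr * outer(x - W x, x)) stands in
    for the learned local rule described in the quoted abstract.
    """
    for _ in range(passes):
        for x in context:
            y = matvec(fast_w, x)                        # forward pass
            err = [x_i - y_i for x_i, y_i in zip(x, y)]  # local error signal
            fast_w = [
                [w_ij + lr * e_i * x_j for w_ij, x_j in zip(row, x)]
                for row, e_i in zip(fast_w, err)
            ]
    return fast_w


class TinyModel:
    """Hypothetical model with a bounded context and one fast-weight matrix."""

    def __init__(self, dim: int, window: int) -> None:
        self.fast_w: Matrix = [[0.0] * dim for _ in range(dim)]
        self.context: List[Vector] = []
        self.window = window

    def step(self, x: Vector) -> None:
        self.context.append(x)
        if len(self.context) >= self.window:  # window full: go to sleep
            self.fast_w = sleep(self.fast_w, self.context)
            self.context = []                 # clear the window afterwards


model = TinyModel(dim=2, window=2)
model.step([1.0, 0.0])
model.step([0.0, 1.0])
recalled = matvec(model.fast_w, [1.0, 0.0])
```

After the two steps the window is empty again, and the fast weights alone partially reconstruct the first input, which is the "context moved into weights" idea in miniature.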
One thing we do know for certain is that it is necessary: it is needed in "dumb" animals as well as in you and me. If an animal can't sleep it will eventually die.
I don't think that applies to the activity described in the OP. Does their LLM "die" if it can't perform the function described?
I'm autistic and just went through massive changes that basically keep locking up my brain and I freeze.
I learned that sleeping helps, even if just 20 minutes. It helps that I can fall asleep "while awake". It's as if I relinquish control of responding to stimuli, which instantly brings so much rest to my mind. It is odd trying to move a limb and the brain basically responds with noop. But it works.
Afterwards I can generally make a decision and perform it.
So in a sense it seems similar to what you describe the model would have to do. I forget short term concerns that overwhelm and refocus on the long term goal.
I feel like it's confusing to reuse the word for a process that aims to deliberately change the state of the machine / process.
Here the analogy isn't without reason.
Also, even when something is "specific" to humans, it might not be anthropomorphizing to observe it in something else; it could just be an emergent pattern of high intelligence.
Second, explicitly avoiding things that sound like anthropomorphisation is equally unhelpful: why avoid a metaphor that works?
Third, it's really a pity that this pointless nitpicking is dominating the thread.
What is dominating the thread are claims that the LLM operation in question is analogous to the function of sleep in humans. It obviously is not.
The anthropomorphization of LLMs has reached ridiculous proportions. Applying the same standards as used in this field to others would result in claims that laundry machines "hallucinated" that they had had sufficient water when they failed due to the faucet being turned off.
I agree we need to be mindful of our metaphors, but they do help both with inspiration for developing techniques and with naming things. The onus of keeping bias in check when using metaphors is on the reader; authors can't really do that for you. However, once bias is in check you can have a very productive debate in terms of these namings, given that everyone is aware of their ontology.
It is very unhelpful, or worse, to use this language this way.
One might as well say "need neural plasticity" which is as much an analogy and equally misleading and counterproductive in shaping the right model of the system.
One might even call this pernicious: what it encourages is already a social problem, and it doesn't aid understanding; it confounds it.
One of the mayors of New York in the 80's (Koch?) famously doubled the city's bus fleet for zero cost by running them 24 hours, instead of letting them rest at the end of their shifts, as was the previous policy.
Clicking through, that’s exactly what it is. Seems like “sleep” is an excellent term to use here.
Also keep in mind that most if not all devices with a chip have had a function called "sleep" for many years, without this argument.
That's more like a doctor visit and a workout. The sleep will be the part of the duty cycle when it's not operating.
> When I reboot a computer, is that equivalent to a nap?
Yes, it wakes up completely refreshed and in good working order, usually, and if there's still a problem you know you need a technician.
There is a strong, non-trivial connection here between what your brain does in sleep and what they are studying.
You wouldn't object to referring to robot eyes or robot legs.
This is not anything new, it's just a word that fits the function.
Do androids dream of electric sheep?
I mean, you do put your computer into "sleep" mode and then "wake" it.
Analogies are useful. I think we need to learn how to continue to benefit from them despite the risk of anthropomorphization.
Maybe someday we'll understand the way our minds work well enough to design from first principles but until then we've only got one template for how a thinking machine should look.
At the very least, we know that sleep and dreaming do exist in biological brains. (Doesn't mean any of it is applicable to artificial neural nets, doesn't mean it'll work for our specific architectures etc. etc., but at least the idea requires fewer assumptions than a completely untested novel theory.)
Essentially it goes "You know how your model can remember its training data? Well, what if you treated its recent context like more training data and updated (some of) the weights using (mostly) the same process used to train it?"
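A minimal sketch of that idea, under heavy assumptions: the "model" is a single scalar predicting the next value as `(w_base + w_delta) * x`, the base weight stays frozen, and only the small `w_delta` is updated by the same SGD-on-next-item loss you would use in training. `adapt_on_context` is a hypothetical name, not anything from the paper.

```python
from typing import List


def adapt_on_context(w_base: float, context: List[float],
                     steps: int = 50, lr: float = 0.01) -> float:
    """Treat the recent context as extra training data.

    Only the delta (a stand-in for "some of the weights") is updated;
    the base weight is left untouched.
    """
    w_delta = 0.0
    pairs = list(zip(context, context[1:]))  # (current, next) training pairs
    for _ in range(steps):
        for x, target in pairs:
            pred = (w_base + w_delta) * x
            grad = 2.0 * (pred - target) * x  # d/dw of squared error
            w_delta -= lr * grad
    return w_delta


w_base = 1.0                      # "pretrained" guess: next value equals current
context = [1.0, 2.0, 4.0, 8.0]    # the recent context actually doubles each step
w_delta = adapt_on_context(w_base, context)
adapted = w_base + w_delta
```

Here the adapted weight ends up very close to 2.0, so the model has picked up the doubling pattern from its context without the base weight changing.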
The end result is very good at remembering things but also really good at adapting to new unseen distributions.
Remember Microsoft Tay.
This would create a three-layer memory system:
- Stable long-term memory (initial base weights)
- Mid-term memory built from the compactions and replay buffers
- Short-term memory (KV cache)
Sleeping would just be a fancy term for consolidating and transferring information from one memory layer to another during offline hours. Maybe that's also what the brain does while sleeping.
Also, we wouldn't train on the whole session. A separate critic module, like a reward model, would filter the KV cache to extract the high-value information, like a garbage collector before the LoRA.
That's just an idea though. Right now most research focuses on changing the architecture itself (TITAN, HOPE...) instead.
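To make the three layers concrete, here is a rough sketch with plain dictionaries standing in for weights. All names (`LayeredMemory`, `observe`, `recall`, `sleep`) are hypothetical, the `critic` is just a scoring callback rather than a trained reward model, and writing into `mid` stands in for what would really be a LoRA update.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class LayeredMemory:
    base: Dict[str, str]                                         # long-term, frozen
    mid: Dict[str, str] = field(default_factory=dict)            # consolidated layer
    short: List[Tuple[str, str]] = field(default_factory=list)   # KV-cache stand-in

    def observe(self, key: str, value: str) -> None:
        """Record something seen during the current session."""
        self.short.append((key, value))

    def recall(self, key: str) -> Optional[str]:
        """Look up newest first: short-term, then mid-term, then base."""
        for k, v in reversed(self.short):
            if k == key:
                return v
        if key in self.mid:
            return self.mid[key]
        return self.base.get(key)

    def sleep(self, critic: Callable[[str, str], float],
              threshold: float = 0.5) -> int:
        """Offline consolidation: keep only what the critic rates highly."""
        kept = 0
        for k, v in self.short:
            if critic(k, v) >= threshold:
                self.mid[k] = v
                kept += 1
        self.short.clear()  # session context is dropped either way
        return kept


def toy_critic(key: str, value: str) -> float:
    """Hypothetical critic: durable facts about the user are high value."""
    return 1.0 if key.startswith("user_") else 0.0


mem = LayeredMemory(base={"capital_of_france": "Paris"})
mem.observe("user_name", "Ada")
mem.observe("weather_today", "rain")
kept = mem.sleep(toy_critic)
```

After `sleep`, the user's name survives in the mid-term layer, today's weather is forgotten, and the base layer is unchanged.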
Felt like stating the obvious there? Greenwich being the center of everything after all.
Biologically humans do similar compression, so introducing a similar concept to an LLM also feels reasonable. Hardware isn't fast/cheap enough to do this on an ongoing basis, similar to how it's too expensive for our brains to do this while we're moving through the world.
All we have now most of the time in LLMs is "working memory"; we're missing a lot of the functionality that allows for episodic memory and selective plasticity.
The more you read about how human brains work, the more you realize that we may have figured out a piece with LLMs, but it's certainly nothing approaching AGI. People insisting so are blowing smoke for investor hype or don't understand a big piece of the concepts involved.
That's already possible with LLMs. The challenge is that 1. it would allow permanently jail-breaking models and 2. there'd be no way for them to efficiently transfer what they'd learned to a new model generation.
Coincidentally the human brain is also jailbroken and nontransferable
A more appropriate title would have been something like "Offline Recurrent Memory Consolidation for Long-Context Language Models". This is supposed to be a research paper, not a story book. The title should give context to other researchers, and not be clearly engineered for clicks. If you don't think so, that's your prerogative, but you're objectively wrong.
I do think it points at something bigger than just attention architecture: "memory" isn't just storage, and merely longer context isn't the same thing as having a better understanding of the source data.
I'm looking at this through the "personal AI" lens, where I think the missing "memory" layer seems to be consolidation & prioritization. It's not enough to just pattern match and grab the right emails, notes, etc., stuff them into the context window & hope; instead it's useful to consider offline processing that turns events into durable state: clusters of observed data become episodes, assumptions, and contradictions, and power confidence for suggestions.
That also pushes up the need for provenance & inspectability. It's going to be interesting to see what kind of memory consolidation strategies are required for each domain use case.
Also not too sure about provenance and inspectability - it is part of memory. If the source is deemed 'important' it will survive forgetting. If not, then maybe not. And it's ok. I am sure you don't know the exact source who told you that the capital of France is Paris. You forgot, and it's no big deal.
Scaling test-time compute has emerged as a key ingredient for enabling large language models (LLMs) to solve difficult problems, but comes with high latency and inference cost. We introduce sleep-time compute, which allows models to "think" offline about contexts before queries are presented: by anticipating what queries users might ask and pre-computing useful quantities, we can significantly reduce the compute requirements at test-time. To demonstrate the efficacy of our method, we create modified versions of two reasoning tasks - Stateful GSM-Symbolic and Stateful AIME. We find that sleep-time compute can reduce the amount of test-time compute needed to achieve the same accuracy by ~ 5x on Stateful GSM-Symbolic and Stateful AIME and that by scaling sleep-time compute we can further increase accuracy by up to 13% on Stateful GSM-Symbolic and 18% on Stateful AIME. Furthermore, we introduce Multi-Query GSM-Symbolic, which extends GSM-Symbolic by including multiple related queries per context. By amortizing sleep-time compute across related queries about the same context using Multi-Query GSM-Symbolic, we can decrease the average cost per query by 2.5x. We then conduct additional analysis to understand when sleep-time compute is most effective, finding the predictability of the user query to be well correlated with the efficacy of sleep-time compute. Finally, we conduct a case-study of applying sleep-time compute to a realistic agentic SWE task.
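The amortization argument in that abstract is simple arithmetic, sketched below. The cost units and the specific numbers are made up for illustration and are not taken from the paper; `avg_cost_per_query` is a hypothetical helper.

```python
def avg_cost_per_query(sleep_cost: float, per_query_cost: float,
                       n_queries: int) -> float:
    """One-off offline cost spread over n queries, plus the per-query cost."""
    return (sleep_cost + per_query_cost * n_queries) / n_queries


# Illustrative numbers only (not from the paper):
baseline = 10.0    # test-time cost per query with no offline work
sleep_cost = 10.0  # one-off offline "thinking" about the context
cheap_query = 2.0  # test-time cost per query after that offline work

one_query = avg_cost_per_query(sleep_cost, cheap_query, 1)
ten_queries = avg_cost_per_query(sleep_cost, cheap_query, 10)
break_even = sleep_cost / (baseline - cheap_query)
```

With these assumed numbers a single query is more expensive than the baseline (12.0 vs 10.0), ten related queries average 3.0 each, and the offline work pays for itself after 1.25 queries on the same context.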