Memorizing Transformers

(arxiv.org)

179 points · silencedogood3 · 4y ago · 32 comments

30 comments · 9 top-level

jameshart · 4y ago · 10 in thread

The ‘ethics’ section seems surprisingly cursory and lacking in references.

“The ability to memorize large databases of facts could have potential ramifications for society, especially if those databases include sensitive personal information or copyrighted works. However, one advantage of using an external memory is that the memory can be easily cleared of all such information”

That’s it? Just ‘may have ramifications’?

No concern that this enables ‘Tay’-like failure modes where a system can be manipulated through input into generating particular output?

Or even just grappling with whether adding ‘memory of experiences’ to a language model might open the door to creating a system that has beliefs, or opinions…? and that maybe there might be some ethical concerns with just wiping that out?

ipsum2 · 4y ago

That'd be a waste of space. Most transformer models have the same ethical concerns, which have been addressed in countless other papers. Why bother copy-pasting the same essays into every minor tweak of transformers?

dotnet00 · 4y ago

The ethics sections for ML papers almost always seem extremely superfluous. It's like asking a CPU designer to talk about the danger that their CPU can run code for computing firing trajectories. It's a paper about providing memory to ML models, so it'll have all the possible applications that require memory. What else does one need?

kettleballroll · 4y ago

The ethics section is a tacked-on thing which is required by some large ML conferences. It's essentially a PR stunt. No ML researcher I know cares about it, or devotes more than the 5 minutes it takes to write some platitudes to the task. There are simply no incentives to write this properly. And quite frankly, I don't think there should be. We are educated, paid and motivated to push the boundaries of research, not to think about all potential fallout (which, let's face it, would usually require a whole additional paper for most meaningful contributions). I don't really see how we could change this.

Tldr: as a general rule you can ignore the ethics section of ML papers.

6gvONxR4sf7o · 4y ago

> We are educated, paid and motivated to push the boundaries of research, not to think about all potential fallout

That’s the whole problem that led to the introduction of these sections.

balthigor · 4y ago

> not to think about all potential fallout

You're doing it wrong then.

Ignoring ethics is lazy.

enchiridion · 4y ago

Yep, this is correct.

YeGoblynQueenne · 4y ago

>> Tldr: as a general rule you can ignore the ethics section of ML papers.

More generally still, you can ignore the ethics of ML researchers- pretty much for the same reasons that you can ignore the Great Turnip of Justice in the sky.

refulgentis · 4y ago

I'm not sure it's scientific or helpful to include the risk that a program develops "beliefs" or "opinions", and terminating the program is "wiping [someone] out"

visarga · 4y ago

> No concern that this enables ‘Tay’ like failure modes where a system can be manipulated through input into generating particular output?

Isn't that the core idea in prompting and few-shot learning for large language models?

changoplatanero · 4y ago

My feeling is that those topics would be best addressed in a separate paper by authors who have more of a background in ethics.

lucidrains · 4y ago · 4 in thread

have an implementation of this over at https://github.com/lucidrains/memorizing-transformers-pytorc..., for any researcher exploring retrieval and memory with attention networks

knrz · 4y ago

Dude your repos are great, marvellous code quality too for cutting-edge papers. Keep it up!

lucidrains · 4y ago

hey thanks! :^) hope someone makes the next big discovery with them

silencedogood3 (OP) · 4y ago

Neat! Can you explain what the kNN is doing? I can’t quite follow the paper.

visarga · 4y ago

It's a sparse attention scheme. They store and reuse activations, thus "memorising" the past without the need for training. To keep the sequence short enough to fit into memory, they only recall the k most similar memories from a much larger log.
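
A minimal sketch of that recall step, in plain Python. It assumes dot-product similarity and an exact search over the log; the function name `knn_attention` and the toy vectors are made up for illustration, and a real system would more likely use an approximate nearest-neighbour index.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def knn_attention(query, mem_keys, mem_values, k):
    """Attend over only the k stored keys most similar to the query."""
    # Exact search for clarity (assumption: the log is small enough to scan).
    scores = [dot(query, key) for key in mem_keys]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax is taken over the k retrieved entries only, not the whole log.
    scale = math.sqrt(len(query))
    weights = softmax([scores[i] / scale for i in top])
    dim = len(mem_values[0])
    out = [sum(w * mem_values[i][d] for w, i in zip(weights, top)) for d in range(dim)]
    return out, top


# Toy log of four stored (key, value) pairs.
mem_keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
mem_values = [[10.0], [20.0], [30.0], [40.0]]
out, top = knn_attention([1.0, 0.0], mem_keys, mem_values, k=2)
```

Only the two entries whose keys point roughly the same way as the query contribute to `out`; the other two are never touched by the softmax.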

6gvONxR4sf7o · 4y ago · 2 in thread

External memory with pretrained models (or more generally, external not-necessarily-differentiable memory) is one of the most exciting areas of ML right now. It opens up models to external things like facts and databases.

silencedogood3 (OP) · 4y ago

Can you explain what the big deal is? I’m still in the early learning stages.

6gvONxR4sf7o · 4y ago

As an example, suppose you want to encode all of the data in Wikipedia with embeddings and train a model to answer questions with that information. Historically, that would mean a model that encodes all of Wikipedia, encodes the question, uses all of encoded Wikipedia to decode an answer, then does backprop through all of that and updates the weights. Then it re-encodes all of Wikipedia with the new weights and does it all over again at every training step, while somehow holding all of that in GPU memory. Meaning you basically couldn’t do it that way.

Today, we’re seeing big models that can encode all of Wikipedia in useful ways. If the encodings are “good enough” then you can encode all of Wikipedia once, before training another model that just has to encode a question, then use encoded Wikipedia to decode an answer, then do backprop through just the answer and question. If Wikipedia changes in the meantime, you can probably just update your database of encoded stuff and your learned QA model will be able to incorporate that new information.
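
The "encode once, look up later, update the database without retraining" pattern can be sketched roughly as below. Everything here is an assumption for illustration: `encode` is a bag-of-words stand-in for a frozen pretrained encoder, `EncodedStore` is a hypothetical name, and the two documents are toy text, not real data.

```python
import math
from collections import Counter


def encode(text):
    """Stand-in for a frozen pretrained encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    num = sum(a[t] * b[t] for t in a if t in b)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0


class EncodedStore:
    """Documents are encoded once on insert; queries never re-encode them."""

    def __init__(self):
        self.entries = {}  # doc id -> (embedding, text)

    def upsert(self, doc_id, text):
        # Updating a document only touches the database, never model weights.
        self.entries[doc_id] = (encode(text), text)

    def retrieve(self, question, k=1):
        q = encode(question)
        ranked = sorted(self.entries.values(), key=lambda e: cosine(q, e[0]), reverse=True)
        return [text for _, text in ranked[:k]]


store = EncodedStore()
store.upsert("paris", "paris is the capital of france")
store.upsert("cats", "the sand cat is a small wild cat")
before = store.retrieve("what is the capital of france")

# The source text changes: replace one entry, leave everything else alone.
store.upsert("paris", "paris is the capital and largest city of france")
after = store.retrieve("what is the largest city of france")
```

In a real setup the retrieved text (or its encoding) would be fed to the trained QA model; here the lookup alone shows that new information becomes visible as soon as the store is updated.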

shallichange · 4y ago · 2 in thread

Top of my head: Rodimus, Bumblebee, Ratchet, Optimus Prime, Laserbeak, Megatron, Astro Train, Jazz

UmbertoNoEco · 4y ago

People here don't deserve you :)

lukaszkups · 4y ago

this is what I've been expecting when clicking on this submission

blackbear_ · 4y ago · 2 in thread

> On benchmarks including code and mathematics, we find that the model is capable of making use of newly defined functions and theorems during test time.

Train on test, improved performance on test. Wow.

visarga · 4y ago

> Wow.

Transformers are very limited in the size of the attention window. They can take a few thousand tokens at most. But your data might not fit into the window, and you also don't want to have to fine-tune the model. This paper offers a solution.
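
A rough sketch of how a document longer than the window might be handled: it is read one segment at a time, and each finished segment is appended to a memory that later segments can look things up in. This is an assumption-laden toy, not the paper's code; `process_in_windows` is a hypothetical name and plain token ids stand in for the stored key/value pairs.

```python
def process_in_windows(tokens, window):
    """Yield (segment, memory_size) pairs; memory holds everything before the segment."""
    memory = []  # stand-in for stored key/value pairs of earlier tokens
    for start in range(0, len(tokens), window):
        segment = tokens[start:start + window]
        # The model only attends locally within `segment`, plus lookups into `memory`.
        yield segment, len(memory)
        # Store the segment after it has been processed; no weights are updated.
        memory.extend(segment)


steps = list(process_in_windows(list(range(10)), window=4))
```

With a window of 4 and 10 tokens, the third segment sees only 2 local tokens but has 8 earlier tokens available through memory.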

spullara · 4y ago

It isn't being trained on test. Kind of the point of memory is that you can change the memory at will and don't need to train on new information you have never seen before.

tipsytoad · 4y ago · 1 in thread

Could there be any merit in training this on a common-sense dataset such as Cyc?

https://www.lesswrong.com/tag/cyc

ipsum2 · 4y ago

Probably not. Most common facts (a sand cat is a type of feline) are already known by transformers. Maybe some obscure ones.

axg11 · 4y ago

See also RETRO, a type of retrieval transformer: [0], [1], [2]

[0] - https://www.deepmind.com/publications/improving-language-mod...

[1] - https://jalammar.github.io/illustrated-retrieval-transformer...

[2] - https://arsham.substack.com/p/retrieval-transformers-for-med...

jerpint · 4y ago

The basic idea is to have a q, k, v cache of all the previously seen tokens that gets updated over time. The transformer can decide to do self-attention (and ignore the cache) or focus on elements from the cache (enabling it to attend to previously seen tokens). They mainly apply this to large documents; I'd be very curious to see a follow-up on time-dependent tasks like videos.
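
One way to picture the "decide between self-attention and the cache" part is a gate that blends the two attention results. The sketch below assumes a single scalar gate passed through a sigmoid, which is my reading of the idea rather than a confirmed detail of the paper; `gated_output`, `gate_bias` and the toy vectors are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attend(query, keys, values):
    """Plain softmax dot-product attention for a single query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e / total * v[d] for e, v in zip(exps, values)) for d in range(dim)]


def gated_output(query, local_kv, memory_kv, gate_bias):
    """Blend local self-attention with attention over cached entries."""
    g = sigmoid(gate_bias)  # g near 0 ignores the cache, g near 1 relies on it
    local = attend(query, *local_kv)
    mem = attend(query, *memory_kv)
    return [g * m + (1.0 - g) * l for m, l in zip(mem, local)]


local_kv = ([[1.0, 0.0]], [[1.0]])   # keys, values from the current window
memory_kv = ([[0.0, 1.0]], [[5.0]])  # keys, values recalled from the cache
balanced = gated_output([1.0, 1.0], local_kv, memory_kv, gate_bias=0.0)
local_only = gated_output([1.0, 1.0], local_kv, memory_kv, gate_bias=-50.0)
memory_only = gated_output([1.0, 1.0], local_kv, memory_kv, gate_bias=50.0)
```

In a trained model the gate value would be learned; here it is set by hand to show the two extremes and the midpoint.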

mountainriver · 4y ago

Love it! It seems like a lot of the ideas from reinforcement learning are making their way into transformer land and NLP.
