As another comment pointed out, the attention we are actually interested in is what gets cancelled out *if the two are equal*. This is what the paper refers to in its abstract:
> promoting the emergence of sparse attention patterns
In theory, it is quite clever, and their results seem to back it up.
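
To make the cancellation concrete, here is a rough sketch. It assumes the mechanism boils down to subtracting one softmax attention map from another, scaled by a factor `lam`. The function names, the `lam` parameter, and the score values are all made up for illustration and are not taken from the paper.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of raw attention scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def differential_attention(scores_a, scores_b, lam=1.0):
    # Hypothetical helper: difference of two softmax attention maps.
    # Assumes the second map is simply scaled by lam and subtracted.
    a = softmax(scores_a)
    b = softmax(scores_b)
    return [x - lam * y for x, y in zip(a, b)]

# Identical score vectors: every weight cancels, including the useful one.
same = differential_attention([2.0, 0.5, 0.1], [2.0, 0.5, 0.1])

# Different score vectors: the token only the first map favours keeps a positive weight.
diff = differential_attention([2.0, 0.5, 0.1], [0.1, 0.5, 0.1])

print(same)
print(diff)
```

With identical inputs every weight is zero, so the signal is lost along with the noise. The scheme only helps when the two maps disagree on the tokens that matter and agree on the rest.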