Also, where is each softmax happening here? For each attention head?
https://chatgpt.com/share/67058973-ba94-8008-bed7-c7f9d08dc5...
I just don't see how you could answer these questions without trying it out. And chatgtp DEFINITELY isn't doing that.
Plus the obvious question I'd pose is not in there. What's the difference in performance between this trick and just "softmax() - 0.5 * 2" ? That seems very relevant.
The convex hull (https://en.wikipedia.org/wiki/Convex_hull) of a set is the smallest convex shape that includes that set. Geometrically, it's what you'd get if you "shrink wrapped" the thing you're looking at: edges still protrude, but any indentations get smoothed over.
In this context, the grandparent comment is pointing out that with a traditional transformer block, the resulting computed value for a token can never "stick out" past some weighted average of the values of attended-to tokens, but this differential attention formalism allows that result.
Then y is a convex combination of the v_i, and sits in the convex hull of the v_i.
In the context of standard transformer attention, each output lies in the convex hull ("somewhere between") the input values. With the modification of this paper, the input values can be scaled a little so that the output of different heads can be in different "regions" and thus do not interfere with each other (so yes to your third question, the two softmaxes are performed separately for each head).