> This is what confuses me though, people don't write things like: What is the most relevant sentence in this book?
I think it's because this confuses even the researchers. The current explanation for why these models are robust (and even accurate) on data not found in their training set is that various regularizations are applied to the data during training. There is a 10% random token dropout. The optimizers also apply regularization of sorts via weight decay and other math tricks I'm not privy to. This consistent regularization means the model will try to overfit but randomly fail. Since any given token is occasionally missing, the model instead learns a robust strategy of sorts for handling tropes/cliches/common patterns.
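To make that concrete, here's a minimal sketch (in PyTorch) of what those two regularizers could look like in a single training step. The 10% rate, the mask-token id, and the toy model/loss are all placeholders I'm assuming for illustration, not anyone's actual recipe:

```python
import torch

# Toy stand-ins: the vocab size, mask id, dropout rate, model, and loss
# are assumptions for illustration, not a real training recipe.
VOCAB_SIZE = 50_000
MASK_ID = 0      # assumed id of a mask/unknown token
DROP_P = 0.10    # the "10% random token dropout" mentioned above

model = torch.nn.Embedding(VOCAB_SIZE, 256)  # stand-in for a real LM
# AdamW applies weight decay directly to the weights on every step,
# constantly nudging them back toward zero.
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

def corrupt(tokens: torch.Tensor) -> torch.Tensor:
    # Randomly knock out ~10% of input tokens, so the model can't count on
    # any particular token always being present (i.e., can't purely memorize).
    drop = torch.rand(tokens.shape) < DROP_P
    return torch.where(drop, torch.full_like(tokens, MASK_ID), tokens)

tokens = torch.randint(1, VOCAB_SIZE, (8, 128))  # fake batch of sequences
loss = model(corrupt(tokens)).pow(2).mean()      # placeholder loss
loss.backward()
opt.step()       # weight decay is applied here, alongside the gradient step
opt.zero_grad()
```

The point of the sketch is just that both knobs push the same direction: the input corruption makes memorizing exact sequences unreliable, and the decay makes hoarding parameters for them expensive.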
Basically, the idea is that since the model has seen enough "most relevant sentence..." examples, it does indeed begin to grok/internally model the intent and meaning of those words across the variety of contexts it has also learned (even though it has never seen the exact combination, e.g. "relevant sentence in this book"). Modeling this internally may be a waste of parameter space at first, but it quickly becomes the most efficient way of storing the information: rather than memorizing every instance in the dataset, you just call on the subset of weights which "understand" how those words are intended to be used.
Since this happens recursively as the generated output gets longer (each new token feeds back into the input), other strategies that have been developed along the way are also called upon, and the whole thing becomes difficult or impossible to interpret in a meaningful way.
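That feedback loop is just ordinary autoregressive decoding; a toy sketch, where `toy_model` is a hypothetical stand-in for a real LM forward pass:

```python
import random

def toy_model(tokens: list[int]) -> list[float]:
    # Hypothetical stand-in for a real LM: returns a score per vocab id.
    random.seed(sum(tokens))
    return [random.random() for _ in range(100)]  # pretend vocab of 100

def generate(model, tokens: list[int], steps: int) -> list[int]:
    # Each chosen token is appended and fed back in, so whatever internal
    # "strategies" fired early keep shaping everything generated after them.
    for _ in range(steps):
        scores = model(tokens)
        next_token = max(range(len(scores)), key=scores.__getitem__)
        tokens = tokens + [next_token]
    return tokens

print(generate(toy_model, [1, 2, 3], steps=5))
```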
I don't know of a whole lot of proof for this, but I see these ideas thrown around a lot. The same thing is found a lot in biology, where cells and multicellular life experience lots of damage to their structures, even down to the DNA, over a lifespan or a series of lifespans. To account for this, instead of memorizing exactly how to walk with n limbs depending on how many you happen to lose, life may instead develop a system which can learn on the fly how to walk (or, in humans' case, wheelchair) around.
As for your last point about the attention vector: I don't know if it could accurately print its own attention vector. But I do think those values could be used as a sort of temporary solution for "ranking", perhaps. I don't think that's what happens in the natural-language case of ranking, subjectively, the `best` sentence in the article; I still think that's mostly a case of modeling language well, and in _many_ domains and modes.
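If you wanted to poke at the "attention as a ranking signal" idea from the outside, a speculative sketch with the Hugging Face transformers library might look like the following. Scoring sentences by the averaged `[CLS]` attention their tokens receive is purely my assumption for illustration (and a crude heuristic at that), not an established relevance measure; the model choice is arbitrary too:

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "distilbert-base-uncased"  # arbitrary small encoder for the demo
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_attentions=True)

sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "Transformers route information through attention weights.",
    "I had toast for breakfast this morning.",
]
document = " ".join(sentences)

enc = tok(document, return_tensors="pt", return_offsets_mapping=True)
offsets = enc.pop("offset_mapping")[0]  # (seq, 2) character spans per token
with torch.no_grad():
    out = model(**enc)

# out.attentions is one (batch, heads, seq, seq) tensor per layer.
# Average over layers and heads; row 0 is how much the [CLS] position
# attends to every token position.
cls_attn = torch.stack(out.attentions).mean(dim=(0, 2))[0, 0]  # (seq,)

# Map tokens back to sentences via character offsets, then score each
# sentence by the mean [CLS] attention its tokens received.
bounds, start = [], 0
for s in sentences:
    bounds.append((start, start + len(s)))
    start += len(s) + 1  # +1 for the joining space

for (lo, hi), s in zip(bounds, sentences):
    idx = [i for i, (a, b) in enumerate(offsets.tolist())
           if a >= lo and b <= hi and b > a]  # b > a skips special tokens
    score = cls_attn[idx].mean().item() if idx else 0.0
    print(f"{score:.4f}  {s}")
```

Even if this runs fine, it only tells you where attention mass went, which is exactly why I doubt it's what the model is doing when it answers a "best sentence" prompt in natural language.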