Can you give an example, so that we can discuss this better or so that I can update my understanding? That said, reducing this to "just trained to predict the next token" is not accurate: it ignores the differences in architectures and cost functions, whose biases dramatically affect what the model actually learns. As a clear example, training an image model on likelihood does not guarantee that the model will produce high-fidelity samples[0], though it will be better at imputation or classification. Some other helpful references: [1,2].
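To make the likelihood point concrete, here is the mixture argument as I recall it from [0]. The symbols are my own labels: p is a hypothetical model with both good likelihood and good samples, and q is a hypothetical model that only produces noise.

```latex
% p, q are illustrative assumptions, not models from any specific paper
\log\bigl[\,0.01\,p(x) + 0.99\,q(x)\,\bigr]
  \;\ge\; \log\bigl[\,0.01\,p(x)\,\bigr]
  \;=\; \log p(x) - \log 100
  \;\approx\; \log p(x) - 4.61 \text{ nats}
```

So the mixture loses at most about 4.61 nats of log-likelihood per sample, which is small next to typical image log-likelihoods, yet 99% of its samples come from the noise model q. Near-identical likelihood, very different sample quality.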
> Zero-shot is perfectly valid because there is no backpropagation or weight change involved.
I disagree with this. What you have described is still within the broader class of fine tuning. Note that zero-shot is also tuning. I can make this perfectly clear with a simple example that is directly related to my previous argument. ``Suppose we train a model on the CIFAR-10 dataset. Then we "zero-shot" evaluate it on CIFAR-5, where we've just removed 5 random classes.`` I think you'll agree that it should be unsurprising that the model performs well on this second task. This is exactly the "Train on LAION then 'zero-shot' classification on ImageNet" task we commonly see. Subsets are not a clear task change.
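Here is a minimal sketch of why the CIFAR-5 result should be unsurprising. It uses no real CIFAR data and no real model: `simulate_logits` is a hypothetical stand-in that just produces noisy scores with a bump on the true class, and the "zero-shot" step is nothing more than restricting the argmax to the 5 kept classes.

```python
import random

def simulate_logits(label, num_classes, rng, signal=1.0):
    # Hypothetical stand-in for a trained 10-class model:
    # Gaussian noise per class, plus a bump on the true class.
    return [rng.gauss(0.0, 1.0) + (signal if c == label else 0.0)
            for c in range(num_classes)]

def predict(logits, allowed):
    # Argmax restricted to the allowed classes; the scores themselves
    # (i.e. the "weights") are never modified.
    return max(allowed, key=lambda c: logits[c])

def accuracy(samples, allowed):
    hits = sum(predict(logits, allowed) == label for label, logits in samples)
    return hits / len(samples)

rng = random.Random(0)
all_classes = list(range(10))
kept = sorted(rng.sample(all_classes, 5))  # "CIFAR-5": drop 5 random classes

# Synthetic evaluation set drawn over all 10 classes.
samples = []
for _ in range(2000):
    label = rng.choice(all_classes)
    samples.append((label, simulate_logits(label, 10, rng)))

# Keep only examples whose label survives in the 5-class subset.
subset_samples = [(y, z) for y, z in samples if y in kept]

acc_10way = accuracy(subset_samples, all_classes)  # original 10-way head
acc_5way = accuracy(subset_samples, kept)          # "zero-shot" 5-way head

print(f"10-way head on kept classes: {acc_10way:.3f}")
print(f"5-way 'zero-shot' head:      {acc_5way:.3f}")
```

The 5-way accuracy can never be lower than the 10-way accuracy on the same examples: whenever the true class wins against all 10 classes, it also wins against the 5 kept ones. Nothing was trained and no weights changed, but nobody would call this a new task.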
> These two change the effective weight of the matrices.
I'm having a difficult time understanding your argument, as this directly contradicts your first sentence. I wouldn't even make the lack of weight change a requirement for zero-shot learning, as the intent is really that we do not need to directly change the weights through further training. If a model has enough general knowledge and we do not need to modify the parameters explicitly by providing more training (i.e. using a cost function and {back,forward}prop), then this is sufficient. Randomly changing parameters, adding non-trainable parameters like activations, or pruning is also acceptable, as is what you explicitly mentioned. The point comes down to requiring no additional training for __additional domain{,s}__. The training part is not the important part here and not what is in question.
My point is explicitly the claim that subdomains do not constitute zero-shot learning. If you disagree about what I have called subdomains, then that's a different argument. I'm not arguing against your latter points because they also don't argue against what I claimed. But I will say that "just because you didn't use backprop doesn't mean it is zero-shot", and if you disagree, then note that you have to claim that the CIFAR-5 example is "zero-shot."
Tldr: A -> B doesn't require that B -> A
[0] A note on the evaluation of generative models: https://arxiv.org/abs/1511.01844 (link for also obtaining slides and code: http://theis.io/publications/17/)
Also worth looking at many of the works that cite this one: https://www.semanticscholar.org/paper/A-note-on-the-evaluati...
[1a] Assessing Generative Models via Precision and Recall: https://arxiv.org/abs/1806.00035
[1b] Improved Precision and Recall Metric for Assessing Generative Models: https://arxiv.org/abs/1904.06991
[2] The Role of ImageNet Classes in Fréchet Inception Distance: https://arxiv.org/abs/2203.06026