To clarify, what comes from that paper: the claim that LLMs are zero-shot learners (yes) or the term zero-shot (no[0])?
> I believe the idea is that the LLM was not trained on the task in question
Not quite. We'll see in [0] that the definition is
>> We consider the problem of zero-shot learning, where the goal is to learn a classifier f : X → Y that must predict novel values of Y that were omitted from the training set. To achieve this, we define the notion of a semantic output code classifier (SOC) which utilizes a knowledge base of semantic properties of Y to extrapolate to novel classes.
To clarify, this means that their goal is to obtain a classifier f: X → Y, but they train f': X → Z, where Z ⊂ Y. You then test by evaluating f': X → A, where A ⊂ Y and A ⊄ Z. To make this clearer, their experiments classify 60 words such as bear, dog, cat, truck, car, airplane. You'll notice there are two metaclasses here (there are more): animals and vehicles. The second dataset included 128 _semantic_ features (e.g. size/shape/surface properties/usage) about the previous words, and that's what they tested against. Notice how the abstraction level increases. Note that Z ⊂ A is acceptable, but not the other way around; this should clarify my LAION -> ImageNet example.
The reason this is important is that zero-shot tells us about the model's ability to generalize: the model has to learn additional and _abstracted_ discriminating boundaries within the data, beyond those it was explicitly trained for. It is not very informative to learn that a model can perform a subset of its trained task (see the CIFAR-5 example in the sibling comment), though this can still be interesting for other reasons.
I should mention that there is a "transductive setting" for zero-shot, where unlabeled versions of the novel classes are provided during training, but this is explicitly stated when done and there is some contention about its utility. This is better referred to as "transductive testing". Generative models are also contentious, since density estimators will localize similar data, which is to say that they classify (this is a consequence of the training method, so it can be argued that we've explicitly directed the machine to learn this). This relates directly to the transductive point.
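To make the semantic output code idea from [0] concrete, here is a minimal sketch, not the paper's implementation. The knowledge base, its three features, and the feature values are all made up for illustration, and the learned stage (input → semantic features) is replaced by a hand-written feature vector. The only point is that the second stage can return a class that never appeared in training, because it looks the answer up in the knowledge base rather than in the training labels.

```python
# Hypothetical toy knowledge base: class -> semantic feature vector
# (is_animal, is_vehicle, is_large). All values are invented for illustration.
KNOWLEDGE_BASE = {
    "dog":      (1.0, 0.0, 0.0),
    "bear":     (1.0, 0.0, 1.0),
    "car":      (0.0, 1.0, 0.0),
    "airplane": (0.0, 1.0, 1.0),  # assumed to be absent from the training set
}

# Classes assumed to have labeled training examples (the set Z).
TRAIN_CLASSES = {"dog", "bear", "car"}


def nearest_class(predicted_features, candidates):
    """Return the candidate class whose knowledge-base entry is closest
    (squared Euclidean distance) to the predicted semantic features."""
    def sq_dist(label):
        return sum(
            (p - q) ** 2
            for p, q in zip(predicted_features, KNOWLEDGE_BASE[label])
        )
    return min(candidates, key=sq_dist)


# Stand-in for the output of a learned map X -> semantic features:
# "not an animal, a vehicle, fairly large".
predicted = (0.1, 0.9, 0.8)

# Searching the whole knowledge base can yield a novel class...
novel_prediction = nearest_class(predicted, KNOWLEDGE_BASE)
# ...while a classifier limited to its training labels cannot.
closed_prediction = nearest_class(predicted, TRAIN_CLASSES)
```

Here `novel_prediction` is "airplane" even though no airplane example was assumed during training, while `closed_prediction` falls back to "car".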
For a definition of zero-shot learning, I suggest the paper Zero-Shot Learning -- A Comprehensive Evaluation of the Good, the Bad and the Ugly[1] (which, you'll note, predates FLAN by 4 years). I'll make one point, though: this work states
> Zero-shot learning assumes disjoint training and test classes
But I don't think that's entirely accurate, given the abstraction case we discussed above. This is more a matter of semantics, though, and for the dataset they generate it isn't extremely relevant. The more general notion of zero-shot doesn't require disjoint classes, just that the testing set isn't a subset of the training set (which is always true in the disjoint setting). (Side note: notice that they provide a train/val/test split instead of train/test. This is kinda important.) Note that my critique is consistent with another survey work[2] (which also predates FLAN):
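The distinction between "disjoint" and "not a subset" can be written down as a small check. This is only a sketch of the condition as I've stated it here, not something taken from either survey; the function name and the three return labels are my own.

```python
def zero_shot_setting(train_classes, test_classes):
    """Classify an evaluation by the set relation between the classes
    seen in training and the classes tested on.

    The labels returned are illustrative names, not standard terminology.
    """
    train, test = set(train_classes), set(test_classes)
    if not test:
        raise ValueError("test_classes must be non-empty")
    if test <= train:
        # Every test class was trained on: a subset of the trained task.
        return "not zero-shot"
    if train.isdisjoint(test):
        # The stricter condition quoted from [1].
        return "disjoint"
    # Some test classes were seen, but at least one is novel.
    return "overlapping"


strict = zero_shot_setting({"cat", "dog"}, {"truck"})
relaxed = zero_shot_setting({"cat", "dog"}, {"cat", "dog", "truck"})
subset = zero_shot_setting({"cat", "dog", "truck"}, {"cat"})
```

The first two cases both satisfy the "test set is not a subset of the training set" condition; only the first satisfies the disjoint condition. The third is the CIFAR-5 style case and satisfies neither.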
> Definition 1.1 (Zero-Shot Learning). Given labeled training instances D^{tr} belonging to the seen classes S, zero-shot learning aims to learn a classifier f^u(·) : X → U that can classify testing instances X^{te} (i.e., to predict Y^{te} ) belonging to the unseen classes U.
As to FLAN, we should mention that the GPT-3[3] work puts quotes around "zero-shot", likely because they recognize its bastardization. But naming things is one of Bambrick's two hard problems. Notice that they also clearly define their usage. You'll notice that FLAN does not do this! My claim that LLMs are not zero-shot learners rests on the fact that they have actually been trained on all the domains they have been "zero-shot evaluated" on. FLAN gives as examples of "zero-shot" tasks: “Is the sentiment of this movie review positive or negative?” or “Translate ‘how are you’ into Chinese.” But what you have to ask yourself is whether these questions themselves are in the training set, as this determines whether our requirement is met; if they are, the evaluation would at best be in that "transductive setting," which I think we can now agree is not a great thing to refer to as "zero-shot". The problem is that these questions are very likely in the training datasets, as those incorporate things like Reddit and HackerNews, where we can definitely find explicit labels for movie reviews as well as some translation tasks (common on language subreddits). That's the issue here. Just because you aren't aware that you trained a model to perform a specific task doesn't mean that you didn't, and thus doesn't mean you actually performed a zero-shot task.
[0] Zero-Shot Learning with Semantic Output Codes https://www.cs.toronto.edu/~hinton/absps/palatucci.pdf
[1] Zero-Shot Learning -- A Comprehensive Evaluation of the Good, the Bad and the Ugly https://arxiv.org/abs/1707.00600
[2] A Survey of Zero-Shot Learning: Settings, Methods, and Applications https://dl.acm.org/doi/10.1145/3293318
[3] Language Models are Few-Shot Learners https://arxiv.org/abs/2005.14165