
godelski · 2y ago

> It comes from this paper

To clarify, what comes from that paper? The claim that LLMs are zero-shot learners (yes) or the term zero-shot (no[0])?

> I believe the idea is that the LLM was not trained on the task in question

Not quite. We'll see in [0] that the definition is:

>> We consider the problem of zero-shot learning, where the goal is to learn a classifier f : X → Y that must predict novel values of Y that were omitted from the training set. To achieve this, we define the notion of a semantic output code classifier (SOC) which utilizes a knowledge base of semantic properties of Y to extrapolate to novel classes.

To clarify, this means that their goal is to obtain a classifier f: X → Y, but they train f': X → Z, where Z ⊂ Y. You then test by evaluating f': X → A, where A ⊂ Y and A ⊄ Z. To make this clearer, their experiments classify 60 words such as bear, dog, cat, truck, car, and airplane. You'll notice there are at least two metaclasses here: animals and vehicles. The second dataset included 218 _semantic_ features (e.g. size/shape/surface properties/usage) about those words, and that's what they tested against. Notice how the abstraction level increases. Note that Z ⊂ A is acceptable, but not the other way around; this should clarify my LAION -> ImageNet example.

This is important because zero-shot tells us about the model's ability to generalize: the model has to learn additional, more _abstracted_ discriminating boundaries within the data than the ones it was explicitly trained for. It is not very informative to learn that a model can perform a subset of its trained task (see the CIFAR-5 example in the sibling comment), though that can still be interesting for other reasons.

I should mention that there is a "transductive setting" for zero-shot, where unlabeled versions of the novel classes are provided during training, but this is explicitly stated when done, and there is some contention about its utility. It is better referred to as "transductive testing". Generative models are also contentious here, as density estimators will localize similar data, which is to say that they classify (this is a consequence of the training method, so it can be argued that we've explicitly directed the machine to learn this). This relates directly to the transductive point.
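For readers who want something concrete, here is a minimal sketch of the two-stage idea in the quoted definition: first map an input to semantic features, then look up the nearest class in a knowledge base that also contains classes never seen in training. Everything in it is an assumption for illustration: the two features (`is_animal`, `is_large`), the 2-D inputs, and the per-feature nearest-centroid step, which stands in for whatever model the paper actually fits. It is not the paper's implementation.

```python
# Toy semantic output code (SOC) sketch. All data and feature names are
# hypothetical; nearest-centroid is a stand-in for the paper's learned mapping.

# Knowledge base: class -> semantic code (is_animal, is_large).
KNOWLEDGE_BASE = {
    "dog": (1, 0),
    "truck": (0, 1),
    "bear": (1, 1),
    "car": (0, 0),  # novel class: never appears in TRAIN below
}

# Made-up 2-D training inputs, for the seen classes only.
TRAIN = [
    ((0.9, 0.2), "dog"), ((0.8, 0.1), "dog"),
    ((0.1, 0.9), "truck"), ((0.2, 0.8), "truck"),
    ((0.9, 0.9), "bear"), ((0.8, 0.8), "bear"),
]


def centroid(points):
    """Mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_feature_models(train, knowledge_base):
    """Per semantic feature, store (centroid where it is 0, centroid where it is 1)."""
    n_features = len(next(iter(knowledge_base.values())))
    models = []
    for j in range(n_features):
        neg = [x for x, label in train if knowledge_base[label][j] == 0]
        pos = [x for x, label in train if knowledge_base[label][j] == 1]
        models.append((centroid(neg), centroid(pos)))
    return models


def predict_class(x, models, knowledge_base):
    # Stage 1: input -> predicted semantic code, one bit per feature.
    code = tuple(int(sq_dist(x, pos) < sq_dist(x, neg)) for neg, pos in models)
    # Stage 2: nearest class in the knowledge base by Hamming distance.
    return min(
        knowledge_base,
        key=lambda c: sum(a != b for a, b in zip(code, knowledge_base[c])),
    )


models = fit_feature_models(TRAIN, KNOWLEDGE_BASE)
print(predict_class((0.1, 0.1), models, KNOWLEDGE_BASE))  # car
```

The point of the toy is that "car" is predicted without a single training example labeled "car": the model only ever learned the semantic features, and the knowledge base does the extrapolation to the novel class.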

For a definition of zero-shot learning, I suggest the paper Zero-Shot Learning -- A Comprehensive Evaluation of the Good, the Bad and the Ugly[1] (which, you'll note, predates FLAN by 4 years). I'll make one point though; this work states:

> Zero-shot learning assumes disjoint training and test classes

But I don't think that's entirely accurate, given the abstraction case we discussed above. This is mostly semantics, though, and for the dataset they generate it isn't especially relevant. The more general notion of zero-shot doesn't require disjoint classes, only that the testing set isn't a subset of the training set (which is always true in the disjoint setting). (Side note: notice that they provide a train/val/test split instead of train/test. This is kinda important.) Note that my critique is consistent with another survey work[2] (which also predates FLAN):

> Definition 1.1 (Zero-Shot Learning). Given labeled training instances D^{tr} belonging to the seen classes S, zero-shot learning aims to learn a classifier f^u(·) : X → U that can classify testing instances X^{te} (i.e., to predict Y^{te} ) belonging to the unseen classes U.
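The distinction between "disjoint" and "not a subset" can be written down as a small set check. This is only a sketch of the condition as argued in this comment, not a definition taken from either survey; the function name `eval_setting`, the return labels, and the example class lists are all made up for illustration.

```python
def eval_setting(train_classes, test_classes):
    """Label an evaluation by how its test classes relate to the training classes.

    Hypothetical helper: the labels follow the argument in this thread,
    not any standard taxonomy.
    """
    train, test = set(train_classes), set(test_classes)
    if not test:
        raise ValueError("test_classes must be non-empty")
    if test <= train:
        # Every test class was seen in training: a subset of the trained task.
        return "not zero-shot"
    if test.isdisjoint(train):
        # The strict setting: no test class was seen in training.
        return "zero-shot (disjoint)"
    # Some overlap, but at least one test class is unseen.
    return "zero-shot (overlapping)"


seen = ["airplane", "automobile", "bird", "cat", "deer",
        "dog", "frog", "horse", "ship", "truck"]

print(eval_setting(seen, ["cat", "dog", "ship"]))    # not zero-shot
print(eval_setting(seen, ["bear", "tractor"]))       # zero-shot (disjoint)
print(eval_setting(seen, ["cat", "dog", "bear"]))    # zero-shot (overlapping)
```

The first call is the "subset of the trained task" case, the second is the disjoint setting from [1], and the third is the case the disjoint definition leaves out.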

As to FLAN, we should mention that the GPT-3[3] work puts quotes around "zero-shot", likely because the authors recognize its bastardization. But naming things is one of Bambrick's two hard problems. Notice that they also clearly define their usage. You'll notice that FLAN does not do this!

My claim that LLMs are not zero-shot learners rests on the fact that they have actually been trained on all the domains they have been "zero-shot evaluated" on. FLAN gives as examples of a "zero-shot" task: “Is the sentiment of this movie review positive or negative?” or “Translate ‘how are you’ into Chinese.” But what you have to ask yourself is whether these questions themselves are in the training set, since that determines whether our requirement is met. If they are, this would at best be the "transductive setting," which I think we can now agree is not a great thing to refer to as "zero-shot". The problem is that these questions are very likely in the training datasets, as those incorporate things like Reddit and HackerNews, where we can definitely find explicit labels for movie reviews as well as some translation tasks (common on language subreddits). That's the issue here. Just because you aren't aware you have trained a model to perform a specific task doesn't mean that you didn't, and thus doesn't mean you actually performed a zero-shot task.
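One crude way to ask "is this question itself in the training set?" is to measure word n-gram overlap between the evaluation prompt and the training text. The sketch below is a naive illustration of that idea under stated assumptions: the helper names, the choice of n=4, and the one-document "corpus" are all hypothetical, and a surface-level match like this would miss paraphrases, so a low score does not show the task is unseen.

```python
import re


def word_ngrams(text, n):
    """Set of lowercase word n-grams in text (punctuation ignored)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(prompt, corpus, n=4):
    """Fraction of the prompt's n-grams that occur anywhere in the corpus."""
    prompt_grams = word_ngrams(prompt, n)
    if not prompt_grams:
        return 0.0
    corpus_grams = set()
    for doc in corpus:
        corpus_grams |= word_ngrams(doc, n)
    return len(prompt_grams & corpus_grams) / len(prompt_grams)


# Hypothetical stand-in for scraped forum text.
corpus = [
    "Honest question: is the sentiment of this movie review positive "
    "or negative? I read it as positive.",
]

seen_prompt = "Is the sentiment of this movie review positive or negative?"
unseen_prompt = "Translate 'how are you' into Chinese."

print(overlap_fraction(seen_prompt, corpus))    # 1.0
print(overlap_fraction(unseen_prompt, corpus))  # 0.0
```

A score near 1.0 means the prompt appears close to verbatim in the training text, which is exactly the situation where calling the evaluation "zero-shot" is hard to defend.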

[0] Zero-Shot Learning with Semantic Output Codes https://www.cs.toronto.edu/~hinton/absps/palatucci.pdf

[1] Zero-Shot Learning -- A Comprehensive Evaluation of the Good, the Bad and the Ugly https://arxiv.org/abs/1707.00600

[2] A Survey of Zero-Shot Learning: Settings, Methods, and Applications https://dl.acm.org/doi/10.1145/3293318

[3] Language Models are Few-Shot Learners https://arxiv.org/abs/2005.14165


TeMPOraL · 2y ago

I appreciate you giving a solid overview and references on the use and context of that term. However, in this case, I fear that ship has long sailed. The more relevant paper would be [0], the title of which is, I believe, the source of the "LLMs are zero-shot [learners / reasoners / whatever]" phrase that's been reaching meme status and now shows up in training materials.

[0] - https://arxiv.org/abs/2205.11916 - "Large Language Models are Zero-Shot Reasoners"

godelski (OP) · 2y ago

I'm not convinced that ship has sailed. Really it is just a few players doing it, mostly Google. This has also only been happening for a few years, so it is not hard to "turn back", if you will (I don't think the ship has sailed; I think people are abusing the term and using the miscommunication to drive hype and revenue). There aren't many players in the LLM space, and people outside it are still using the term correctly. The problem is that the public is picking up the phrase and actively defending it. Which is rather weird, don't you think? That non-researchers are happy to argue about definitions with researchers?

The reason I push against this is that the misuse actually hinders our ability to interpret these models. Generalization is one of the most important goals in ML, and even more so when we start to discuss alignment. What is happening is the erosion of our metrics, which we've already become substantially reliant upon. Google is well known to make generalization claims by doing exactly this; the JFT -> ImageNet task is quite common. Our field is already extremely noisy, and this just makes it more so. The abuse makes gains seem much larger than they actually are, when in reality all that is happening is that we are decreasing the temperature on the density function.
