I'm really wondering when LLMs are going to replace humans for ~all first-pass social media and forum moderation.
Obviously humans will always be involved in coming up with moderation policy and judging gray areas and refining moderation policy... but at what point will LLMs do everything else more reliably than humans?
6 months from now? 3 years from now?
> Obviously humans will always be involved in coming up with moderation policy
This comment seems to respond to:
> Obviously humans will always be involved in [moderation]
The first statement seems to hold true -- at least, it's a more "obvious" conclusion. What scenario is required to fully remove humans from coming up with moderation policy? These companies who are so eager to automate certain tasks will likely still be staffed by humans who would make the decision to automate certain tasks.
This is what is inevitably going to happen. There will be some kind of service provider (probably one of Apple, Google, Microsoft, Amazon) who will verify who you are via official documents such as passport and driving license. When you sign up to a smaller company's service they'll check with the providers to see if you're a genuine person and if so then they'll let you join, if not you'll be blocked. You might be able to use the forum with an anonymous name but the company will always know who you are and if you use your account to spam or abuse people you'll get blacklisted and reported to the police. Any service that doesn't implement the model will be an unusable hell hole of bots and spam.
The internet will splinter into two and you'll have the "verified net" and the "unverified net" with the latter basically becoming a second dark web. To be honest, I think this will probably be a good thing. I think the vast majority of people will spend most of their time on the verified net, which will actually be a more pleasant place to be because people won't be able to get away with what they can now without real consequences in physical reality.
That being said, there are plenty of ways it could go wrong - if those accounts get hacked and the owner of the account can't prove it then we could see innocent people going to jail. Or state actors could hack the accounts of citizens they see as problematic and frame them. But all that stuff could happen today anyway - the verification or lack thereof doesn't make that much difference but does substantially reduce the use of bots.
I was thinking more of a browser level plugin, in which content from unverified users would be blurred out with an "unverified user - click to view content" type of system. Everything you post will be connected to your identity, so you would be liable for deepfakes and the like. You would also have an activity rating connected to your identity, so other people could see if you are posting 1 piece of content per hour, or 1000.
Maybe a personal media manager connected to the browser so all of the public content that is "signed" by you will be easily viewable by you, and if someone posts something that is not actually yours under your identity somehow, you will easily be able to rescind the signature.
Just random shower thoughts...
It seems inevitable that some sort of signature or identity proof will be necessary soon to participate in most online forums.
Either esoteric networking between people or straight up government/private entity issued multi factor authentication.
I just really hope we find a way to codify it without scanning people's eyeballs into the blockchain like the guy in charge of the world's first AGI wants to do.
Twitter gets about 500M tweets per day, average tweet is 28 characters. So that’s 14B characters per day. Converting to tokens at around 4 char/token that’s around 3.5B tokens per day. If GPT 3.5 turbo pricing is representative it will cost about $0.0015/thousand tokens which is $5k per day. So it’s possible now.
However, you can probably get that cost down a lot with your own models, which also has the benefit of not being at the mercy of arbitrary API pricing.
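The arithmetic above can be reproduced with a few lines. This is only a back-of-the-envelope sketch using the figures quoted in the comment; it counts the tweet text alone, so any instruction prompt and output tokens would come on top, and the function and constant names are made up for illustration.

```python
# Figures quoted in the comment above; all of them are rough estimates.
TWEETS_PER_DAY = 500_000_000
AVG_CHARS_PER_TWEET = 28
CHARS_PER_TOKEN = 4            # rule-of-thumb conversion
PRICE_PER_1K_TOKENS = 0.0015   # USD, the GPT-3.5 Turbo price cited above


def daily_cost_usd(tweets=TWEETS_PER_DAY,
                   avg_chars=AVG_CHARS_PER_TWEET,
                   chars_per_token=CHARS_PER_TOKEN,
                   price_per_1k=PRICE_PER_1K_TOKENS):
    """Estimate the daily cost of sending every tweet through the model once."""
    total_chars = tweets * avg_chars            # 14B characters
    total_tokens = total_chars / chars_per_token  # 3.5B tokens
    return total_tokens / 1000 * price_per_1k


print(f"${daily_cost_usd():,.0f} per day")  # $5,250 per day
```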
imagine Google's general approach to customer service/moderation, but applied all over the place by companies small and large
I shudder at the thought
I don't know, but generally speaking with technological progress, while we lose some things we gain more things. It's important to think not just what technology gets rid of, but what it enables.
It's getting tiring reading all the LLM takes from people here who clearly don't use or understand them at all. So many still stuck in the "predicting next token" nonsense, as if humans don't do that too
How does a human decide which word to use in your mind? Magic?
No, it's a logically based biological/neurological process through which at the end of it, you've decided on a word. They are both forms of computing that can produce largely indistinguishable output... doesn't matter that one is biological and the other isn't
It's among the leading theories in neuroscience for how our brains work https://en.wikipedia.org/wiki/Predictive_coding
How it works -- imagine you have these sentences:
“Acorn is a tree” and “acorn is an app”
You essentially keep record of all word to word relations internal to a sentence:
- acorn: is, a, an, app, tree, etc.
Now you repeat this for a few gigabytes of text. You’ll end up with a huge map of “word connections”.
You now take the top X words that other words connect to (e.g. 16384). Then you create a vector of 16384 connections, where each word is encoded as 1,0,1,0,1,0,0,0, … (the first position is the most connected-to word, the second position the second most, etc.; 1 indicates “is connected” and 0 indicates “no such connection”).
You’ll end up with a vector that has a lot of zeroes — you can now sparsify it (I.e. store only the positions of the ones).
You essentially have fingerprints now — what you can do now is to generate fingerprints of entire sentences, paragraphs and texts. Remove the fingerprints of the most common words like “is”, “in”, “a”, “the” etc. and you’ll have a “semantic fingerprint”. Now if you take a lot of example texts and generate fingerprints off it, you can end up with a very small amount of “indices” like maybe 10 numbers that are enough to very reliably identify texts of a specific topic.
Sorry, couldn’t be too specific as I’m on the go - if you’re interested drop me a mail.
We’re using this to categorize literally tens of gigabytes per second with 92% precision into more than 72 categories.
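The steps described above (co-occurrence map, top-X context words, sparse binary fingerprints, stop-word removal) can be sketched roughly as follows. This is a toy illustration and not the commenter's system: the stop-word list, the function names, and the choice to build a text fingerprint as the union of its word fingerprints are all assumptions made for the example.

```python
from collections import Counter, defaultdict

STOP_WORDS = {"is", "in", "a", "an", "the"}  # assumed list, for illustration only


def build_cooccurrence(sentences):
    """Map each word to the set of other words it shares a sentence with."""
    related = defaultdict(set)
    for sentence in sentences:
        words = set(sentence.lower().split())
        for word in words:
            related[word] |= words - {word}
    return related


def top_context_words(related, size):
    """Give a vector position to the `size` words most other words connect to."""
    counts = Counter()
    for neighbours in related.values():
        counts.update(neighbours)
    ranked = sorted(counts, key=lambda w: (-counts[w], w))  # ties broken by name
    return {word: position for position, word in enumerate(ranked[:size])}


def word_fingerprint(word, related, index):
    """Sparse fingerprint: only the positions of the 1 bits are stored."""
    return sorted(index[n] for n in related.get(word, ()) if n in index)


def text_fingerprint(text, related, index):
    """Union of the word fingerprints, skipping the most common words."""
    bits = set()
    for word in text.lower().split():
        if word not in STOP_WORDS:
            bits.update(word_fingerprint(word, related, index))
    return sorted(bits)


related = build_cooccurrence(["Acorn is a tree", "Acorn is an app"])
index = top_context_words(related, size=16)
print(word_fingerprint("acorn", related, index))
```

With a real corpus, `size` would be something like 16384 and the fingerprints would be far sparser than in this two-sentence example.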
edit: short version: not semantics and not a fingerprint :)
We may only be using 16k vector values but that doesn’t mean we only have a vocab of 16k -- our vocab is more around 1.9 million words, each described by a sparse fingerprint of 16k.
The 92% is an average and not the exact accuracy across all categories; the accuracy varies by category as every category is represented by its own filter.
That said, here's some reading material on the underlying ideas:
- https://en.wikipedia.org/wiki/Semantic_folding
- https://arxiv.org/pdf/1511.08855.pdf ("Semantic Folding Theory And its Application in Semantic Fingerprinting")
This is _not_ TF-IDF. Once you have built the "relation fingerprints" of each word, the fingerprint lookup complexity is O(1) as you'll essentially only load a massive LUT of type HashMap<String, Vec<u16>> (or u32 if you go above 65,535 positions). [pro tip: our LUT has the type HashMap<Vec<String>, Vec<u16>> as our impl also considers bigrams, trigrams, quadgrams]
Unfortunately I can't get extremely specific, but we're also feeding these [u8; 16384] vecs into an HTM w/ spatial pooler; feeding one word-SDR aka fingerprint into the HTM at a time allows you to leverage the HTM to make predictions for the most likely next word-SDR aka the fingerprint of the next word -- if you generalise this on a sentence level, you can use the cosine distance between the actual text-SDR aka fingerprint of the next sentence and the predicted text-SDR out of the HTM to semantically segment paragraphs in a continuous stream of text.
This allows us to segment SOCMINT user2user conversations into individual semantically connected packages of text / messages that can be marked by scenario-specific heuristics to be additionally analysed by a downstream system.
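The HTM itself is out of scope here, but the comparison step can be sketched. The snippet below assumes fingerprints are stored sparsely as collections of active bit positions and that the predicted fingerprints are simply handed in from some upstream predictor; the threshold value and function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two binary SDRs given as active bit positions."""
    a, b = set(a), set(b)
    if not a or not b:
        return 1.0  # treat an empty fingerprint as maximally distant
    # For 0/1 vectors the dot product is the overlap and each norm is sqrt(#ones).
    return 1.0 - len(a & b) / math.sqrt(len(a) * len(b))


def segment_boundaries(predicted, actual, threshold=0.7):
    """Indices where the actual sentence fingerprint strays far from the prediction."""
    return [
        i
        for i, (pred, act) in enumerate(zip(predicted, actual))
        if cosine_distance(pred, act) > threshold
    ]


predicted = [{1, 2, 3}, {1, 2}]
actual = [{1, 2, 3}, {3, 4}]
print(segment_boundaries(predicted, actual))  # [1]
```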
But, they are zero/few shot classifiers. Meaning that you can get your classification running and reasonably accurate now, collect data and switch to a fine-tuned very efficient traditional ML model later.
They're very, very good at creating a new, novel solution - but specially trained ML models will rule.
Literally point 3 in the article.
> But, they are zero/few shot classifiers
This is __NOT__ true. Zero-shot means out of domain, and if we're talking about text trained LLMs, there really isn't any text that is out of domain for them because they are trained on almost anything you can find on the internet. This is not akin to training something on Tiny Shakespeare and then having it perform sentiment analysis (classification) on Sci-Fi novels. Similarly, training a model on JFT or LAION does not give you the ability to perform zero shot classification on datasets like COCO or ImageNet, since the same semantic data exists in both datasets. I don't know why people started using this term to describe domain adaptation or transfer learning, but it is not okay. Zero-shot requires novel classes, and subsets are not novel.
To clarify, what comes from that paper? The claim that LLMs are zero-shot learners (yes) or the term zero-shot (no[0]).
> I believe the idea is that the LLM was not trained on the task in question
Not quite. We'll see in [0] that the definition is
>> We consider the problem of zero-shot learning, where the goal is to learn a classifier f : X → Y that must predict novel values of Y that were omitted from the training set. To achieve this, we define the notion of a semantic output code classifier (SOC) which utilizes a knowledge base of semantic properties of Y to extrapolate to novel classes.
To clarify, this means that their goal is to obtain a classifier f:X → Y but that they train f':X → Z, where Z ⊂ Y. You then test this by performing f':X → A where A ⊂ Y and A ⊄ Z. To make this clearer, their experiments classify 60 words such as bear, dog, cat, truck, car, airplane. You'll notice there are two metaclasses here (there are more): animals and vehicles. The second dataset included 128 _semantic_ features (e.g. size/shape/surface properties/usage) about the previous words and that's what they tested against. Notice how the abstraction level increases. Note that Z ⊂ A is acceptable, but not the other way around; this should clarify my LAION -> ImageNet example. The reason that this is important is because zero-shot is telling us about the model's ability to generalize, as the model learns additional and more _abstracted_ discriminating boundaries within the data than were explicitly trained for. It is not very informative to learn that a model can perform a subset of its trained task (see CIFAR-5 example in sibling comment) -- though this can still be interesting but for other reasons. I should mention that there is a "transductive setting" for zero-shot, where unlabeled versions of the novel classes are provided during training but this is explicitly stated when done and there is some contention about the utility of this. This is better referred to as "transductive testing". Generative models also have some contention as density estimators will localize similar data, which is to say that they classify (this is a consequence of the training method and so can be argued that we've explicitly directed the machine to learn this). This relates directly to the transductive point.
For a definition of zero-shot learning, I suggest the paper Zero-Shot Learning -- A Comprehensive Evaluation of the Good, the Bad and the Ugly[1] (which, you'll note, predates FLAN by 4 years). I'll make one point though, this work states
> Zero-shot learning assumes disjoint training and test classes
But I don't think that's entirely accurate, as we previously discussed our abstraction case. This is more semantics though and for the case of the dataset they generate it isn't extremely relevant. But the more generalized notion of zero-shot doesn't necessitate disjoint but just that the testing set isn't a subset of the training set (which is always true of the disjoint setting). (Side note: notice that they provide a train/val/test split instead of train/test. This is kinda important) Note that my critique is consistent with another survey work[2] (which also predates FLAN)
> Definition 1.1 (Zero-Shot Learning). Given labeled training instances D^{tr} belonging to the seen classes S, zero-shot learning aims to learn a classifier f^u(·) : X → U that can classify testing instances X^{te} (i.e., to predict Y^{te} ) belonging to the unseen classes U.
As to FLAN, we should mention that the GPT-3[3] work uses quotes around "zero-shot" as they likely recognize its bastardization. But naming things is one of Bambrick's two hard problems. Notice that they also clearly define their usage. You'll notice that FLAN does not do this! My claim about LLMs not being zero-shot learners is that they have actually been trained on all domains that they have been "zero-shot evaluated" on. FLAN gives an example of a "zero-shot" task as: “Is the sentiment of this movie review positive or negative?” or “Translate ‘how are you’ into Chinese.” But what you have to ask yourself is whether these questions themselves are in the training set, as this is what our requirement hinges on; if they are, this would at best be that "transductive setting," which I think we can now agree is not a great thing to refer to as "zero-shot". The problem is that these questions are very likely in the training datasets, as those incorporate things like Reddit and HackerNews, where we can definitely find explicit labels for movie reviews as well as some translation tasks (common on language subreddits). That's the issue here. Just because you aren't aware you have trained a model to perform a specific task doesn't mean that you didn't, and thus doesn't mean you actually performed a zero-shot task.
[0] Zero-Shot Learning with Semantic Output Codes https://www.cs.toronto.edu/~hinton/absps/palatucci.pdf
[1] Zero-Shot Learning -- A Comprehensive Evaluation of the Good, the Bad and the Ugly https://arxiv.org/abs/1707.00600
[2] A Survey of Zero-Shot Learning: Settings, Methods, and Applications https://dl.acm.org/doi/10.1145/3293318
[3] Language Models are Few-Shot Learners https://arxiv.org/abs/2005.14165
Respectfully, I disagree. I have used LLMs on actually novel tasks for which there aren’t any datasets out there. They “get it”.
> I don't know why people started using this term to describe the domain adaptation or transfer learning, but it is not okay. Zero-shot requires novel classes, and subsets are not novel.
Respectfully, I disagree.
Zero-shot is perfectly valid because there is no backpropagation or weight change involved. Causal LLMs are meta-learners due to the attention mechanism and the autoregressive nature of the model. These two change the effective weight of the matrices.
For all sequences of inputs and all possible weights; there exists an instantiation of a neural network without attention that produces identical vectors for the current token given only the previous token.
Do the math, or read the paper “LLMs are meta learners”.
Therefore, for all tasks, giving the model examples of inputs changes its effective weights without actually modifying it, it is perfectly valid for “zero shot learning” because you didn’t do backprop of any kind, you merely did input transformations / preprocessing.
Can you give an example so that we may better discuss or that I can adequately update my understanding? But I will say that simplifying this down to "just trained to predict the next token" is not accurate as it does not account for the differences in architectures and cost functions which dramatically affect this statement due to the differences in their biases. As a clear example, training an image model on likelihood does not guarantee that the model will produce high fidelity samples[0]. But it will be better at imputation or classification. Some other helpful references[1,2]
> Zero-shot is perfectly valid because there is no backpropagation or weight change involved.
I disagree with this. What you have described is still within the broader class of fine tuning. Note that zero-shot is also tuning. I can make this perfectly clear with a simple example that is directly related to my previous argument. ``Suppose we train a model on the CIFAR-10 dataset. Then we "zero-shot" evaluate it on CIFAR-5, where we've just removed 5 random classes.`` I think you'll agree that it should be unsurprising that the model performs well on this second task. This is exactly the "Train on LAION then 'zero-shot' classification on ImageNet" task we commonly see. Subsets are not a clear task change.
> These two change the effective weight of the matrices.
I'm having a difficult time understanding your argument as this directly contradicts your first sentence. I wouldn't even make the lack of weight change a requirement for zero-shot learning as the intent is really that we do not need to directly change. If a model has enough general knowledge and we do not need to modify the parameters explicitly through providing more training (i.e. using a cost function and {back,forward}prop), then this is sufficient (randomly changing parameters, adding non-trainable parameters like activations, or pruning is also acceptable. As well as explicitly what you mentioned). The point comes down to requiring no additional training for __additional domain{,s}__. The training part is not the important part here and not what is in question.
My point is explicitly about claiming that subdomains do not constitute zero-shot learning. If you disagree in what I have claimed are subdomains, then that's a different argument. I'm not arguing against the latter points because that's also not arguing against what I claimed. But I will say that "just because you didn't use backprop doesn't mean it is zero-shot" and if you disagree, then note that you have to claim that the CIFAR-5 example is "zero-shot."
Tldr: A -> B doesn't require that B -> A
[0]A note on the evaluation of generative models: https://arxiv.org/abs/1511.01844 (link for also obtaining slides and code: http://theis.io/publications/17/)
Also worth looking at many of the works that cite this one: https://www.semanticscholar.org/paper/A-note-on-the-evaluati...
[1a] Assessing Generative Models via Precision and Recall: https://arxiv.org/abs/1806.00035
[1b] Improved Precision and Recall Metric for Assessing Generative Models: https://arxiv.org/abs/1904.06991
[2] The Role of ImageNet Classes in Fréchet Inception Distance: https://arxiv.org/abs/2203.06026
Sidestepping the fact that this isn't really how the term is used with models these days, I don't know about that.
Zero shot is defined as being able to output predictions for classes they were not trained on.
It doesn't mean the input data can't be in ml task domain but that the model was not trained on this particular ML task and/or classes.
I'm going to refer you to the sibling comments as they stated similar things and I answered them in depth and do not wish to repeat myself.
But to summarize:
Zero-shot := Goal of f:X → Z but train f':X → Y, where Y ⊂ Z. We test on A ⊂ Z, where A ⊄ Y (sometimes the definition is A ∩ Y = ∅, but I'm not being as strict)
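As a small sanity check of that definition, the two variants (test classes merely not contained in the training classes, versus fully disjoint from them) can be written as set operations. This is only an illustration of the summary above; the function name and the `strict` flag are made up for the example.

```python
def is_zero_shot(train_classes, test_classes, strict=False):
    """Check a train/test class split against the definition above.

    Loose version:  A ⊄ Y     (at least one test class was never trained on).
    Strict version: A ∩ Y = ∅ (no test class was trained on).
    """
    train, test = set(train_classes), set(test_classes)
    if strict:
        return train.isdisjoint(test)
    return not test <= train


cifar10 = {"airplane", "automobile", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck"}
cifar5 = {"bird", "cat", "deer", "dog", "frog"}

print(is_zero_shot(cifar10, cifar5))              # False: a subset is not novel
print(is_zero_shot(cifar10, {"cat", "zebra"}))    # True under the loose definition
print(is_zero_shot(cifar10, {"cat", "zebra"}, strict=True))  # False: "cat" was seen
```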
> but that the model was not trained on this particular ML task and/or classes.
I'm going to need explicit clarification as to this. Explicitly or implicitly? See sibling comments and note about likelihood and density estimators w.r.t. classification.
It would be nice to see how this "complex" approach compares against a "simple" TF-IDF + RF or SVM.
All I saw were attempts to reproduce some chatgpt output.
If you want to add a new class, or to update an existing one, it requires training a new model. Sometimes this is ok and sometimes it's not. This is why we generally prefer a hybrid approach where some classes are using traditional models (BERT based) while others are determined by the LLM.
What I'd like to see more of is systematic comparison between chatgpt and classic models. I was hoping to see a bit of this in this article and was disappointed.
In a future article, we're planning on posting accuracy comparisons as well, but here we want to evaluate a few other architectures for comparison. For example, at 1TPS with 1k tokens, chat-gpt-turbo would cost almost $5k vs a simpler BERT model you could run for under $50.
This is probably very obvious to some people, but a lot of people's first experience with any sort of AI is often an LLM, so this is just the first of many posts we hope to share.
If I had to take a guess, I suspect the LLM might perform a touch better, but we’re talking fractional percent better. Which is fine, if you have the volume, but a wash otherwise
https://www.artisana.ai/articles/gpt-4-outperforms-elite-cro...
3.5 (what is used here) is better than crowd workers https://arxiv.org/abs/2303.15056
Per the article: "outperformed the most skilled crowdworkers" on nuanced (but not highly technical) tasks like sentiment labeling.
By definition, it can't outperform the expert ensemble because that's where the gold labels come from.
>By definition, it can't outperform the expert ensemble because that's where the gold labels come from.
The ensemble, no, but it can outperform an individual expert trying to solve it. But yes, the benchmarks are biased toward the experts here.
If you tell it "Return what this says about sharks or nothing if it does not mention them", it will mess up.
User text: "Blah blah ... Sharks ... Surfing ..." Instruction: Return a JSON object containing an array of all sentences in the user text which mention sharks directly or by implication. Response: {"list_of_shark_related_sentences": [
Stop token: ']}'
It'll try to complete the JSON response and it'll try to end it by closing the array and object as shown in the stop token. This severely limits rambling, and if it does add a spurious field it'll (usually) still be valid JSON and you can usually just ignore the unwanted field.
wrt OpenAI, text-davinci-003 handles this well, the other models not so much.
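The prefix-plus-stop-token trick above can be sketched without committing to any particular API. In this example the model call is left out entirely: `completion` stands for whatever text the model returned, and it is assumed (as is typical for stop sequences) that the stop token itself is not included in that text. The function names are hypothetical.

```python
import json

JSON_PREFIX = '{"list_of_shark_related_sentences": ['
STOP_TOKEN = "]}"


def build_prompt(user_text):
    """Prompt that ends with the opening of the JSON response we want completed."""
    return (
        f"User text: {json.dumps(user_text)}\n"
        "Instruction: Return a JSON object containing an array of all sentences "
        "in the user text which mention sharks directly or by implication.\n"
        f"Response: {JSON_PREFIX}"
    )


def parse_completion(completion):
    """Re-attach the prefix and stop token, then parse.

    Returns the list of sentences, or None if the result is not valid JSON.
    Spurious extra fields are simply ignored.
    """
    raw = JSON_PREFIX + completion + STOP_TOKEN
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return data.get("list_of_shark_related_sentences", [])


print(parse_completion('"Sharks were spotted near the beach."'))
print(parse_completion('"Sharks bite."], "note": ["extra field"'))
```

The second call shows the case mentioned above: the model adds an unwanted field, the result is still valid JSON, and the extra field is dropped.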
Funnily enough, there is a certain propensity for it to output round numbers (50, 100, etc.) so I have to ask it not to do this and provide examples ("like 27, 63, or 4"). Now that I think about it I should probably randomize those.
My hypothesis here is that due to RLHF, there's likely some implicit learning that tangentially related content is better than no content.
Given that, you'd likely still get better results with your schema being:
"string | null" so the LLM can output a null instead of "" since there is probably not as much training data that gives "" high log prob values.
But we're looking forward to evaluating function calling, and seeing what the metrics show!
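On the consuming side, a "string | null" field is easy to handle so that null, a missing key, and an empty string all mean "nothing found". A minimal sketch, with a hypothetical field name `sharks`:

```python
import json


def extract_optional(raw_json, field):
    """Return the field's text, or None when it is null, missing, or blank."""
    value = json.loads(raw_json).get(field)
    if isinstance(value, str) and value.strip():
        return value
    return None


print(extract_optional('{"sharks": null}', "sharks"))
print(extract_optional('{"sharks": "Two sharks were seen."}', "sharks"))
```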
https://letscooktime.com/Blog/ai,/machine/learning,/chatgpt,...
Hopefully this saves you some time!
I had a schema with a string enum property to categorise some inputs. One of the category names was "media/other" or something to that effect. Sometimes the output would stop at just media even though it wasn't a valid option in the schema.
Basically, give the LLM a schema that is loose enough for the LLM to expand where it feels expansion is needed. Saying always "return a number" is super limiting if the LLM has figured out you need a range instead. Saying "always populate this field" is silly because sometimes the field doesn't need to be populated.
It was turning out to be expensive earlier, but optimising the prompt a lot, reduced pricing by OpenAI, and now also being able to run Guanaco 13/33B locally have made it much more accessible in terms of pricing for millions of pieces of text.
Now, certain problem spaces may be large enough to require models whose runtime makes it uneconomical to run them locally, but ML is still a game of heuristics, so each problem requires some experimentation.
We actually do CPU inference. The SBERT models have a pretty small memory footprint -- you can fit a couple models on a t2.medium instance.
On a C6.Large you can get 75ms inference. T2.medium is more around 100-200ms
LLMs are excellent reasoning engines. But nudging them to the desired output is challenging. They might return categories outside the ones that you determined. They might return multiple categories when you only want one (or the opposite: a single category when you want multiple). Even if you steer the AI toward the correct answer, parsing the output can be difficult. Asking the LLM to output structured data works 80% of the time. But the 20% of the time that your code fails to parse the response takes up 99% of your time and is unacceptable for most real-world use cases.
[0] https://twitter.com/mattrickard/status/1678603390337822722
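The failure modes listed above (labels outside the allowed set, several labels when one was wanted) can at least be detected mechanically before they reach downstream code. A minimal sketch, assuming the model returns a comma- or newline-separated string; the label set and function name are hypothetical.

```python
ALLOWED = ("billing", "shipping", "media/other")  # hypothetical label set


def parse_categories(raw, allowed=ALLOWED, multi=False):
    """Map free-form model output onto the allowed labels.

    multi=True  -> list of valid labels (possibly empty).
    multi=False -> the single valid label, or None if there are zero or several.
    """
    parts = raw.replace("\n", ",").split(",")
    candidates = [part.strip().strip("\"'.").lower() for part in parts]
    valid = list(dict.fromkeys(c for c in candidates if c in allowed))
    if multi:
        return valid
    return valid[0] if len(valid) == 1 else None


print(parse_categories("Billing"))                        # billing
print(parse_categories("media"))                          # None: not a valid option
print(parse_categories("billing, shipping"))              # None: ambiguous
print(parse_categories("billing, shipping", multi=True))  # ['billing', 'shipping']
```

Returning `None` rather than guessing leaves room for a retry or a fallback classifier, which is a design choice of this sketch rather than something the quoted post prescribes.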
If you're using this to direct messages to approximately the correct department, it doesn't have to be that complicated.
If you're doing this to evaluate customer sentiment, you could probably just select a few hundred messages at random and read them. (There are many "big data" problems which are only big due to not sampling.)
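To put a number on "a few hundred messages": under the usual normal approximation with the worst case p = 0.5, a simple random sample of 400 gives roughly a ±5 percentage point margin of error at 95% confidence. The sketch below draws such a sample and computes that margin; the function names and the default sample size are illustrative choices.

```python
import math
import random


def sample_messages(messages, n=400, seed=0):
    """Draw a simple random sample (without replacement) of up to n messages."""
    rng = random.Random(seed)
    return rng.sample(messages, min(n, len(messages)))


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion estimated from n samples."""
    return z * math.sqrt(p * (1 - p) / n)


messages = [f"message {i}" for i in range(100_000)]
subset = sample_messages(messages)
print(len(subset), f"±{margin_of_error(len(subset)):.1%}")  # 400 ±4.9%
```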