Classifying customer messages with LLMs vs traditional ML

withinboredom2y ago

Did you just describe AOL??

JustBreath2y ago

The worst part is social media networks aren't necessarily against AI/bot engagement since it greatly fluffs their numbers and keeps their users occupied.

It seems inevitable that some sort of signature or identity proof will be necessary soon to participate in most online forums.

Either esoteric networking between people or straight up government/private entity issued multi factor authentication.

pradn2y ago

Isn’t there a limit to this when one requires an account to be tied to a phone number? Perhaps pseudonymous posting is on a countdown clock.

doliveira2y ago

Ironically for crypto bros, I think the way forward will be to codify the real-world trust structures into the digital world. The future is trustful.

I just really hope we find a way to codify it without scanning people's eyeballs into the blockchain like the guy in charge of the world's first AGI wants to do.

Enginerrrd2y ago

Yeah the ability to astroturf (at massive scale!) product reviews or political opinions as comments in reddit posts and the like will be sort of horrifying. The dead Internet hypothesis may yet come true.

soultrees2y ago

Maybe that’s the method behind Reddit’s API madness this whole time. (/s). Now only the hugest brands can run their own bots

janalsncm2y ago

Back of the envelope calculation says it could be possible now.

Twitter gets about 500M tweets per day, average tweet is 28 characters. So that’s 14B characters per day. Converting to tokens at around 4 char/token that’s around 3.5B tokens per day. If GPT 3.5 turbo pricing is representative it will cost about $0.0015/thousand tokens which is $5k per day. So it’s possible now.

However, you can probably get that cost down a lot with your own models, which also has the benefit of not being at the mercy of arbitrary API pricing.
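
The back-of-the-envelope numbers above can be reproduced in a few lines. This is only a sketch using the figures quoted in the comment (500M tweets per day, 28 characters per tweet, about 4 characters per token, $0.0015 per thousand tokens); the function name is made up for illustration.

```python
def daily_llm_cost(items_per_day, chars_per_item, chars_per_token, usd_per_1k_tokens):
    """Estimate the daily cost of producing a given volume of text with a paid API."""
    chars = items_per_day * chars_per_item      # total characters per day
    tokens = chars / chars_per_token            # rough character-to-token conversion
    return tokens / 1000 * usd_per_1k_tokens    # pricing is quoted per thousand tokens

cost = daily_llm_cost(500_000_000, 28, 4, 0.0015)
print(f"${cost:,.0f} per day")
```

With those inputs the estimate lands at roughly $5k per day, matching the figure above.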

zht2y ago

this is some black mirror stuff

imagine Google's general approach to customer service/moderation, but applied all over the place by companies small and large

I shudder at the thought

Xenoamorphous2y ago

I’ve found that it’s pretty much impossible to talk to a person in most customer services in the past few years, it’s always a “robot”. And this has been going on since well before LLMs.

ghaff2y ago

Especially with fairly systemic labor shortages, it seems inevitable that we'll see more and more self-service and automation with the corollary that getting an actual human involved will become more difficult.

crazygringo2y ago

Or it could be precisely the opposite -- LLM's take care of all of the easy customer service/moderation, so that it's actually affordable for Google and others to hire high-quality customer reps to manage the hard/urgent stuff that LLM's surface.

I don't know, but generally speaking with technological progress, while we lose some things we gain more things. It's important to think not just what technology gets rid of, but what it enables.

adam_arthur2y ago

They are already sufficient for high level classification... it's just a question of cost.

It's getting tiring reading all the LLM takes from people here who clearly don't use or understand them at all. So many still stuck in the "predicting next token" nonsense, as if humans don't do that too

maaanu2y ago

You are seriously telling me that humans are predicting word for word when they speak?

adam_arthur2y ago

A system that "predicts the next token" in such a way that it is indistinguishable from a human, is just like a human in practice yes.

How does a human decide which word to use in your mind? Magic?

No, it's a logically based biological/neurological process through which at the end of it, you've decided on a word. They are both forms of computing that can produce largely indistinguishable output... doesn't matter that one is biological and the other isn't

stevenhuang2y ago

Actually yes, architecturally that's the essence of predictive coding.

It's among the leading theories in neuroscience for how our brains work https://en.wikipedia.org/wiki/Predictive_coding

19h2y ago· 19 in thread

We’re classifying gigabytes of intel (SOCMINT / HUMINT) per second and found semantic folding to be better in classification quality vs throughput than BERT / LLMs.

How it works — imagine you’re having these sentences:

“Acorn is a tree” and “acorn is an app”

You essentially keep record of all word to word relations internal to a sentence:

- acorn: is, a, an, app, tree Etc.

Now you repeat this for a few gigabytes of text. You’ll end up with a huge map of “word connections”.

You now take the top X words that other words connect to (e.g. 16384). Then you create a vector of 16384 connections for each word, encoded as 1,0,1,0,1,0,0,0, … (the first position is the most-connected-to word, the second position the second, etc.; 1 indicates “is connected” and 0 indicates “no such connection”).

You’ll end up with a vector that has a lot of zeroes — you can now sparsify it (I.e. store only the positions of the ones).

You essentially have fingerprints now — what you can do now is to generate fingerprints of entire sentences, paragraphs and texts. Remove the fingerprints of the most common words like “is”, “in”, “a”, “the” etc. and you’ll have a “semantic fingerprint”. Now if you take a lot of example texts and generate fingerprints off it, you can end up with a very small amount of “indices” like maybe 10 numbers that are enough to very reliably identify texts of a specific topic.

Sorry, couldn’t be too specific as I’m on the go - if you’re interested drop me a mail.

We’re using this to categorize literally tens of gigabytes per second with 92% precision into more than 72 categories.
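
The steps described above can be sketched roughly as follows. This is a toy reading of the comment, not the poster's implementation: the stopword list, the union used to combine word fingerprints into a text fingerprint, and the Jaccard overlap used for comparison are all assumptions, and the function names are invented.

```python
from collections import Counter, defaultdict

STOPWORDS = {"is", "in", "a", "an", "the"}  # assumed list of very common words

def build_fingerprints(sentences, top_x):
    """Map each word to the sorted positions of the top-X words it co-occurs with."""
    links = defaultdict(set)  # word -> other words seen in the same sentence
    for sentence in sentences:
        words = set(sentence.lower().split())
        for w in words:
            links[w] |= words - {w}
    # Rank words by how many other words connect to them; the top X become dimensions.
    degree = Counter({w: len(ns) for w, ns in links.items()})
    ranked = sorted(degree, key=lambda w: (-degree[w], w))[:top_x]
    position = {w: i for i, w in enumerate(ranked)}
    # Sparse form: store only the positions of the ones.
    return {w: sorted(position[n] for n in ns if n in position) for w, ns in links.items()}

def text_fingerprint(text, fingerprints):
    """Assumed aggregation: union of word fingerprints, skipping very common words."""
    bits = set()
    for w in text.lower().split():
        if w not in STOPWORDS:
            bits.update(fingerprints.get(w, ()))
    return bits

def overlap(a, b):
    """Jaccard overlap between two sparse fingerprints (an assumed similarity measure)."""
    return len(a & b) / len(a | b) if a | b else 0.0

fp = build_fingerprints(["Acorn is a tree", "acorn is an app"], top_x=4)
```

On a real corpus `top_x` would be something like the 16384 mentioned above; the two-sentence example only shows the mechanics.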

wavemode2y ago

I'd be curious how the output of your approach compares to merely classifying based on what keywords are contained in the text (given that AFAICT you're simply categorizing rather than trying to extract precise meaning).

It’s the same as a giant one hot vector. He’s not describing anything terribly new or impressive, but if it works then god bless and good luck.

SomewhatLikely2y ago

Sounds like TF-IDF vectors.

lgas2y ago

Not to dogpile on all the other "isn't this just" messages, but isn't this just sparse embeddings?

lmeyerov2y ago

Yeah I'm struggling to understand at a fundamental level how this is better both in math + engineering, doubly so by the time you get to sentence embeddings. (Genuinely, it seems to use the same ideas, so I'm curious what the specific trick is vs mature embedding packages already doing much of this afaict.)

mmcwilliams2y ago

You're not wrong. This sounds curiously close to the ways I've seen word2vec used in production.

espe2y ago

very efficient but also brittle. that must be vast amounts of relatively clean data. you have to magically set the number of top n words to in- and exclude. for most user generated content one would need to heavily normalize the text, e.g. by stemming (to keep in line with the computational austerity). 16384 is very little even if it is neatly separated concepts. applied to that volume of data it should amount to keyword matching. that only works if users are basically self-tagging their texts via constrained language use.

edit: short version: not semantics and not a fingerprint :)

19h2y ago

We also trained on all of pushshift and have an average “unknown” word rate of less than 0.007%; the Reddit corpus is rather amazing at capturing pretty much all misspellings of a word.

We may only be using 16k vector values but that doesn’t mean we only have a vocab of 16k; our vocab is more like 1.9 million words, each described by a sparse fingerprint of 16k.

foolswisdom2y ago

I'm curious though, how do you handle related forms of a word (assuming you don't use stemming)? It doesn't seem to me that this process would automatically handle that.

espe2y ago

thanks for the clarification. if your base population is that large then it's frequencies and you get a fingerprint. well done.

dr_kiszonka2y ago

If I understand your approach correctly, you could represent relations between words as graphs and use graph/network similarity measures (of which there are tons) to possibly get over the 92%. (Or not, I have never tried it.)

19h2y ago

Interesting idea! Can you elaborate a bit more?

spyckie22y ago

Just asking, this seems very similar to the attention algorithm that powers LLMs?

- https://en.wikipedia.org/wiki/Semantic_folding

It’s not similar other than that attention relates tokens.

mistrial92y ago

amazing that this streaming pile of characters and its uncreative associations with three-letter-agency code names, results in exactly ninety two percent accuracy.. almost like it's profoundly wrong in exactly the most important ways

19h2y ago

Care to elaborate? Not sure why this tone is appropriate.

The 92% is an average and not the exact accuracy across all categories; the accuracy varies by category as every category is represented by its own filter.

LewisDavidson2y ago

Do you have any code that demonstrates this? Sounds super interesting!

19h2y ago

Unfortunately, I can't. We have some projects bubbling around that may see the light of the day eventually but given the myriads of NDAs that stack on top of each other this is rather unlikely.

That said, here's some reading material on the underlying ideas:

- https://arxiv.org/pdf/1511.08855.pdf ("Semantic Folding Theory And its Application in Semantic Fingerprinting")

This is _not_ TF-IDF. Once you have built the "relation fingerprints" of each word, the fingerprint lookup complexity is O(1) as you'll essentially only load a massive LUT of type HashMap<String, Vec<u16>> (or u32 if you go above 65,535 positions). [pro tip: our LUT has the type HashMap<Vec<String>, Vec<u16>> as our impl also considers bigrams, trigrams, quadgrams]

Unfortunately I can't get extremely specific, but we're also feeding these [u8; 16384] vecs into an HTM w/ spatial pooler; feeding one word-SDR aka fingerprint into the HTM at a time allows you to leverage the HTM to make predictions for the most likely next word-SDR aka the fingerprint of the next word. If you generalise this to the sentence level, you can use the cosine distance between the actual text-SDR aka fingerprint of the next sentence and the predicted text-SDR out of the HTM to semantically segment paragraphs in a continuous stream of text.

This allows us to segment SOCMINT user2user conversations into individual semantically connected packages of text / messages that can be marked by scenario-specific heuristics to be additionally analysed by a downstream system.
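
To make the segmentation idea more concrete, here is a minimal sketch of the cosine-distance step on sparse binary fingerprints. It does not reproduce the HTM: as a loudly labeled simplification, each sentence fingerprint is compared with the previous sentence's fingerprint instead of an HTM prediction, and the threshold value is arbitrary.

```python
import math

def cosine_distance(a, b):
    """Cosine distance between two binary fingerprints stored as sets of active positions."""
    if not a or not b:
        return 1.0
    # For 0/1 vectors the dot product is the overlap and each norm is sqrt(active count).
    return 1.0 - len(a & b) / math.sqrt(len(a) * len(b))

def segment(sentence_sdrs, threshold):
    """Start a new segment when consecutive fingerprints are further apart than threshold.

    Assumption: the previous sentence stands in for the HTM's predicted fingerprint.
    """
    segments, current = [], []
    for i, sdr in enumerate(sentence_sdrs):
        if current and cosine_distance(sentence_sdrs[i - 1], sdr) > threshold:
            segments.append(current)
            current = []
        current.append(i)
    if current:
        segments.append(current)
    return segments

sdrs = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {7, 8, 10}]
groups = segment(sdrs, threshold=0.5)
```

The function returns lists of sentence indices, one list per semantically connected segment.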

mynegation2y ago

They do but it’s probably… classified.

nestorD2y ago· 17 in thread

LLMs are significantly slower than traditional ML, typically costlier and, I have been told, tend to be less accurate than a traditional model trained on a large dataset.

But, they are zero/few shot classifiers. Meaning that you can get your classification running and reasonably accurate now, collect data and switch to a fine-tuned very efficient traditional ML model later.
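
A rough sketch of that workflow: classify with a prompt today and log every result so a cheaper model can be trained on the collected pairs later. The `llm` argument is a hypothetical callable that takes a prompt string and returns the model's text; no particular API is assumed, and labels are assumed to be lowercase.

```python
def classify_and_log(text, labels, llm, log):
    """Classify with an LLM prompt and record the pair as future training data."""
    prompt = (
        "Classify the message into one of: " + ", ".join(labels) + ".\n"
        f"Message: {text}\nLabel:"
    )
    answer = llm(prompt).strip().lower()
    # Reject anything outside the label list instead of trusting free-form output.
    label = answer if answer in labels else None
    log.append({"text": text, "label": label})
    return label
```

Once `log` holds enough reviewed examples, it can serve as the training set for the fine-tuned traditional model.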

SkyPuncher2y ago

To me, LLMs feel like "low-code" tools in most applicable domains.

They're very, very good at creating a new, novel solution - but specially trained ML models will rule.

godelski2y ago

> LLMs are significantly slower than traditional ML, typically costlier

Literally point 3 in the article.

> But, they are zero/few shot classifiers

This is __NOT__ true. Zero-shot means out of domain, and if we're talking about text trained LLMs, there really isn't anything text that is out of domain for them because they are trained on almost anything you can find on the internet. This is not akin to training something on Tiny Shakespeare and then having it perform sentiment analysis (classification) on Sci-Fi novels. Similarly, training a model on JFT or LAION does not give you the ability to perform zero shot classification on datasets like COCO or ImageNet, since the same semantic data exists in both datasets. I don't know why people started using this term to describe the domain adaptation or transfer learning, but it is not okay. Zero-shot requires novel classes, and subsets are not novel.

radarsat12y ago

It comes from this paper [0], and I believe the idea is that the LLM was not trained on the task in question, but is able to do it with only instructions (zero shot) or with one or a few examples (few shot). The paper rightfully points out the unexpected fact that the model is only trained to predict the next word and yet can follow arbitrary instructions and perform tasks that it was not explicitly trained to do.

[0]: https://arxiv.org/abs/2109.01652

godelski2y ago

> It comes from this paper

To clarify, what comes from that paper? The claim that LLMs are zero-shot learners (yes) or the term zero-shot (no[0]).

> I believe the idea is that the LLM was not trained on the task in question

Not quite. We'll see in [0] that the definition is

>> We consider the problem of zero-shot learning, where the goal is to learn a classifier f : X → Y that must predict novel values of Y that were omitted from the training set. To achieve this, we define the notion of a semantic output code classifier (SOC) which utilizes a knowledge base of semantic properties of Y to extrapolate to novel classes.

To clarify, this means that their goal is to obtain a classifier f:X → Y but that they train f':X → Z, where Z ⊂ Y. You then test this by performing f':X → A where A ⊂ Y and A ⊄ Z. To make this clearer, their experiments classify 60 words such as bear, dog, cat, truck, car, airplane. You'll notice there are two metaclasses here (there are more): animals and vehicles. The second dataset included 128 _semantic_ features (e.g. size/shape/surface properties/usage) about the previous words and that's what they tested against. Notice how the abstraction level increases. Note that Z ⊂ A is acceptable, but not the other way around; this should clarify my LAION -> ImageNet example.

The reason that this is important is that zero-shot is telling us about the model's ability to generalize, as the model learns more, and more _abstracted_, discriminating boundaries within the data than it was explicitly trained for. It is not very informative to learn that a model can perform a subset of its trained task (see CIFAR-5 example in sibling comment) -- though this can still be interesting, but for other reasons.

I should mention that there is a "transductive setting" for zero-shot, where unlabeled versions of the novel classes are provided during training, but this is explicitly stated when done and there is some contention about the utility of this. This is better referred to as "transductive testing". Generative models also have some contention as density estimators will localize similar data, which is to say that they classify (this is a consequence of the training method and so it can be argued that we've explicitly directed the machine to learn this). This relates directly to the transductive point.

For a definition of zero-shot learning, I suggest the paper Zero-Shot Learning -- A Comprehensive Evaluation of the Good, the Bad and the Ugly[1] (which, you'll note, predates FLAN by 4 years). I'll make one point though: this work states

> Zero-shot learning assumes disjoint training and test classes

But I don't think that's entirely accurate, as we previously discussed our abstraction case. This is more semantics though and for the case of the dataset they generate it isn't extremely relevant. But the more generalized notion of zero-shot doesn't necessitate disjoint but just that the testing set isn't a subset of the training set (which is always true of the disjoint setting). (Side note: notice that they provide a train/val/test split instead of train/test. This is kinda important) Note that my critique is consistent with another survey work[2] (which also predates FLAN)

> Definition 1.1 (Zero-Shot Learning). Given labeled training instances D^{tr} belonging to the seen classes S, zero-shot learning aims to learn a classifier f^u(·) : X → U that can classify testing instances X^{te} (i.e., to predict Y^{te} ) belonging to the unseen classes U.

As to FLAN, we should mention that the GPT-3[3] work uses quotes around "zero-shot" as they likely recognize its bastardization. But naming things is one of Bambrick's two hard problems. Notice that they also clearly define their usage. You'll notice that FLAN does not do this! My claim that LLMs are not zero-shot learners rests on the fact that they have actually been trained on all the domains that they have been "zero-shot evaluated" on. FLAN gives an example of a "zero-shot" task as: “Is the sentiment of this movie review positive or negative?” or “Translate ‘how are you’ into Chinese.” But what you have to ask yourself is whether these questions themselves are in the training set, as this would decide whether our requirement is met; if they are, this would at best be that "transductive setting," which I think we can now agree is not a great thing to refer to as "zero-shot". The problem is that these questions are very likely in the training datasets, as those incorporate things like Reddit and Hacker News, where we can definitely find explicit labels for movie reviews as well as some translation tasks (common on language subreddits). That's the issue here. Just because you aren't aware you have trained a model to perform a specific task doesn't mean that you didn't, and thus doesn't mean you actually performed a zero-shot task.

[0] Zero-Shot Learning with Semantic Output Codes https://www.cs.toronto.edu/~hinton/absps/palatucci.pdf

[1] Zero-Shot Learning -- A Comprehensive Evaluation of the Good, the Bad and the Ugly https://arxiv.org/abs/1707.00600

[2] A Survey of Zero-Shot Learning: Settings, Methods, and Applications https://dl.acm.org/doi/10.1145/3293318

[3] Language Models are Few-Shot Learners https://arxiv.org/abs/2005.14165

PartiallyTyped2y ago

> Zero-shot means out of domain, and if we're talking about text trained LLMs, there really isn't anything text that is out of domain for them because they are trained on almost anything you can find on the internet.

Respectfully, i disagree. I have used LLMs on actually novel tasks for which there aren’t any datasets out there. They “get it”.

> I don't know why people started using this term to describe the domain adaptation or transfer learning, but it is not okay. Zero-shot requires novel classes, and subsets are not novel.

Respectfully, i disagree.

Zero-shot is perfectly valid because there is no backpropagation or weight change involved. Causal LLMs are meta-learners due to the attention mechanism and the autoregressive nature of the model. These two change the effective weight of the matrices.

For all sequences of inputs and all possible weights; there exists an instantiation of a neural network without attention that produces identical vectors for the current token given only the previous token.

Do the math, or read the paper “LLMs are meta learners”.

Therefore, for all tasks, giving the model examples of inputs changes its effective weights without actually modifying it, it is perfectly valid for “zero shot learning” because you didn’t do backprop of any kind, you merely did input transformations / preprocessing.

godelski2y ago

> I have used LLMs on actually novel tasks for which there aren’t any datasets out there. They “get it”.

Can you give an example so that we can discuss it better, or so that I can update my understanding? But I will say that simplifying this down to "just trained to predict the next token" is not accurate as it does not account for the differences in architectures and cost functions which dramatically affect this statement due to the differences in their biases. As a clear example, training an image model on likelihood does not guarantee that the model will produce high fidelity samples[0]. But it will be better at imputation or classification. Some other helpful references[1,2]

> Zero-shot is perfectly valid because there is no backpropagation or weight change involved.

I disagree with this. What you have described is still within the broader class of fine tuning. Note that zero-shot is also tuning. I can make this perfectly clear with a simple example that is directly related to my previous argument. ``Suppose we train a model on the CIFAR-10 dataset. Then we "zero-shot" evaluate it on CIFAR-5, where we've just removed 5 random classes.`` I think you'll agree that it should be unsurprising that the model performs well on this second task. This is exactly the "Train on LAION then 'zero-shot' classification on ImageNet" task we commonly see. Subsets are not a clear task change.

> These two change the effective weight of the matrices.

I'm having a difficult time understanding your argument as this directly contradicts your first sentence. I wouldn't even make the lack of weight change a requirement for zero-shot learning as the intent is really that we do not need to directly change. If a model has enough general knowledge and we do not need to modify the parameters explicitly through providing more training (i.e. using a cost function and {back,forward}prop), then this is sufficient (randomly changing parameters, adding non-trainable parameters like activations, or pruning is also acceptable. As well as explicitly what you mentioned). The point comes down to requiring no additional training for __additional domain{,s}__. The training part is not the important part here and not what is in question.

My point is explicitly about claiming that subdomains do not constitute zero-shot learning. If you disagree in what I have claimed are subdomains, then that's a different argument. I'm not arguing against the latter points because that's also not arguing against what I claimed. But I will say that "just because you didn't use backprop doesn't mean it isn't zero-shot" and if you disagree, then note that you have to claim that the CIFAR-5 example is "zero-shot."

Tldr: A -> B doesn't require that B -> A

[0]A note on the evaluation of generative models: https://arxiv.org/abs/1511.01844 (link for also obtaining slides and code: http://theis.io/publications/17/)

Also worth looking at many of the works that cite this one: https://www.semanticscholar.org/paper/A-note-on-the-evaluati...

[1a] Assessing Generative Models via Precision and Recall: https://arxiv.org/abs/1806.00035

[1b] Improved Precision and Recall Metric for Assessing Generative Models: https://arxiv.org/abs/1904.06991

[2] The Role of ImageNet Classes in Fréchet Inception Distance: https://arxiv.org/abs/2203.06026

https://general-pattern-machines.github.io/

famouswaffles2y ago

>Zero-shot means out of domain, and if we're talking about text trained LLMs, there really isn't anything text that is out of domain for them because they are trained on almost anything you can find on the internet.

Side stepping the fact that isn't really how the term is used with models these days, i don't know about that.

jerrygenser2y ago

It is zero shot. The llm is trained to generate next token.

Zero shot is defined as being able to output predictions for classes they were not trained on.

It doesn't mean the input data can't be in ml task domain but that the model was not trained on this particular ML task and/or classes.

godelski2y ago

> It is zero shot. The llm is trained to generate next token.

I'm going to refer you to the sibling comments as they stated similar things and I answered them in depth and do not wish to repeat myself.

But to summarize:

Zero-shot := Goal of f:X → Z but train f':X → Y, where Y ⊂ Z. We test on A ⊂ Z, where A ⊄ Y (sometimes the definition is A ∩ Y = ∅, but I'm not being as strict)
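
That set condition is easy to check mechanically. A minimal sketch, with class sets as plain Python sets; the function name and the extra "zebra" class are made up for illustration, while the ten class names are the standard CIFAR-10 labels.

```python
def is_zero_shot(train_classes, test_classes, goal_classes):
    """True if the test classes lie in the goal space but are not all covered by training."""
    return test_classes <= goal_classes and not test_classes <= train_classes

CIFAR10 = {"airplane", "automobile", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck"}
CIFAR5 = {"bird", "cat", "deer", "dog", "frog"}   # a subset of the training classes
GOAL = CIFAR10 | {"zebra"}                        # hypothetical wider label space

subset_case = is_zero_shot(CIFAR10, CIFAR5, GOAL)          # the CIFAR-5 example
novel_case = is_zero_shot(CIFAR10, {"cat", "zebra"}, GOAL)  # includes an unseen class
```

Under this reading the CIFAR-5 evaluation does not count as zero-shot, while a test set containing an unseen class does.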

> but that the model was not trained on this particular ML task and/or classes.

I'm going to need explicit clarification as to this. Explicitly or implicitly? See sibling comments and note about likelihood and density estimators w.r.t. classification.

The training classes in this case are words in the vocabulary in the context of a sentence.

hellovaiOP2y ago

That's a great summary and insight. We should likely use that verbiage to help make it more crystal clear :)

famouswaffles2y ago

Current State of the art (GPT-4) is not going to be less accurate than whatever bespoke option you can cook up.

withinboredom2y ago

I wouldn't be so sure of that.

mplewis2y ago

This is absolutely untrue.

famouswaffles2y ago

Feel free to show otherwise

potatoman222y ago

I'd bet good money that well crafted XGBoost models will outperform GPT-4 on almost any classification problem that uses numerical or tabular data. Which, historically, is what most classification problems use.

golergka2y ago

It will be much slower and costlier.

alexmolas2y ago· 12 in thread

Where's the comparison with traditional ML? In the article I only see the good things about using LLMs, but there's no mention of traditional ML besides the title.

It would be nice to see how this "complex" approach compares against a "simple" TF-IDF + RF or SVM.
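
For a sense of how small such a baseline can be, here is a standard-library-only sketch. It is a stand-in, not the TF-IDF + RF or SVM pipeline named above: it uses TF-IDF weights with a nearest-centroid rule under cosine similarity, and the training messages and labels are invented.

```python
import math
from collections import Counter, defaultdict

def fit_tfidf_centroids(texts, labels):
    """Return (idf, centroids): per-term IDF weights and one summed TF-IDF vector per label."""
    docs = [Counter(t.lower().split()) for t in texts]
    df = Counter(term for doc in docs for term in doc)   # document frequency per term
    idf = {term: math.log(len(docs) / n) + 1.0 for term, n in df.items()}
    centroids = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        for term, tf in doc.items():
            centroids[label][term] += tf * idf[term]
    return idf, dict(centroids)

def predict(text, idf, centroids):
    """Pick the label whose centroid has the highest cosine similarity to the text."""
    counts = Counter(text.lower().split())
    vec = {t: tf * idf[t] for t, tf in counts.items() if t in idf}  # unseen terms dropped
    vec_norm = math.sqrt(sum(w * w for w in vec.values()))

    def cosine(centroid):
        dot = sum(w * centroid.get(t, 0.0) for t, w in vec.items())
        norm = vec_norm * math.sqrt(sum(w * w for w in centroid.values()))
        return dot / norm if norm else 0.0

    return max(centroids, key=lambda label: cosine(centroids[label]))

idf, centroids = fit_tfidf_centroids(
    ["refund my order", "where is my refund", "app crashes on login", "login page crashes"],
    ["billing", "billing", "bug", "bug"],
)
```

A real comparison would swap the centroid rule for a proper classifier and evaluate on held-out data; the point here is only the shape of the baseline.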

specproc2y ago

Yeah, my thoughts exactly. If you're running 500k in tokens through someone else's hallucination-prone computer and paying for the privilege, I want to know why that's any better than something like SetFit.

All I saw were attempts to reproduce some chatgpt output.

hellovaiOP2y ago

SetFit is fairly good, and we do help train SetFit like models for the results you get, however, the issue with SetFit is that its latency and cost benefits come at the price of flexibility.

If you want to add a new class, or update an existing one, it requires training a new model. Sometimes this is ok and sometimes it's not. This is why we generally prefer a hybrid approach where some classes are using traditional models (BERT based) while others are determined by the LLM.

specproc2y ago

I guess use case is everything. There are numerous reasons, not least of which confidentiality, why chatgpt is a no go for me.

What I'd like to see more of is systematic comparison between chatgpt and classic models. I was hoping to see a bit of this in this article and was disappointed.

espe2y ago

+1 for setfit. a baseline that's hard to beat.

hellovaiOP2y ago

Thanks Alex, in this article we focused more on deployment comparisons, for example the cost and latency of what it would take to deploy a BERT based model vs LLMs.

In a future article, we're planning on posting accuracy comparisons as well, but here we want to evaluate a few other architectures for comparison. For example, at 1TPS with 1k tokens, chat-gpt-turbo would cost almost $5k vs a simpler BERT model you could run for under $50.

This is probably very obvious to some people, but a lot of people's first experience with any sort of AI is often an LLM, so this is just the first of many posts we hope to share.

jonathankoren2y ago

Yeah, I also find the lack of the comparison suspicious. As is the talk about “hallucinated class labels” being “helpful”.

If I had to take a guess, I suspect the LLM might perform a touch better, but we’re talking fractional percent better. Which is fine, if you have the volume, but a wash otherwise.

famouswaffles2y ago

https://www.artisana.ai/articles/gpt-4-outperforms-elite-cro...

famouswaffles2y ago

Current State of the art (GPT-4) is mostly on par with experts and much better than crowdworkers. Might be overkill though.

3.5 (what is used here) is better than crowd workers https://arxiv.org/abs/2303.15056

lisasays2y ago

On par with "experts", no.

Per the article: "outperformed the most skilled crowdworkers" on nuanced (but not highly technical) tasks like sentiment labeling.

By definition, it can't outperform the expert ensemble because that's where the gold labels come from.

famouswaffles2y ago

It's as good or better than experts on 7/18 of those benchmarks. On an additional 4, it's close (within 0.05).

>By definition, it can't outperform the expert ensemble because that's where the gold labels come from.

The ensemble, no, but it can outperform an individual expert trying to solve it. But yes, the benchmarks are biased toward the experts here.

viraptor2y ago

Or even slightly fancy Word2vec/USE or even sentence transformers with clustering that you can trivially run locally rather than a full blown conversational LLM. I'd love to see a large scale comparison.

rossirpaulo2y ago· 11 in thread

This is great! We had a similar thought and couldn't agree more with "LLMs prefer producing something rather than nothing." We have been consistently requesting responses in JSON format, which, despite its numerous advantages, sometimes imposes an obligation for an output even if it shouldn't. This frequently results in hallucinations. Encouraging NULL returns, for example, is a great way to deal with that.

caesil2y ago

I've found that this is best dealt with along two axes with constrained options. i.e., request both a string and a boolean, and if you get boolean false you can simply ignore the string. So when the LLM ignores you and prints a string like "This article does not contain mention of sharks", you can discard that easily.

If you tell it "Return what this says about sharks or nothing if it does not mention them", it will mess up.
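
A small sketch of the string-plus-boolean pattern on the parsing side. The field names `mentions_sharks` and `summary` are hypothetical; the idea is only that the boolean gates whether the string is used at all.

```python
import json

def extract_mention(raw_response):
    """Return the summary only when the model says the topic is actually present."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return None  # unparseable output is treated the same as "nothing found"
    if not isinstance(data, dict) or data.get("mentions_sharks") is not True:
        return None  # ignore the string whenever the flag is false or missing
    return data.get("summary")
```

A response such as `{"mentions_sharks": false, "summary": "This article does not contain mention of sharks"}` is then discarded without inspecting the string.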

LawTalkingGuy2y ago

Have you tried this sort of prompt?

User text: "Blah blah ... Sharks ... Surfing ..."
Instruction: Return a JSON object containing an array of all sentences in the user text which mention sharks directly or by implication.
Response: {"list_of_shark_related_sentences": [

Stop token: ']}'

It'll try to complete the JSON response and it'll try to end it by closing the array and object as shown in the stop token. This severely limits rambling, and if it does add a spurious field it'll (usually) still be valid JSON and you can usually just ignore the unwanted field.

wrt OpenAI, text-davinci-003 handles this well, the other models not so much.
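
On the client side, the prefix-plus-stop-token trick might look like the sketch below. It assumes a completion-style API that stops before emitting the stop sequence and does not include it in the returned text, so the prefix and the stop token are glued back on before parsing. The helper names are made up.

```python
import json

PREFIX = '{"list_of_shark_related_sentences": ['
STOP = "]}"

def build_prompt(user_text):
    """Build a prompt that ends with the opening of the JSON we want completed."""
    return (
        f'User text: "{user_text}"\n'
        "Instruction: Return a JSON object containing an array of all sentences in "
        "the user text which mention sharks directly or by implication.\n"
        f"Response: {PREFIX}"
    )

def parse_completion(completion):
    """Rebuild the full JSON document from the model's partial completion."""
    return json.loads(PREFIX + completion + STOP)["list_of_shark_related_sentences"]
```

An empty completion parses to an empty list, which gives the model a cheap way to say "nothing found".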

dontupvoteme2y ago

Making it rank multiple attributes on a scale of 1-10 also works decent in my experience. Then one can simply k-means cluster (or similar) and evaluate the grouping to see how accurate its estimations are

caesil2y ago

Yes, agreed. I'm doing this as well. Works excellently for NLP classifier tasks.

Funnily enough, there is a certain propensity for it to output round numbers (50, 100, etc.) so I have to ask it not to do this and provide examples ("like 27, 63, or 4"). Now that I think about it I should probably randomize those.
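
Randomizing the example numbers is straightforward. A minimal sketch, assuming a 0 to 100 scale and treating multiples of 5 as "round"; both choices and the prompt wording are assumptions for illustration.

```python
import random

def example_scores(n, rng, low=1, high=99):
    """Draw n distinct example scores, skipping multiples of 5 so none look round."""
    candidates = [x for x in range(low, high + 1) if x % 5 != 0]
    return rng.sample(candidates, n)

def score_instruction(rng):
    """Build the scoring instruction with freshly sampled example numbers."""
    a, b, c = example_scores(3, rng)
    return f"Give a score from 0 to 100. Avoid round numbers; answer like {a}, {b}, or {c}."

instruction = score_instruction(random.Random())
```

Passing a seeded `random.Random` makes the prompt reproducible when that matters for evaluation.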

https://letscooktime.com/Blog/ai,/machine/learning,/chatgpt,...

galleywest2002y ago

Have you tried using GPT-4's new Function Call feature? The "killer" portion of this is guaranteed JSON based on a schema you pass to the model.

hellovaiOP2y ago

That's a good point! We're actually working on integrating this as well, but in practice, what we've found is that LLM's in general don't like to respond with empty strings for example.

My hypothesis here is that due to RLHF, there's likely some implicit learning that tangentially related content is better than no content.

Given that, you'd likely still get better results with your schema being:

"string | null" so the LLM can output a null instead of "" since there is probably not as much training data that gives "" high log prob values.

But we're looking forward to evaluating function calling and seeing what the metrics show!
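
As an illustration of the "string | null" idea, the sketch below writes the field as a JSON Schema type union and normalizes the reply so that null and empty strings are treated alike. The field name `answer` is hypothetical, and how the schema is passed to the model is left out.

```python
import json

# Assumed JSON Schema for the answer field: a string, or null when nothing applies.
ANSWER_SCHEMA = {
    "type": "object",
    "properties": {"answer": {"type": ["string", "null"]}},
    "required": ["answer"],
}

def read_answer(raw_response):
    """Return the answer, treating null, empty, or whitespace-only strings as no answer."""
    value = json.loads(raw_response).get("answer")
    if isinstance(value, str) and value.strip():
        return value
    return None
```

Normalizing on the way in means downstream code only has to handle one "no content" case.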

guhidalg2y ago

I integrated the function calling feature into my personal project and wrote a blog post about it here:

Hopefully this saves you some time!

[0] https://twitter.com/mattrickard/status/1678603390337822722

rolisz2y ago

Nope, it's not guaranteed. They warn you in the OpenAI docs that it might hallucinate nonexistent parameters.

Der_Einzige2y ago

Constrained generation should not require calling supplemental functions. It's as simple as banning or reducing the weight of the naughty tokens. There are several libraries which enable this without function calling (microsoft guidance, jsonformer, lmql)
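
The core of token banning fits in a few lines. This is a toy sketch over a plain list of logits with greedy selection; it is not the API of guidance, jsonformer, or lmql, and the token ids are arbitrary.

```python
import math

def mask_logits(logits, allowed_ids):
    """Set every logit outside allowed_ids to -inf so those tokens can never be picked."""
    return [x if i in allowed_ids else -math.inf for i, x in enumerate(logits)]

def greedy_pick(logits, allowed_ids):
    """Pick the highest-scoring token among the allowed ones."""
    masked = mask_logits(logits, allowed_ids)
    return max(range(len(masked)), key=masked.__getitem__)

choice = greedy_pick([5.0, 1.0, 3.0], allowed_ids={1, 2})
```

Here token 0 has the highest raw score but is banned, so the pick falls to the best allowed token.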

msp262y ago

The output is not 100% guaranteed. Be careful about that and have another layer to check the output.

I had a schema with a string enum property to categorise some inputs. One of the category names was "media/other" or something to that effect. Sometimes the output would stop at just media even though it wasn't a valid option in the schema.
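
A sketch of such a checking layer, assuming a fixed set of category names. The names other than "media/other" are hypothetical, and the prefix repair is an assumption of this example rather than something described above: a truncated value is accepted only when exactly one valid category starts with it.

```python
ALLOWED = {"billing", "shipping", "media/other"}  # hypothetical category names

def check_category(value, allowed=ALLOWED):
    """Return a valid category, repairing a truncated one only when unambiguous."""
    value = value.strip()
    if value in allowed:
        return value
    # Treat the value as a possible truncation of exactly one valid category.
    candidates = [c for c in allowed if c.startswith(value)] if value else []
    if len(candidates) == 1:
        return candidates[0]
    raise ValueError(f"unexpected category: {value!r}")
```

Anything that fails the check can be retried or routed to a fallback instead of being written downstream.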

com2kid2y ago

I've run into the same issue, but you can turn it into an advantage if you are careful enough.

Basically, give the LLM a schema that is loose enough for the LLM to expand where it feels expansion is needed. Saying always "return a number" is super limiting if the LLM has figured out you need a range instead. Saying "always populate this field" is silly because sometimes the field doesn't need to be populated.

r_singh2y ago· 1 in thread

I have been using LLMs for ABSA, text classification and even labelling clusters (something that had to be done manually earlier on) and I couldn't be happier.

It was turning out to be expensive earlier but with optimising the prompt a lot, reduced pricing by OpenAI and now also being able to run Guanaco 13/33B locally has made it even more accessible in terms of pricing for millions of pieces of text.

hellovaiOP2y ago

That's very interesting! What sort of direction did you head in with prompt optimization? Was it mostly in shrinking it and then using multi-shot examples? We found that shorter prompts (empirically) perform better than longer prompts.

wilg2y ago· 1 in thread

Classic HN website nitpick: Logo should link to home page. In this case it is a link but just goes to the current page. However, points for being able to easily get to the main product page from the blog, usually that's buried.

hellovaiOP2y ago

oh! Good catch! Fixed this, and will update in the release.

m3kw92y ago· 1 in thread

Probably cheaper with ML, but you need training. With transfer learning, though, you can start from a publicly trained model and use far less data to train a classifier; single-digit thousands of examples may be enough for 2-5 sentiments.

avereveard2y ago

One can use the LLM to generate the labels and distill a model to the desired precision. I used that approach and it worked quite well: the model runs locally, including creating the sentence embeddings, faster than the LLM and at a fraction of the cost.

Now, certain problem spaces may be large enough to require models whose runtime makes it uneconomical to run them locally, but ML is still a game of heuristics, so each problem requires some experimentation.
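
A rough sketch of the distillation loop, under loud assumptions: the labels below stand in for LLM output, and a bag-of-words nearest-centroid classifier is swapped in for the sentence-embedding model mentioned above so the example stays dependency-free. All texts and label names are invented.

```python
from collections import Counter, defaultdict

def featurize(text):
    """Bag-of-words counts; a stand-in for real sentence embeddings."""
    return Counter(text.lower().split())

def train_centroids(labeled):
    """labeled: iterable of (text, label) pairs, with labels produced by an LLM."""
    sums, counts = defaultdict(Counter), Counter()
    for text, label in labeled:
        sums[label].update(featurize(text))
        counts[label] += 1
    # Average word counts per label to get one centroid per class.
    return {lab: {w: c / counts[lab] for w, c in vec.items()} for lab, vec in sums.items()}

def cosine(a, b):
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def predict(centroids, text):
    """Return the label whose centroid is most similar to the text."""
    feats = featurize(text)
    return max(centroids, key=lambda lab: cosine(feats, centroids[lab]))

# Pretend these labels came back from the LLM.
llm_labeled = [
    ("refund my order please", "billing"),
    ("charged twice on my card", "billing"),
    ("package arrived late", "shipping"),
    ("where is my package", "shipping"),
]
model = train_centroids(llm_labeled)
```

Once the small model agrees with the LLM often enough on a held-out sample, the LLM is only needed for relabeling when the categories change.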

YetAnotherNick2y ago· 1 in thread

Interested in knowing how you are running BERT model with $35/month? Cheapest GPU instance costs $200-300/month AFAIK.

aaronvg2y ago

(Other author of this blog post here)

We actually do CPU inference. The SBERT models have a pretty small memory footprint -- you can fit a couple models on a t2.medium instance.

On a C6.Large you can get 75ms inference. T2.medium is more around 100-200ms

rckrd2y ago

I just released a zero-shot classification API built on LLMs https://github.com/thiggle/api. It always returns structured JSON and only the relevant categories/classes out of the ones you provide.

LLMs are excellent reasoning engines. But nudging them to the desired output is challenging. They might return categories outside the ones that you determined. They might return multiple categories when you only want one (or the opposite: a single category when you want multiple). Even if you steer the AI toward the correct answer, parsing the output can be difficult. Asking the LLM to output structured data works 80% of the time. But the 20% of the time that your code fails to parse the response takes up 99% of your time and is unacceptable for most real-world use cases.

Animats2y ago

What's the application?

If you're using this to direct messages to approximately the correct department, it doesn't have to be that complicated.

If you're doing this to evaluate customer sentiment, you could probably just select a few hundred messages at random and read them. (There are many "big data" problems which are only big due to not sampling.)

i-am-agi2y ago

Wohoo this is amazing! I have been using the Autolabel (https://news.ycombinator.com/item?id=36409201) library so far for labeling a few classification and question answering datasets and have been seeing some great performance. Would be interested in giving gloo a shot as well to see if it helps performance further. Thanks for sharing this :)

andrewgazelka2y ago

My understanding was training on ChatGPT output was against OpenAI ToS. Is this incorrect for this use case (training BERT)?

caycep2y ago

what's "traditional ML"?


crazygringo2y ago· 22 in thread

This is really interesting.

I'm really wondering when LLM's are going to replace humans for ~all first-pass social media and forum moderation.

6 months from now? 3 years from now?

ghaff2y ago

lcnPylGDnU4H9OF2y ago

> Obviously humans will always be involved in coming up with moderation policy

This comment seems to respond to:

> Obviously humans will always be involved in [moderation]

ghaff2y ago

moffkalast2y ago

The Google motto of 'good enough for most and screw the edge cases'.

woeirua2y ago

rcarr2y ago

> or access is tied directly to your physical identity to access the site).

JimtheCoder2y ago

I have been thinking the same sort of thing as well over the last while.

Just random shower thoughts...

withinboredom2y ago

Did you just describe AOL??

JustBreath2y ago

The worst part is social media networks aren't necessarily against AI/bot engagement since it greatly fluffs their numbers and keeps their users occupied.

It seems inevitable that some sort of signature or identity proof will be necessary soon to participate in most online forums.

Either esoteric networking between people or straight up government/private entity issued multi factor authentication.

pradn2y ago

Isn’t there a limit to this when one requires an account to be tied to a phone number? Perhaps pseudonymous posting is on a countdown clock.

doliveira2y ago

Ironically for crypto bros, I think the way forward will be to codify the real-world trust structures into the digital world. The future is trustful.

I just really hope we find a way to codify it without scanning people's eyeballs into the blockchain like the guy in charge of the world's first AGI wants to do.

Enginerrrd2y ago

soultrees2y ago

Maybe that’s the method behind Reddit’s API madness this whole time. (/s). Now only the hugest brands can run their own bots.

janalsncm2y ago

Back of the envelope calculation says it could be possible now.

However, you can probably get that cost down a lot with your own models, which also has the benefit of not being at the mercy of arbitrary API pricing.
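
The arithmetic behind that estimate, using the figures from the fuller copy of this comment earlier in the thread (500M tweets per day, 28 characters per tweet, roughly 4 characters per token, $0.0015 per thousand tokens). It counts input tokens only and is a sketch of the estimate, not a pricing model.

```python
tweets_per_day = 500_000_000
chars_per_tweet = 28
chars_per_token = 4          # rough conversion quoted in the comment
usd_per_1k_tokens = 0.0015   # GPT-3.5 Turbo price quoted in the comment

tokens_per_day = tweets_per_day * chars_per_tweet / chars_per_token
cost_per_day = tokens_per_day / 1000 * usd_per_1k_tokens

print(f"{tokens_per_day:,.0f} tokens/day, about ${cost_per_day:,.0f}/day")
```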

zht2y ago

this is some black mirror stuff

imagine Google's general approach to customer service/moderation, but applied all over the place by companies small and large

I shudder at the thought

Xenoamorphous2y ago

I’ve found that it’s pretty much impossible to talk to a person in most customer services in the past few years, it’s always a “robot”. And this has been going since well before LLMs.

ghaff2y ago

crazygringo2y ago

I don't know, but generally speaking with technological progress, while we lose some things we gain more things. It's important to think not just what technology gets rid of, but what it enables.

adam_arthur2y ago

They are already sufficient for high level classification... it's just a question of cost.

maaanu2y ago

You are seriously telling me that humans are predicting word for word when they speak?

adam_arthur2y ago

A system that "predicts the next token" in such a way that it is indistinguishable from a human, is just like a human in practice yes.

How does a human decide which word to use in your mind? Magic?

stevenhuang2y ago

Actually yes, architecturally that's the essence of predictive coding.

It's among the leading theories in neuroscience for how our brains work https://en.wikipedia.org/wiki/Predictive_coding

19h2y ago· 19 in thread

We’re classifying gigabytes of intel (SOCMINT / HUMINT) per second and found semantic folding to be better in classification quality vs throughput than BERT / LLMs.

How it works — imagine you’re having these sentences:

“Acorn is a tree” and “acorn is an app”

You essentially keep record of all word to word relations internal to a sentence:

- acorn: is, a, an, app, tree, etc.

Now you repeat this for a few gigabytes of text. You’ll end up with a huge map of “word connections”.

You’ll end up with a vector that has a lot of zeroes — you can now sparsify it (I.e. store only the positions of the ones).

Sorry, couldn’t be too specific as I’m on the go - if you’re interested drop me a mail.

We’re using this to categorize literally tens of gigabytes per second with 92% precision into more than 72 categories.
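
A toy sketch of the word-connection step as described above, using the two example sentences. It only covers building the co-occurrence map and storing the positions of the ones; the function names are hypothetical, and the production system, its training data, and its classification filters are not reproduced here.

```python
from collections import defaultdict

def build_connections(sentences):
    """Map each word to the set of words it co-occurs with inside a sentence."""
    connections = defaultdict(set)
    for sentence in sentences:
        words = sentence.lower().split()
        for w in words:
            connections[w].update(x for x in words if x != w)
    return connections

def sparse_fingerprint(word, connections, vocab_index):
    """Keep only the positions of the ones in the word's binary connection vector."""
    return sorted(vocab_index[w] for w in connections[word])

sentences = ["Acorn is a tree", "acorn is an app"]
conn = build_connections(sentences)

# Fix a position for every word so each fingerprint indexes the same vector space.
vocab_index = {w: i for i, w in enumerate(sorted(conn))}
fp = sparse_fingerprint("acorn", conn, vocab_index)
```

With a real corpus the vocabulary is large and each word touches only a small part of it, which is why storing positions instead of the full vector pays off.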

wavemode2y ago

It’s the same as a giant one hot vector. He’s not describing anything terribly new or impressive, but if it works then god bless and good luck.

SomewhatLikely2y ago

Sounds like TF-IDF vectors.

lgas2y ago

Not to dogpile on all the other "isn't this just" messages, but isn't this just sparse embeddings?

lmeyerov2y ago

mmcwilliams2y ago

You're not wrong. This sounds curiously close to the ways I've seen word2vec used in production.

espe2y ago

edit: short version: not semantics and not a fingerprint :)

19h2y ago

We also trained on all of pushshift and have an average ”unknown” word rate of less than 0.007% — the Reddit corpus is rather amazing to capture pretty much all misspellings of a word.

We may only be using 16k vector values but that doesn’t mean we only have a vocab of 16k; our vocab is closer to 1.9 million words, each described by a sparse fingerprint of 16k.

foolswisdom2y ago

I'm curious though, how do you handle related forms of a word (assuming you don't use stemming)? It doesn't seem to me that this process would automatically handle that.

espe2y ago

thanks for the clarification. if your base population is that large then it's frequencies and you get a fingerprint. well done.

dr_kiszonka2y ago

19h2y ago

Interesting idea! Can you elaborate a bit more?

spyckie22y ago

Just asking, this seems very similar to the attention algorithm that powers LLMs?

- https://en.wikipedia.org/wiki/Semantic_folding

It’s not similar other than that attention relates tokens.

mistrial92y ago

19h2y ago

Care to elaborate? Not sure why this tone is appropriate.

The 92% is an average and not the exact accuracy across all categories; the accuracy varies by category as every category is represented by its own filter.

LewisDavidson2y ago

Do you have any code that demonstrates this? Sounds super interesting!

19h2y ago

Unfortunately, I can't. We have some projects bubbling around that may see the light of the day eventually but given the myriads of NDAs that stack on top of each other this is rather unlikely.

That said, here's some reading material on the underlying ideas:

- https://arxiv.org/pdf/1511.08855.pdf ("Semantic Folding Theory And its Application in Semantic Fingerprinting")

mynegation2y ago

They do but it’s probably… classified.

nestorD2y ago· 17 in thread

LLMs are significantly slower than traditional ML, typically costlier and, I have been told, tend to be less accurate than a traditional model trained on a large dataset.

SkyPuncher2y ago

To me, LLMs feel like "low-code" tools in most applicable domains.

They're very, very good at creating a new, novel solution - but specially trained ML models will rule.

godelski2y ago

> LLMs are significantly slower than traditional ML, typically costlier

Literally point 3 in the article.

> But, they are zero/few shot classifiers

radarsat12y ago

[0]: https://arxiv.org/abs/2109.01652

godelski2y ago

> It comes from this paper

To clarify, what comes from that paper? The claim that LLMs are zero-shot learners (yes) or the term zero-shot (no[0]).

> I believe the idea is that the LLM was not trained on the task in question

Not quite. We'll see in [0] that the definition is

> Zero-shot learning assumes disjoint training and test classes

[0] Zero-Shot Learning with Semantic Output Codes https://www.cs.toronto.edu/~hinton/absps/palatucci.pdf

[1] Zero-Shot Learning -- A Comprehensive Evaluation of the Good, the Bad and the Ugly https://arxiv.org/abs/1707.00600

[2] A Survey of Zero-Shot Learning: Settings, Methods, and Applications https://dl.acm.org/doi/10.1145/3293318

[3] Language Models are Few-Shot Learners https://arxiv.org/abs/2005.14165

PartiallyTyped2y ago

Respectfully, i disagree. I have used LLMs on actually novel tasks for which there aren’t any datasets out there. They “get it”.

> I don't know why people started using this term to describe the domain adaptation or transfer learning, but it is not okay. Zero-shot requires novel classes, and subsets are not novel.

Respectfully, i disagree.

Do the math, or read the paper “LLMs are meta learners”.

godelski2y ago

> I have used LLMs on actually novel tasks for which there aren’t any datasets out there. They “get it”.

> Zero-shot is perfectly valid because there is no backpropagation or weight change involved.

> These two change the effective weight of the matrices.

Tldr: A -> B doesn't require that B -> A

[0]A note on the evaluation of generative models: https://arxiv.org/abs/1511.01844 (link for also obtaining slides and code: http://theis.io/publications/17/)

Also worth looking at many of the works that cite this one: https://www.semanticscholar.org/paper/A-note-on-the-evaluati...

[1a] Assessing Generative Models via Precision and Recall: https://arxiv.org/abs/1806.00035

[1b] Improved Precision and Recall Metric for Assessing Generative Models: https://arxiv.org/abs/1904.06991

[2] The Role of ImageNet Classes in Fréchet Inception Distance: https://arxiv.org/abs/2203.06026

https://general-pattern-machines.github.io/

famouswaffles2y ago

>Zero-shot means out of domain, and if we're talking about text trained LLMs, there really isn't anything text that is out of domain for them because they are trained on almost anything you can find on the internet.

Side stepping the fact that isn't really how the term is used with models these days, i don't know about that.

jerrygenser2y ago

It is zero shot. The llm is trained to generate next token.

Zero shot is defined as being able to output predictions for classes they were not trained on.

It doesn't mean the input data can't be in ml task domain but that the model was not trained on this particular ML task and/or classes.

godelski2y ago

> It is zero shot. The llm is trained to generate next token.

I'm going to refer you to the sibling comments as they stated similar things and I answered them in depth and do not wish to repeat myself.

But to summarize:

Zero-shot := Goal of f:X → Z but train f':X → Y, where Y ⊂ Z. We test on A⊂Z, where A⊄Y (sometimes the definition is A ∩ Y = ∅, but I'm not being as strict)

> but that the model was not trained on this particular ML task and/or classes.

I'm going to need explicit clarification as to this. Explicitly or implicitly? See sibling comments and note about likelihood and density estimators w.r.t. classification.

The training classes in this case are words in the vocabulary in the context of a sentence.

hellovaiOP2y ago

That's a great summary and insight. We should likely use that verbiage to help make it more crystal clear :)

famouswaffles2y ago

Current State of the art (GPT-4) is not going to be less accurate than whatever bespoke option you can cook up.

withinboredom2y ago

I wouldn't be so sure of that.

mplewis2y ago

This is absolutely untrue.

famouswaffles2y ago

Feel free to show otherwise


potatoman222y ago

golergka2y ago

It will be much slower and costlier.

alexmolas2y ago· 12 in thread

Where's the comparison with traditional ML? In the article I only see the good things about using LLMs, but there's no mention of traditional ML aside from the title.

It would be nice to see how this "complex" approach compares against a "simple" TF-IDF + RF or SVM.

specproc2y ago

All I saw were attempts to reproduce some chatgpt output.

hellovaiOP2y ago

SetFit is fairly good, and we do help train SetFit like models for the results you get, however, the issue with SetFit is that its latency and cost benefits come at the price of flexibility.

specproc2y ago

I guess use case is everything. There are numerous reasons, not least of which confidentiality, why chatgpt is a no go for me.

What I'd like to see more of is systematic comparison between chatgpt and classic models. I was hoping to see a bit of this in this article and was disappointed.

espe2y ago

+1 for setfit. a baseline that's hard to beat.

famouswaffles2y ago

hellovaiOP2y ago

Thanks Alex, in this article we focused more on deployment comparisons, for example the cost and latency of what it would take to deploy a BERT based model vs LLMs.

This is probably very obvious to some people, but a lot of people's first experience with any sort of AI is often an LLM, so this is just the first of many posts we hope to share.

jonathankoren2y ago

Yeah, I also find the lack of the comparison suspicious. As is the talk about “hallucinated class labels” being “helpful”.

If I had to take a guess, I suspect the LLM might perform a touch better, but we’re talking a fractional percent better. Which is fine if you have the volume, but a wash otherwise.

famouswaffles2y ago

https://www.artisana.ai/articles/gpt-4-outperforms-elite-cro...

famouswaffles2y ago

Current State of the art (GPT-4) is mostly on par with experts and much better than crowdworkers. Might be overkill though.

3.5 (what is used here) is better than crowd workers https://arxiv.org/abs/2303.15056

lisasays2y ago

On par with "experts", no.

Per the article: "outperformed the most skilled crowdworkers" on nuanced (but not highly technical) tasks like sentiment labeling.

By definition, it can't outperform the expert ensemble because that's where the gold labels come from.

famouswaffles2y ago

It's as good or better than experts on 7/18 of those benchmarks. On an additional 4, it's close (within 0.05).

>By definition, it can't outperform the expert ensemble because that's where the gold labels come from.

The ensemble no but it can outperform an expert trying to solve it. But yes the benchmarks are biased to the experts here.

viraptor2y ago

rossirpaulo2y ago· 11 in thread

caesil2y ago

If you tell it "Return what this says about sharks or nothing if it does not mention them", it will mess up.

LawTalkingGuy2y ago

Have you tried this sort of prompt?

Stop token: ']}'

wrt OpenAI, text-davinci-003 handles this well, the other models not so much.

dontupvoteme2y ago

caesil2y ago

Yes, agreed. I'm doing this as well. Works excellently for NLP classifier tasks.

https://letscooktime.com/Blog/ai,/machine/learning,/chatgpt,...

galleywest2002y ago

Have you tried using GPT-4's new Function Call feature? The "killer" portion of this is guaranteed JSON based on a schema you pass to the model.

hellovaiOP2y ago

That's a good point! We're actually working on integrating this as well, but in practice, what we've found is that LLMs in general don't like to respond with empty strings, for example.

My hypothesis here is that due to RLHF, there's likely some implicit learning that tangentially related content is better than no content.

Given that, you'd likely still get better results with your schema being:

"string | null" so the LLM can output a null instead of "" since there is probably not as much training data that gives "" high log prob values.

But we're looking forward to evaluating function calling, and seeing what the metrics show!

guhidalg2y ago

I integrated the function calling feature into my personal project and wrote a blog post about it here:

Hopefully this saves you some time!