> ... traditional gold-standard approaches use human evaluators who score the quality of generated responses, which can be costly. However, since chat AIs are by definition deployed in social environments with humans, one can leverage user-interaction statistics as a meaningful and aligned measure of chat AI engagingness and quality. To assess the 'quality' of a chat AI, we consider two main proxy functions: the industry-standard user retention and the main objective function, user engagement.
Maybe retention and engagement _are_ sufficiently well correlated with human evaluations, but you should probably measure both and show that they're strongly correlated before you decide to just drop the human evaluators in favor of your cheap proxy measurements.
And in this field, where there are some known issues with chat LLMs, perhaps it's important to check stuff like:
- Does the model seem "engaging" just b/c the user has to refine their prompt several times before they get a satisfying response?
- Do responses include a lot of hallucinations which might be engaging but not true?
- Do successive responses show decreased consistency or coherence between messages, in a way that might accidentally elicit continued engagement?
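To make the first check concrete, here's a toy heuristic (entirely my own sketch, not anything from the paper) for spotting sessions whose raw message counts are inflated by repeated prompt refinement rather than genuine satisfaction. The threshold and the session encoding are illustrative assumptions:

```python
def retry_inflated_rate(sessions, retry_threshold=3):
    """Flag sessions that look 'engaged' by message count but show no
    positive signal — a crude proxy for retry-driven engagement.

    Each session is (num_user_messages, got_positive_feedback).
    Both the threshold and the feedback flag are toy assumptions.
    """
    flagged = [s for s in sessions
               if s[0] >= retry_threshold and not s[1]]
    return len(flagged) / len(sessions)

rate = retry_inflated_rate([(5, False), (2, True), (4, False), (1, True)])
# rate == 0.5: half the sessions rack up messages without ever satisfying the user
```

A real deployment would need an actual satisfaction signal (thumbs-up, retention, paraphrase detection between consecutive user turns), but even a crude split like this would tell you whether "engagement" is measuring delight or frustration.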
Overall, it seems sloppy to believe that it's not a waste of humans' time to talk to your chatbots, and not a waste of readers' time to look at this paper about your chatbots, but that it's too expensive for you to actually measure the quality of your chatbots' responses.
Engagement and user retention are directly connected to their bottom line in a way that quality responses (e.g. introducing you to a more fulfilling hobby than chatting with AIs) are not.
They are presenting a real-world use case where retention and engagement are clearly the metrics of interest. It's not even clear what "human evaluations" would mean in this context.
Kudos to not falling into the benchmark / human eval trap, and just testing your theories directly at scale in a deployment setting.
That's all? That works? Useful.
Could that be extended? It doesn't seem inherent in this that all the chat AIs have to be LLMs. Some might be special-purpose systems. Solvers or knowledge bases, such as Wolfram Alpha or a database front end, could play too. Systems at the Alexa/Siri level that can do simple tasks. Domain-specific systems with natural language in and out have been around for decades.
As there is no analysis of why that is better or evaluation of alternative approaches (what if you alternated A/B/A/B? Or cycled through them systematically A/B/C/A? or picked a different shuffle of A/B/C each time?), it's hard to say what this means.
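The alternatives listed above can all be expressed as per-turn selection policies, which would make an ablation straightforward. A minimal sketch (model names and function names are mine, purely illustrative):

```python
import random

def pick_round_robin(models, turn):
    """Deterministic A/B/C/A/... cycling by turn index."""
    return models[turn % len(models)]

def pick_uniform(models, turn):
    """The paper's apparent approach, roughly: sample a model each turn."""
    return random.choice(models)

def pick_shuffled_epochs(models, turn, seed=0):
    """A different shuffle of A/B/C for each pass through the list."""
    epoch, offset = divmod(turn, len(models))
    order = list(models)
    random.Random(seed + epoch).shuffle(order)
    return order[offset]

models = ["A", "B", "C"]
history = [pick_round_robin(models, t) for t in range(6)]
# round-robin yields A, B, C, A, B, C
```

Comparing retention across these policies (rather than just "random vs. single model") would say whether the benefit comes from diversity per se or from something specific about random sampling.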
My best guess is that this reflects the fact that GPT is, thanks to RLHF, boring. It has mode-collapse and does things like tell one of a handful of jokes every time. It will write a rhyming poem even if you ask it for a different kind of poem. And so on.
The random sampling of different models serves as a rather ad hoc way of avoiding the RLHF boringness. The various models might all be tuned similarly, but they won't yield identical results, and this sneaks in response diversity through the backdoor, undoing the same-ness from the RLHF mode collapse.
You used to be able to increase the sampling temperature on GPT to undo some of this blandness, but since RLHF flattens the logits in GPT-4, it's unclear if that still helps. So swapping in random models may be a useful trick. (Although fixing the tuning itself would be much more desirable.)
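For reference, the temperature trick works by rescaling logits before the softmax; a higher temperature flattens the output distribution, which is why it can partially counteract mode collapse — unless, as noted above, the tuned logits are already so peaked that rescaling has little to work with. A minimal sketch:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by T before softmax: T > 1 flattens the
    distribution (more diverse samples), T < 1 sharpens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.0]
cold = softmax_with_temperature(logits, 0.5)  # sharper: top token dominates
hot = softmax_with_temperature(logits, 2.0)   # flatter: probability spreads out
```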
A Yi 34B or Mixtral finetune on the same data would blow them out of the water. Probably blow ChatGPT 3.5 out of the water as well.
If the effect is there I would guess a few bad models should outperform a mediocre one, and a few mediocre ones should outperform a state-of-the-art one.
Of course it would be good to show the same again with GPT4 and maybe 3 GPT3.5 size models, but it's not necessary to show that such an effect exists, and maybe cost prohibitive for them as a research team. Now whether their methodology for proving this effect is correct is another discussion.
Personally I don't find these results surprising: our brain is also somewhat compartmentalized, so why wouldn't the same hold for a good AI system?
The more difficult part is: how do you train these subnetworks optimally?
You really haven't done much with those models if they seem remotely comparable.
To me, GPT-3.5 can just summarise and provide general answers to questions, whereas GPT-4 can actually understand nuance and do what seems to me to be reasoning.
https://github.com/cg123/mergekit
you can slice off layers and blend models with different strategies. The dev's blog is great: https://goddard.blog/posts/
...But it's not what this paper is describing. They are basically alternating models, AFAIK. I also have other nitpicks with the paper, like using extremely old/mediocre chat models as bases:
> Pygmalion 6B, Vicuna 13B, Chai Model 6B
So a more realistic hope would be: we're the horses, and the drivers are driving us to our carrots.
Edit: On second thought, depending on how it's actually implemented, the other two tokens are probably run through the model in parallel, so it shouldn't be all that much slower.
But here since the previous few tokens were produced by another model, the current model has never seen them and as such, by definition, doesn't have those calculations stored, but it still needs them to properly calculate attention for the new token.
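One way to see the cost is to count token forward-passes: with a single model, the KV cache lets each new token be processed once, but if each turn's model last saw none of the prefix, it must re-encode the whole conversation before generating. A toy accounting (my sketch, not the paper's implementation):

```python
def forward_cost(turn_lengths):
    """Token forward-passes with a warm KV cache (one persistent model)
    vs. a cold cache each turn (model swapped, prefix re-encoded)."""
    warm = cold = prefix = 0
    for n in turn_lengths:
        warm += n            # cached keys/values reused; only new tokens run
        cold += prefix + n   # re-encode the prefix, then the new tokens
        prefix += n
    return warm, cold

warm, cold = forward_cost([10, 10, 10])
# warm == 30, cold == 60: cold-cache work grows quadratically with turns
```

Note the prefix re-encode is a prefill, which parallelizes well across tokens, so the latency penalty is smaller than the raw compute penalty — consistent with the "run in parallel" point above.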