> ... traditional gold-standard approaches use human evaluators who score the quality of generated responses, which can be costly. However, since chat AIs are by definition deployed in social environments with humans, one can leverage user-interaction statistics as a meaningful and aligned measure of chat AI engagingness and quality. To assess the 'quality' of a chat AI, we consider two main proxy functions: the industry-standard user retention and the main objective function, user engagement.
Maybe retention and engagement _are_ sufficiently well correlated with human evaluations, but you should probably measure both and show that they're strongly correlated before you decide to just drop the human evaluators in favor of your cheap proxy measurements.
And in this field, where there are some known issues with chat LLMs, perhaps it's important to check stuff like:
- Does the model seem "engaging" just b/c the user has to refine their prompt several times before they get a satisfying response?
- Do responses include a lot of hallucinations which might be engaging but not true?
- Do successive responses show decreased consistency or coherence between messages, in a way that might accidentally elicit continued engagement?
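To make the first check concrete, here's a toy heuristic (entirely my own sketch, not anything from the paper) for spotting sessions whose raw message counts are inflated by repeated prompt refinement rather than genuine satisfaction. The threshold and the session encoding are illustrative assumptions:

```python
def retry_inflated_rate(sessions, retry_threshold=3):
    """Flag sessions that look 'engaged' by message count but show no
    positive signal — a crude proxy for retry-driven engagement.

    Each session is (num_user_messages, got_positive_feedback).
    Both the threshold and the feedback flag are toy assumptions.
    """
    flagged = [s for s in sessions
               if s[0] >= retry_threshold and not s[1]]
    return len(flagged) / len(sessions)

rate = retry_inflated_rate([(5, False), (2, True), (4, False), (1, True)])
# rate == 0.5: half the sessions rack up messages without ever satisfying the user
```

A real deployment would need an actual satisfaction signal (thumbs-up, retention, paraphrase detection between consecutive user turns), but even a crude split like this would tell you whether "engagement" is measuring delight or frustration.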
Overall, it seems sloppy to believe that it's not a waste of humans' time to talk to your chatbots, and not a waste of readers' time to look at this paper about your chatbots, but that it's too expensive for you to actually measure the quality of your chatbots' responses.
Engagement and user retention are directly connected to their bottom line in a way that quality responses (e.g. introducing you to a more fulfilling hobby than chatting with AIs) are not.
They are presenting a real-world use case where retention and engagement are clearly the metrics of interest. It's not even clear what "human evaluations" would mean in this context.
Kudos to not falling into the benchmark / human eval trap, and just testing your theories directly at scale in a deployment setting.
That's all? That works? Useful.
Could that be extended? It doesn't seem inherent in this that all the chat AIs have to be LLMs. Some might be special-purpose systems. Solvers or knowledge bases, such as Wolfram Alpha or a database front end, could play too. Systems at the Alexa/Siri level that can do simple tasks. Domain-specific systems with natural language in and out have been around for decades.
As there is no analysis of why that is better or evaluation of alternative approaches (what if you alternated A/B/A/B? Or cycled through them systematically A/B/C/A? or picked a different shuffle of A/B/C each time?), it's hard to say what this means.
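The alternatives listed above can all be expressed as per-turn selection policies, which would make an ablation straightforward. A minimal sketch (model names and function names are mine, purely illustrative):

```python
import random

def pick_round_robin(models, turn):
    """Deterministic A/B/C/A/... cycling by turn index."""
    return models[turn % len(models)]

def pick_uniform(models, turn):
    """The paper's apparent approach, roughly: sample a model each turn."""
    return random.choice(models)

def pick_shuffled_epochs(models, turn, seed=0):
    """A different shuffle of A/B/C for each pass through the list."""
    epoch, offset = divmod(turn, len(models))
    order = list(models)
    random.Random(seed + epoch).shuffle(order)
    return order[offset]

models = ["A", "B", "C"]
history = [pick_round_robin(models, t) for t in range(6)]
# round-robin yields A, B, C, A, B, C
```

Comparing retention across these policies (rather than just "random vs. single model") would say whether the benefit comes from diversity per se or from something specific about random sampling.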
My best guess is that this reflects the fact that GPT is, thanks to RLHF, boring. It has mode-collapse and does things like tell one of a handful of jokes every time. It will write a rhyming poem even if you ask it for a different kind of poem. And so on.
The random sampling of different models serves as a rather ad hoc way of avoiding the RLHF boringness. The various models might all be tuned similarly, but they won't yield identical results, and this sneaks in response diversity through the backdoor, undoing the same-ness from the RLHF mode collapse.
You used to be able to increase the sampling temperature on GPT to undo some of this blandness, but since RLHF flattens the logits in GPT-4, it's unclear if that still helps. So swapping in random models may be a useful trick. (Although fixing the tuning itself would be much more desirable.)
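For reference, the temperature trick works by rescaling logits before the softmax; a higher temperature flattens the output distribution, which is why it can partially counteract mode collapse — unless, as noted above, the tuned logits are already so peaked that rescaling has little to work with. A minimal sketch:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by T before softmax: T > 1 flattens the
    distribution (more diverse samples), T < 1 sharpens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.0]
cold = softmax_with_temperature(logits, 0.5)  # sharper: top token dominates
hot = softmax_with_temperature(logits, 2.0)   # flatter: probability spreads out
```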
A Yi 34B or Mixtral finetune on the same data would blow them out of the water. Probably blow ChatGPT 3.5 out of the water as well.
If the effect is there I would guess a few bad models should outperform a mediocre one, and a few mediocre ones should outperform a state-of-the-art one.
Of course it would be good to show the same again with GPT4 and maybe 3 GPT3.5 size models, but it's not necessary to show that such an effect exists, and maybe cost prohibitive for them as a research team. Now whether their methodology for proving this effect is correct is another discussion.
Personally I don't find these results surprising: our brain is also somewhat compartmentalized, so why wouldn't the same hold for a good AI system?
The more difficult part is: how do you train these subnetworks optimally?
You really haven't done much with those models if they seem remotely comparable.
To me, GPT-3.5 can just summarise and provide general answers to questions, whereas GPT-4 can actually understand nuance and do what seems to me to be reasoning.
https://github.com/cg123/mergekit
you can slice off layers and blend models with different strategies. The dev's blog is great: https://goddard.blog/posts/
...But it's not what this paper is describing. They are basically alternating models, AFAIK. I also have other nitpicks with the paper, like using extremely old/mediocre chat models as bases:
> Pygmalion 6B, Vicuna 13B, Chai Model 6B
So a more realistic hope would be: we're the horses, and the drivers are driving us to our carrots.
Edit: On second thought, depending on how it's actually implemented, the other two tokens are probably run through the model in parallel, so it shouldn't be all that much slower.
But here since the previous few tokens were produced by another model, the current model has never seen them and as such, by definition, doesn't have those calculations stored, but it still needs them to properly calculate attention for the new token.
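One way to see the cost is to count token forward-passes: with a single model, the KV cache lets each new token be processed once, but if each turn's model last saw none of the prefix, it must re-encode the whole conversation before generating. A toy accounting (my sketch, not the paper's implementation):

```python
def forward_cost(turn_lengths):
    """Token forward-passes with a warm KV cache (one persistent model)
    vs. a cold cache each turn (model swapped, prefix re-encoded)."""
    warm = cold = prefix = 0
    for n in turn_lengths:
        warm += n            # cached keys/values reused; only new tokens run
        cold += prefix + n   # re-encode the prefix, then the new tokens
        prefix += n
    return warm, cold

warm, cold = forward_cost([10, 10, 10])
# warm == 30, cold == 60: cold-cache work grows quadratically with turns
```

Note the prefix re-encode is a prefill, which parallelizes well across tokens, so the latency penalty is smaller than the raw compute penalty — consistent with the "run in parallel" point above.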