It’s literally monkeys with typewriters pressing keys randomly.
Until we get new models which have true understanding, they will never be truly useful.
The response was: The sound of cracking knuckles is actually a release of gas in the joint, but there is little scientific evidence to support the idea that it leads to arthritis or other joint problems. In fact, several studies have shown that there is no significant difference in the development of arthritis or other joint conditions between people who crack their knuckles and those who do not.
I wonder if OpenAI used that paper to fine-tune ChatGPT.
That's only partially correct at best, because joint cracking is not a single phenomenon. For one, there is disagreement over whether the sound comes from the formation of the gas bubble, or the collapse of the gas bubble (cavitation). It is likely that both are partially true.
Furthermore, it doesn't explain the sort of joint cracking in which people can crack a joint repeatedly with no cooldown. I can crack my wrists and ankles by twisting my hands and feet around in circles, cracking once per rotation as fast as I can turn them around. 60 cracks a minute easily. Neither of the gas bubble hypotheses can explain this; it is almost certainly a matter of ligaments or bones moving against each other in a way that creates a snap sound, much as snapping your fingers makes a crack sound by slapping your middle finger against the base of your thumb.
I don't think we know exactly what the improvement consists of.
It's also clear that RLHF models can be given instructions to reduce this issue. And in many production LLMs, something called few-shot prompting (in-context learning) is used, where you provide several examples and then ask for a new case. Accuracy improves again this way, because the model "deduces" that you are being serious and humourless when asking about mirrors breaking, and are not asking in the context of a story.
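To make the few-shot idea concrete, here is a rough sketch of how such a prompt might be assembled. The function name `build_few_shot_prompt`, the `Q:`/`A:` layout, and the example pairs are all my own illustrative assumptions, not any particular vendor's format; the resulting string would then be sent to whatever model you use.

```python
def build_few_shot_prompt(examples, question):
    """Join (question, answer) pairs, then append the new question.

    The "Q:/A:" layout is an assumed convention for illustration only.
    """
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    # Leave the final answer blank so the model completes it.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# Hypothetical examples that set a literal, factual tone.
examples = [
    ("What happens if you walk under a ladder?",
     "Nothing in particular happens."),
    ("Does cracking your knuckles cause arthritis?",
     "There is little scientific evidence that it does."),
]

prompt = build_few_shot_prompt(examples, "What happens if you break a mirror?")
print(prompt)
```

The point of the examples is only to signal the register: after two plainly factual answers, a superstition-flavoured completion for the mirror question becomes less likely.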
It's also one of the only datasets on which performance fails to improve with scale. (There was a big challenge with cash prizes to find other such datasets, and pretty much everything it found either can be worked around or doesn't affect RLHF models like ChatGPT.) So it doesn't represent a wider trend of truthfulness above human accuracy being impossible (dated training data and data drift, on the other hand, clearly remain an issue).