For example, Jeep consistently lands at the bottom of the reliability ratings. Try asking GPT if Jeeps are reliable. The response reads like Jeep advertising.
My impression is that different LLMs are more or less people-pleasing. I found Grok is more willing to tell me something is a bad idea.
Looking at the reasoning traces of the new reasoning models, you can actually see how fine-tuning is moving toward having models list their assumptions about data sources, decide which should be trusted, lay out multiple perspectives, and then summarize, which produces better answers. You can do that today with non-reasoning models, but you have to prompt engineer it and ask for that behavior explicitly (a sketch of one way to do this follows below). This process of not just identifying extant content, but teaching systems how to approach problem analysis (instruction tuning, reasoning traces, etc.) will be key to influencing how the models work and, increasingly, how they are differentiated.
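As a minimal sketch of what "asking for it explicitly" can look like, assuming the OpenAI Python SDK and an API key in the environment (the model name and prompt wording here are just placeholders, not a recommendation):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # System prompt that asks a non-reasoning model to surface assumptions,
    # sources, and competing views before committing to an answer.
    system_prompt = (
        "Before answering, list your assumptions, name the data sources you are "
        "drawing on and how much each should be trusted, lay out at least two "
        "competing perspectives, and only then give a summarized answer."
    )

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat model works
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "Are Jeeps reliable?"},
        ],
    )
    print(response.choices[0].message.content)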
In general, the models lean towards being Yes-Men on just about every topic, including things without official sources. I think this is a byproduct of them being trained to be friendly and agreeable. Nobody wants a product that's rude or contrarian, and this puts a huge finger on the scale. I imagine a model unfiltered for safety, attitude, and political correctness would have less of this bias (but perhaps more of other biases).
https://chatgpt.com/share/67f57459-2744-8009-a94e-3b67dce8fd...
“[Jeeps] often score below average in reliability rankings from sources like Consumer Reports and J.D. Power.”
https://g.co/gemini/share/b5e5ea80548b
Seems entirely reasonable to me. Didn't have to trick it into providing citations.
If you want to know how modern Jeep models stack up against their peers in terms of reliability, try asking GPT that question!
Our current LLMs are kneecapped because they are very reluctant to be negative.