For example, Jeep consistently lands at the bottom of the reliability ratings. Try asking GPT if Jeeps are reliable. The response reads like Jeep advertising.
My impression is that different LLMs are more or less people-pleasing. I found Grok is more willing to tell me something is a bad idea.
Looking at the reasoning traces of the new reasoning models, you can actually see how fine-tuning is moving toward having models list their assumptions about data sources, decide which should be trusted, lay out multiple perspectives, and then summarize, which produces better answers. You can do that today with non-reasoning models, but you have to prompt engineer it and ask for that behavior explicitly (a sketch of one way to do this follows below). This process of not just identifying extant content, but teaching systems how to approach problem analysis (instruction tuning, reasoning traces, etc.) will be key to influencing how the models work and, increasingly, how they are differentiated.
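As a minimal sketch of what "asking for it explicitly" can look like, assuming the OpenAI Python SDK and an API key in the environment (the model name and prompt wording here are just placeholders, not a recommendation):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # System prompt that asks a non-reasoning model to surface assumptions,
    # sources, and competing views before committing to an answer.
    system_prompt = (
        "Before answering, list your assumptions, name the data sources you are "
        "drawing on and how much each should be trusted, lay out at least two "
        "competing perspectives, and only then give a summarized answer."
    )

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat model works
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "Are Jeeps reliable?"},
        ],
    )
    print(response.choices[0].message.content)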
In general, the models lean towards being Yes-Men on just about every topic, including things without official sources. I think this is a byproduct of them being trained to be friendly and agreeable. Nobody wants a product that's rude or contrarian, and this puts a huge finger on the scale. I imagine a model unfiltered for safety, attitude, and political correctness would have less of this bias (but perhaps more of other biases).
https://chatgpt.com/share/67f57459-2744-8009-a94e-3b67dce8fd...
“[Jeeps] often score below average in reliability rankings from sources like Consumer Reports and J.D. Power.”
https://g.co/gemini/share/b5e5ea80548b
Seems entirely reasonable to me. Didn't have to trick it into providing citations.
If you want to know how modern Jeep models stack up against their peers in terms of reliability, try asking GPT that question!
Our current LLMs are kneecapped because they are very reluctant to be negative.