Alignment is not free: How model upgrades can silence your confidence signals

(variance.co)

121 points · karinemellata · 1y ago · 67 comments

erwin-co 1y ago · 9 in thread

Why not make a completely raw uncensored LLM? Seems it would be more "intelligent".

khafra 1y ago

"LLM whisperer" folks will confidently claim that base models are substantially smarter than fine-tuned chat models, with qualitative differences in capabilities. But you have to be an LLM whisperer to get useful work out of a base model, since they're not SFT'ed, RLHF'ed, or RLAIF'ed into actually wanting to help you.

andai 1y ago

How can I learn more about this?

Is it like in the early GPT-3 days, when you had to give it a bunch of examples and hope it catches the pattern?
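For anyone who has not seen that style of prompting, here is a minimal sketch of building a few-shot prompt for a base (non-chat) model. The function name, the `Input`/`Output` labels, and the translation examples are all made up for illustration; no specific model or API is assumed, and the resulting string would simply be sent to whatever completion endpoint you use.

```python
def build_few_shot_prompt(examples, query, in_label="Input", out_label="Output"):
    """Lay out worked examples so a base model can continue the pattern.

    `examples` is a list of (input, output) pairs. The labels are
    arbitrary (hypothetical) choices, not a required format.
    """
    lines = []
    for source, target in examples:
        lines.append(f"{in_label}: {source}")
        lines.append(f"{out_label}: {target}")
        lines.append("")  # blank line separates examples
    # End on an open output label so the model's continuation is the answer.
    lines.append(f"{in_label}: {query}")
    lines.append(f"{out_label}:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "bread",
)
print(prompt)
```

The only "trick" is ending the prompt on an unfinished line, so that the most likely continuation is the thing you want.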

Der_Einzige 1y ago

Me being old man yelling at cloud about how your chat/tool template matters more than your post-training technique.

DeepSeek-R1 is trivially converted back to a non-reasoning model with just chat template modifications. I bet you can chat template your way into a good quality model from a base model, no RLHF/DPO/SFT/GRPO needed.

msp26 1y ago

Brand safety. Journalists would write articles about the models being 'dangerous'.

qwertytyyuu 1y ago

Before RLHF, it was much harder to use. Remember the difference between GPT-3 and ChatGPT: the fine-tuning for chat made it easier to use.

teruakohatu 1y ago

In theory that sounds great, but most LLM providers are trying to produce useful models that ultimately will be widely used and make them money.

A model that is more correct but swears and insults the user won't sell. Likewise a model that gives criminal advice is likely to open the company up to lawsuits in certain countries.

A raw LLM might perform better on a benchmark but it will not sell well.

andai 1y ago

Disgusted by ChatGPT's flattery and willingness to go along with my half-baked nonsense, I created an anti-ChatGPT, which is unfriendly and pushes back on nonsense as hard as possible.

All my friends hate it, except one guy. I used it for a few days, but it was exhausting.

I figured out the actual use cases I was using it for, and created specialized personas that work better for each one. (Project planning, debugging mental models, etc.)

I now mostly use a "softer" persona that's prompted to point out cognitive distortions. At some point I realized, I've built a therapist. Hahaha.

alganet 1y ago

What kinds of content do you want them to produce that they currently do not?

simion314 1y ago

> What kinds of content do you want them to produce that they currently do not?

OpenAI models refuse to translate or otherwise transform some traditional, popular stories because of violence. In my case the story was about a bad wolf eating some young goats that did not listen to their mother's advice.

So now try to give me a prompt that works with any text and that convinces the AI that it is OK in fiction to have violence or bad guys/animals that get punished.

Now I am also wondering whether it censors the Bible, where a supposedly good God kills young children with ugly illnesses to punish the adults, or whether they made exceptions for this book.

behnamoh 1y ago · 6 in thread

there's evidence that alignment also significantly reduces model creativity: https://arxiv.org/abs/2406.05587

it's similar to humans. when restricted in terms of what they can or cannot say, they become more conservative and cannot really express all sorts of ideas.

exe34 1y ago

> it's similar to humans. when restricted in terms of what they can or cannot say, they become more conservative and cannot really express all sorts of ideas.

This reminds me of the time when I was a child, and my parents decreed that all communications would henceforth happen in English. I became selectively mute. I responded yes/no, and had nothing further to add and ventured no further information. The decree lasted about a week.

andai 1y ago

What did you use to communicate before that? Were you fluent in English?

malfist 1y ago

How are you defining "creativity" in the context of a statistical model?

hansvm 1y ago

> defined as syntactic and semantic diversity

Alex_001 1y ago

That paper is a great pointer — the creativity vs. alignment trade-off feels a lot like the "risk-aversion" effect in humans under censorship or heavy supervision. It makes me wonder: as we push models to be more aligned, are we inherently narrowing their output distribution to safer, more average responses?

And if so, where’s the balance? Could we someday see dual-mode models — one for safety-critical tasks, and another more "raw" mode for creative or exploratory use, gated by context or user trust levels?

gamman 1y ago

Maybe this maps to some human structures that manage the control-creativity trade-off through hierarchy?

I feel that companies with top-down management would have more agency and perhaps creativity towards (but not at) the top, and the implementation would be delegated to bottom layers with increasing levels of specification and restriction.

If this translates, we might have multiple layers with varied specialization and control, and hopefully some feedback mechanisms about feasibility.

Since some hierarchies are familiar to us from real-life, we might prefer these to start with.

It can be hard to find humans that are very creative but also able to integrate consistently and reliably (in a domain). Maybe a model doing both well would also be hard to build compared to stacking a few different ones on top of each other with delegation.

I know it's already being done by dividing tasks between multiple steps and models / contexts in order to improve efficiency, but having explicit strong differences of creativity between layers sounds new to me.

Centigonal 1y ago · 3 in thread

Very interesting! The one thing I don't understand is how the author made the jump from "we lost the confidence signal in the move to 4.1-mini" to "this is because of the alignment/steerability improvements."

Previous OpenAI models were instruct-tuned or otherwise aligned, and the author even mentions that model distillation might be destroying the entropy signal. How did they pinpoint alignment as the cause?
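To make "entropy signal" concrete, here is a rough sketch of one way to turn per-token top-k log-probabilities into an entropy-based confidence score. This is an assumption about what such a signal could look like, not the article's actual pipeline: the function names are hypothetical, the input is just a list of top-k logprob lists per generated token, and the probability mass outside the top-k is lumped into a single "other" bucket, which underestimates the true entropy.

```python
import math


def token_entropy(top_logprobs):
    """Approximate Shannon entropy (in nats) at one token position.

    `top_logprobs` holds the log-probabilities of the top-k candidate
    tokens. Mass outside the top-k is treated as one extra outcome,
    so this is a lower bound on the full-vocabulary entropy.
    """
    probs = [math.exp(lp) for lp in top_logprobs]
    remainder = max(0.0, 1.0 - sum(probs))
    if remainder > 0.0:
        probs.append(remainder)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(per_token_top_logprobs):
    """Average the per-position entropies over a whole completion."""
    entropies = [token_entropy(t) for t in per_token_top_logprobs]
    return sum(entropies) / len(entropies)


# A confident position (almost all mass on one token) vs. an uncertain one.
confident = [math.log(0.98), math.log(0.01), math.log(0.01)]
uncertain = [math.log(0.4), math.log(0.3), math.log(0.3)]
print(token_entropy(confident), token_entropy(uncertain))
print(mean_entropy([confident, uncertain]))
```

If a newer model puts nearly all of its mass on one token regardless of whether it is right, scores like this collapse toward zero everywhere and stop separating good answers from bad ones.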

mlin4589 1y ago

Good question! We do know from OpenAI's system card for GPT-4 that the post-trained RLHF model is significantly less calibrated than the pre-trained model, so we are speculating that something similar is occurring here. However, it's more of a hunch than anything. I would be curious whether it's possible to reproduce this behavior, or to measure the impact of distillation on calibration.

Disclaimer: I wrote this blog post.

itchyjunk 1y ago

Could you please elaborate on what less or more calibrated means here? Thanks!
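One common way to put a number on calibration is expected calibration error (ECE): bucket predictions by stated confidence and compare each bucket's average confidence with its actual accuracy. The sketch below is a generic illustration, not taken from the article or the GPT-4 system card; the function name, the equal-width binning, and the toy data are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and accuracy.

    `confidences` are probabilities in [0, 1]; `correct` are booleans.
    A perfectly calibrated model scores 0: among answers given with
    90% confidence, about 90% are right.
    """
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# Says "90% sure" ten times and is right nine times: well calibrated.
calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
# Says "90% sure" ten times but is right only five times: overconfident.
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
print(calibrated, overconfident)
```

"Less calibrated" then just means a larger gap: the model's stated or implied confidence tells you less about how often it is actually right.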

Workaccount2 1y ago

Wouldn't it be something if AI parlance crept into common parlance...

sega_sai 1y ago · 3 in thread

Can we have models also return a probability, reflecting how accurate their statements are?

cyanydeez 1y ago

Sure, but then you need probability stats on the probability stats.

sega_sai 1y ago

I am not sure what you mean. The idea is that the network should return the text, and a confidence expressed as a probability. During training, the log-score should be optimized. (I'm not sure it would actually work given how the training is structured, but something like this would be useful.)
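As a small illustration of why the log-score is a natural target here: it is a proper scoring rule, so the expected score is maximized by reporting the true probability of being right. The sketch below only checks that property numerically; the function names and the 70% figure are made up, and nothing here says how a real LLM training loop would use it.

```python
import math


def log_score(reported_p, was_correct):
    """Log-score of a single stated confidence (higher is better)."""
    return math.log(reported_p if was_correct else 1.0 - reported_p)


def expected_log_score(reported_p, true_p):
    """Expected log-score when statements are actually right with probability true_p."""
    return true_p * math.log(reported_p) + (1.0 - true_p) * math.log(1.0 - reported_p)


# Suppose statements of some kind are right 70% of the time (hypothetical).
true_p = 0.7
honest = expected_log_score(0.7, true_p)
overconfident = expected_log_score(0.9, true_p)
hedging = expected_log_score(0.5, true_p)
print(honest, overconfident, hedging)
```

Both overclaiming (0.9) and hedging at 0.5 score worse in expectation than reporting 0.7, which is what makes the rule a reasonable training signal for a confidence output.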

jsnider3 1y ago

You can ask a model to give you probability estimates of its confidence, but none of the frontier models were trained to be good at giving probability estimates to my knowledge.

gotoeleven 1y ago · 1 in thread

I don't know if it's still comedy or has now reached the stage of farce, but I still at least always get a good laugh when I see another article about the shock and surprise of researchers finding that training LLMs to be politically correct makes them dumber. How long until they figure out that the only solution is to know the correct answer but to give the politically correct answer (which is the strategy humans use)?

Technically, why not implement alignment/debiasing as a secondary filter with its own weights that are independent of the core model which is meant to model reality? I suspect it may be hard to get enough of the right kind of data to train this filter model, and most likely it would be best to have the identity of the user be in the objective.

mlin4589 1y ago

The reality, I suspect, is that internally models are likely modeling these alignment features such as refusals as a secondary filter.

In fact, for many models you can remove refusals rather trivially with linear steering vectors through SAEs.

https://www.alignmentforum.org/posts/jGuXSZgv6qfdhMCuJ/refus...

Additionally, you can often jailbreak these models by fine-tuning the model on a handful of curated samples.
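For readers unfamiliar with the idea, here is a toy sketch of what removing a single linear direction from activations means. It is an assumption-laden illustration and not the method from the linked post: random NumPy arrays stand in for real residual-stream activations, the direction is estimated with a plain difference of means rather than an SAE, and all function and variable names are hypothetical.

```python
import numpy as np


def difference_of_means(acts_a, acts_b):
    """Candidate direction: mean activation on set A minus mean on set B."""
    return acts_a.mean(axis=0) - acts_b.mean(axis=0)


def ablate_direction(activations, direction):
    """Project each activation vector onto the subspace orthogonal to `direction`."""
    unit = direction / np.linalg.norm(direction)
    return activations - np.outer(activations @ unit, unit)


# Toy stand-ins for activations: 32 examples, hidden size 16.
rng = np.random.default_rng(0)
refused = rng.normal(size=(32, 16)) + 2.0 * np.eye(16)[0]  # shifted along axis 0
complied = rng.normal(size=(32, 16))

direction = difference_of_means(refused, complied)
cleaned = ablate_direction(refused, direction)

# After ablation, nothing is left along the estimated direction.
unit = direction / np.linalg.norm(direction)
print(np.abs(cleaned @ unit).max())
```

In a real model the same projection would be applied to hidden states (or folded into weights) at inference time; whether that removes refusals for a given model is an empirical question this toy cannot answer.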

qwertytyyuu 1y ago

People use LLMs as part of their high-precision systems? That's worrying.

user_7832 1y ago

It's kinda ironic but parts of the article read like they were written by an LLM itself

rusk 1y ago

Upgrade scripts it is so. Plus ça change.
