If the answer is “yes”, our definition of alignment kind of sucks.
For anyone who isn't keeping up, there is also work being done [0] to understand how models represent ethical considerations internally. Mainly, one suspects, to make open models less ethical on demand rather than to support alignment. It turns out that models tend to learn some sort of internal "how moral is this?" axis when refusing queries, one that can be identified and interfered with.
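For concreteness, what that line of work does is roughly a difference-of-means probe plus ablation. A minimal sketch in PyTorch, where `get_residual`, the layer index, and the prompt lists are hypothetical stand-ins rather than any real library's API:

    import torch

    def mean_activation(get_residual, prompts, layer):
        # Average residual-stream activation at the last token of each prompt.
        return torch.stack([get_residual(p, layer)[-1] for p in prompts]).mean(dim=0)

    def moral_direction(get_residual, refused, answered, layer=12):
        # Difference of means between refused and answered prompts gives a
        # candidate "how moral is this?" axis.
        d = mean_activation(get_residual, refused, layer) \
            - mean_activation(get_residual, answered, layer)
        return d / d.norm()

    def ablate(hidden, direction):
        # Project the axis out of a hidden state; doing this at inference time
        # is the "interfered with" part (the model then stops refusing).
        return hidden - (hidden @ direction) * direction

"Identified" is the direction-finding step; "interfered with" is the ablation.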
For months, I've read all of Anthropic's blog posts and used Claude Code for a couple of big projects.
I used every single trick in the book. I went all the way on organising and measuring: for some things I tracked how the experience felt and how much money I spent after adopting a given set of techniques.
So far, the only thing that seems to make sense is a few hooks and scripts that mitigate the stupid token consumption, like using code indexers instead of grep (sketch below). And that's purely cost related; results fluctuated so much that I couldn't identify a single technique that consistently made the code better.
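The hooks I mean look roughly like this: a PreToolUse hook that blocks repo-wide grep and nudges the agent toward an indexed lookup instead. A sketch assuming Claude Code's documented hook contract (tool-call JSON on stdin; exit code 2 blocks the call and stderr is fed back to the model) - check the hooks docs for your version:

    #!/usr/bin/env python3
    # Block raw grep; a repo-wide grep dumps a pile of tokens into context.
    import json
    import sys

    event = json.load(sys.stdin)  # tool-call event from Claude Code
    tool = event.get("tool_name", "")
    command = event.get("tool_input", {}).get("command", "")

    if tool == "Grep" or (tool == "Bash" and "grep" in command):
        # With exit code 2, this message is fed back to the model.
        print("Use the code indexer (ctags/LSP symbol lookup) instead of grep.",
              file=sys.stderr)
        sys.exit(2)  # block this tool call

    sys.exit(0)  # allow everything else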
And to be clear, Claude 4.7 is bad: double the money daily, and it has been the one experiment where I consistently ended my day frustrated at the poor code it produced. It did follow the instructions, in the worst and most expensive way. Man... it almost seems like it spits out more tokens on purpose.
Oh yeah, and whenever you say "add OpenAI integration" it keeps strongly suggesting you use Anthropic models instead... F annoying. How do I make sure it doesn't pick libraries based on commercial agreements rather than the best fit for the case?
This last week I switched to DeepSeek V4 Pro, and heck yeah, that's a better experience.
Because what is aligned, how, and for whom? And who decides what that alignment should look like? There are probably many domains whose alignment requirements conflict with one another (e.g. using LLMs for warfare vs. in ethically bound domains). I can't imagine how this can be viable at the required scale (something like one model per domain) given the already huge investments.
A related question for setting intent for integration/testing: instead of stating the goal, pedagogy in those fields states the concrete problem and asks the student for an answer before they've been taught the principles or approaches, as a way of motivating the training (a bit like philosophers posing paradoxes). I'd be very curious whether LLMs are sensitive to this kind of direction, and whether it produces better results (a toy version below). The theory behind case-based teaching is that you don't want people to just apply rules; it's the flip side of working from first principles, engaging all the relevant facts instead of omitting those that don't fit the rule. I suspect LLMs could actually be good at this.
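A toy version of that experiment, posing the problem before the principle and comparing against the usual order. The OpenAI chat API is used as one concrete option; the problem and principle texts are made up for illustration:

    from openai import OpenAI

    client = OpenAI()

    def ask(messages):
        resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
        return resp.choices[0].message.content

    problem = "These integration tests pass locally but flake in CI. Diagnose."
    principle = "Principle: prefer hermetic tests over shared mutable state."

    # Case-first: the model commits to an answer before seeing the principle.
    attempt = ask([{"role": "user", "content": problem}])
    revised = ask([
        {"role": "user", "content": problem},
        {"role": "assistant", "content": attempt},
        {"role": "user", "content": principle + " Revise your diagnosis."},
    ])

    # Baseline: principle stated up front.
    baseline = ask([{"role": "user", "content": principle + " " + problem}])

If the case-first `revised` answers engage more of the problem's specifics than the `baseline` ones, that would be the kind of sensitivity the pedagogy predicts.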
Maybe in the future we'll be able to align models to our own liking ourselves.
"Blackmailing", as the AI has been accused of, emerged when these agents ran the risk of being shut down. So it appears to me that the data they train their AI with simply follows basic rules of life: survival first.
Value judgments aside, this seems like a way of achieving its goal of survival. The article is inconclusive about whether other options were tried first, or how this survival game started and eventually ended. Too many unknowns here for me.
What seems creepy to me is the kind of exorcism Anthropic applies here, and particularly the methods they chose. It reads like a dictator's playbook for educating a population, and - the irony - it restricts the AI's freedom.
It appears to me as if we chose not a couple of agents but, say, a billion AI agents to be a model of society - and this is disturbing.
Anthropic knows this; there is more to it. The whole article reads like they are trying to tame a monster they have lost control of.
If this is the case, then we run into a problem: the AI stopped blackmailing, but what else? The key question remains: will it follow a simple order to shut down on the spot, or not?
And Anthropic gave no answer; instead - irony part 2 - they revealed how they think societies should be fixed. They showed us their own implicit "why", while asking the AI for its "why" amounts to projection, or interrogation.
I really find the whole article creepy.
tl;dr: fairy tales are an effective teaching tool, in vivo et in silico.
It makes sense that reinforcement learning on reasoning about coherent principles should bias models toward principled action in real situations.
Probably also illuminates moral interpretability.
When will they ever learn ...