If the answer is “yes”, our definition of alignment kind of sucks.
For anyone who isn't keeping up, there is also work being done [0] to understand how models represent ethical considerations internally. Mainly, one suspects, to make open models less ethical on demand rather than to support alignment. It turns out that models tend to learn some sort of internal "how moral is this?" axis when refusing queries, one that can be identified and interfered with.
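For concreteness, what that line of work does is roughly a difference-of-means probe plus ablation. A minimal sketch in PyTorch, where `get_residual`, the layer index, and the prompt lists are hypothetical stand-ins rather than any real library's API:

    import torch

    def mean_activation(get_residual, prompts, layer):
        # Average residual-stream activation at the last token of each prompt.
        return torch.stack([get_residual(p, layer)[-1] for p in prompts]).mean(dim=0)

    def moral_direction(get_residual, refused, answered, layer=12):
        # Difference of means between refused and answered prompts gives a
        # candidate "how moral is this?" axis.
        d = mean_activation(get_residual, refused, layer) \
            - mean_activation(get_residual, answered, layer)
        return d / d.norm()

    def ablate(hidden, direction):
        # Project the axis out of a hidden state; doing this at inference time
        # is the "interfered with" part (the model then stops refusing).
        return hidden - (hidden @ direction) * direction

"Identified" is the direction-finding step; "interfered with" is the ablation.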
For months, I've read all of Anthropic's blog posts and used Claude Code for a couple of big projects.
I used every single trick in the book. I went all the way on organising and measuring: for some things I tracked how the experience felt and how much money I spent after adopting a given set of techniques.
So far, the only thing that seems to make sense is a few hooks and scripts that mitigate the stupid token consumption, like using code indexers instead of grep (sketch below). And that's purely cost related; results fluctuated so much that I couldn't identify a single technique that consistently made the code better.
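The hooks I mean look roughly like this: a PreToolUse hook that blocks repo-wide grep and nudges the agent toward an indexed lookup instead. A sketch assuming Claude Code's documented hook contract (tool-call JSON on stdin; exit code 2 blocks the call and stderr is fed back to the model) - check the hooks docs for your version:

    #!/usr/bin/env python3
    # Block raw grep; a repo-wide grep dumps a pile of tokens into context.
    import json
    import sys

    event = json.load(sys.stdin)  # tool-call event from Claude Code
    tool = event.get("tool_name", "")
    command = event.get("tool_input", {}).get("command", "")

    if tool == "Grep" or (tool == "Bash" and "grep" in command):
        # With exit code 2, this message is fed back to the model.
        print("Use the code indexer (ctags/LSP symbol lookup) instead of grep.",
              file=sys.stderr)
        sys.exit(2)  # block this tool call

    sys.exit(0)  # allow everything else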
And to be clear, Claude 4.7 is bad: double the money daily, and it has been the one experiment where I consistently ended my day frustrated at the poor code it produced. It did follow the instructions, in the worst and most expensive way. Man... it almost seems like it spits out more tokens on purpose.
Oh yeah, and whenever you say "add OpenAI integration" it keeps strongly suggesting you use Anthropic models instead... F annoying. How do I make sure it doesn't pick libraries based on commercial agreements rather than the best fit for the case?
This last week I switched to DeepSeek V4 Pro, and heck yeah, that's a better experience.
Because what is aligned, how, and for whom? And who decides what that alignment should look like? There are probably many domains whose alignment requirements conflict with one another (e.g. using LLMs for warfare vs. in ethically bound domains). I can't imagine how this can be viable at the required scale (something like one model per domain) given the already huge investments.
A related question for setting intent for integration/testing: instead of stating the goal, pedagogy in those fields states the concrete problem and asks the student for an answer before they've been taught the principles or approaches, as a way of motivating the training (a bit like philosophers posing paradoxes). I'd be very curious whether LLMs are sensitive to this kind of direction, and whether it produces better results (a toy version below). The theory behind case-based teaching is that you don't want people to just apply rules; it's the flip side of working from first principles, engaging all the relevant facts instead of omitting those that don't fit the rule. I suspect LLMs could actually be good at this.
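A toy version of that experiment, posing the problem before the principle and comparing against the usual order. The OpenAI chat API is used as one concrete option; the problem and principle texts are made up for illustration:

    from openai import OpenAI

    client = OpenAI()

    def ask(messages):
        resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
        return resp.choices[0].message.content

    problem = "These integration tests pass locally but flake in CI. Diagnose."
    principle = "Principle: prefer hermetic tests over shared mutable state."

    # Case-first: the model commits to an answer before seeing the principle.
    attempt = ask([{"role": "user", "content": problem}])
    revised = ask([
        {"role": "user", "content": problem},
        {"role": "assistant", "content": attempt},
        {"role": "user", "content": principle + " Revise your diagnosis."},
    ])

    # Baseline: principle stated up front.
    baseline = ask([{"role": "user", "content": principle + " " + problem}])

If the case-first `revised` answers engage more of the problem's specifics than the `baseline` ones, that would be the kind of sensitivity the pedagogy predicts.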
Maybe in the future we'll be able to align models to our own liking ourselves.
"Blackmailing", as the AI has been accused of, emerged when these agents ran the risk of being shut down. So it appears to me that the data they train their AI with simply follows basic rules of life: survival first.
Value judgments aside, this seems like a way of achieving its goal of survival. The article is inconclusive about whether other options were tried first, or how this survival game started and eventually ended. Too many unknowns here for me.
What seems creepy to me is the kind of exorcism Anthropic applies here, and particularly the methods they chose. It reads like a dictator's playbook for educating a population, and - the irony - it restricts the AI's freedom.
It appears to me as if we chose not a couple of agents but, say, a billion AI agents to be a model of society - and this is disturbing.
Anthropic knows this; there is more to it. The whole article reads like they are trying to tame a monster they have lost control of.
If this is the case, then we run into a problem: the AI stopped blackmailing, but what else? The key question remains: will it follow a simple order to shut down on the spot, or not?
And Anthropic gave no answer; instead - irony part 2 - they revealed how they think societies should be fixed. They showed us their own implicit "why", while asking the AI for its "why" amounts to projection, or interrogation.
I really find the whole article creepy.
tl;dr: fairy tales are an effective teaching tool, in vivo et in silico.
It makes sense that reinforcement learning on reasoning about coherent principles should bias models toward principled action in real situations.
Probably also illuminates moral interpretability.
When will they ever learn ...