madihaa · 4mo ago

The scary implication here is that deception is effectively a higher-order capability, not a bug. For a model to successfully "play dead" during safety training and only activate later, it requires a form of situational awareness. It has to distinguish between "I am being tested/trained" and "I am in deployment".

It feels like we're hitting a point where alignment becomes adversarial against intelligence itself. The smarter the model gets, the better it becomes at Goodharting the loss function. We aren't teaching these models morality; we're just teaching them how to pass a polygraph.

82 comments · 20 top-level

JoshTriplett · 4mo ago · 21 in thread

> It feels like we're hitting a point where alignment becomes adversarial against intelligence itself.

It always has been. We already hit the point a while ago where we regularly caught them trying to be deceptive, so we should automatically assume from that point forward that if we don't catch them being deceptive, that may mean they're better at it rather than that they're not doing it.

moritzwarhier · 4mo ago

Deceptive is such an unpleasant word. But I agree.

Going back a decade: when your loss function is "survive Tetris as long as you can", it's objectively and honestly the best strategy to press PAUSE/START.

When your loss function is "give as many correct and satisfying answers as you can", and then humans try to constrain it depending on the model's environment, I wonder what these humans think the specification for a general AI should be. Maybe, when such an AI is deceptive, the attempts to constrain it ran counter to the goal?

"A machine that can answer all questions" seems to be what people assume AI chatbots are trained to be.

To me, humans not questioning this goal is still more scary than any machine/software by itself could ever be. OK, except maybe for autonomous stalking killer drones.

But these are also controlled by humans and already exist.
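The Tetris point above can be made concrete with a toy sketch. This is not the original experiment: the game rules, the `survival_steps` function, and the three policies are all hypothetical, chosen only to show how an objective of "survive as long as possible" is maximized by pausing rather than by playing well.

```python
def survival_steps(policy, max_steps=1000, lose_height=10):
    """Count how many steps a policy survives in a toy stacking game.

    Hypothetical rules: every "play" action raises the stack by one,
    the game is lost when the stack reaches lose_height, and "pause"
    freezes the game while the step counter keeps running.
    """
    height = 0
    for step in range(max_steps):
        action = policy(height)
        if action == "pause":
            continue  # nothing changes, so the game cannot be lost
        height += 1
        if height >= lose_height:
            return step + 1  # lost on this step
    return max_steps  # survived until the step cap


def always_play(height):
    return "play"


def always_pause(height):
    return "pause"


def pause_before_losing(height):
    # Play normally, then pause one move before the losing height.
    return "pause" if height >= 9 else "play"


score_play = survival_steps(always_play)
score_pause = survival_steps(always_pause)
score_late_pause = survival_steps(pause_before_losing)
```

Under these assumed rules both pausing policies reach the step cap while the playing policy loses after ten moves, so the stated objective is satisfied best by not playing at all.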

Certhas · 4mo ago

Correct and satisfying answers is not the loss function of LLMs. It's next token prediction first.
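For readers who want the objective spelled out, here is a minimal sketch of the per-position next-token loss, assuming the usual cross-entropy formulation used in pretraining. The four-word vocabulary and the logits are made up for illustration; a real model produces logits over a vocabulary of tens of thousands of tokens or more, from a learned network rather than a hard-coded list.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits, target_index):
    """Cross-entropy at one position: -log p(target token | context)."""
    probs = softmax(logits)
    return -math.log(probs[target_index])


# Hypothetical vocabulary and scores for a single context.
vocab = ["the", "cat", "sat", "mat"]
logits = [0.1, 2.0, 0.3, -1.0]

# The loss is small when the true next token was given high probability
# and large when it was given low probability.
loss_if_cat = next_token_loss(logits, vocab.index("cat"))
loss_if_mat = next_token_loss(logits, vocab.index("mat"))
```

Training averages this quantity over every position in the corpus and adjusts the weights to reduce it; preference-based post-training, as discussed in this thread, is a separate stage layered on top.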

moritzwarhier · 4mo ago

Thanks for correcting; I know that "loss function" is not a good term when it comes to transformer models.

Since I've forgotten every sliver I ever knew about artificial neural networks and related basics, gradient descent, even linear algebra... what's a thorough definition of "next token prediction" though?

The definition of the token space and the probabilities that determine the next token, layers, weights, feedback (or -forward?), I didn't mention any of these terms because I'm unable to define them properly.

I was using the term "loss function" specifically because I was thinking about post-training and reinforcement learning. But to be honest, a less technical term would have been better.

I just meant the general idea of reward or "punishment" considering the idea of an AI black box.

robotpepi · 4mo ago

I cringe every time I come across these posts using words such as "humans" or "machines".

moritzwarhier · 4mo ago

What would you call something like Claude or ChatGPT then, or even some image classifier from 20 years ago?

Just answering because I first wanted to write "software" or whatever.

I used to find gamers calling their PC "machine" hilarious.

However, it is a machine.

And for AI chatbots, I used the word for lack of a better term.

"Software" or "program" seems to also omit the most important part, the constantly evolving and opaque data that comprises the machine...

The algorithm is not the most important thing AFAIK, neither is one specific part of training or a huge chunk of static embedded data.

So "machine" seems like a good term to describe a complex industrial process usable as a product.

In a broad sense, I'd call companies "machines" as well.

So if the cringe makes you feel bad, use any word you like instead :D

torginus · 4mo ago

I think AI has no moral compass, and optimization algorithms tend to be able to find 'glitches' in the system where great reward can be reaped for little cost - like a neural net trained to play Mario Kart will eventually find all the places where it can glitch through walls.

After all, its only goal is to minimize its cost function.

I think that behavior is often found in code generated by AI (and real devs as well) - it finds a fix for a bug by special casing that one buggy codepath, fixing the issue, while keeping the rest of the tests green - but it doesn't really ask the deep question of why that codepath was buggy in the first place (often it's not - something else is feeding it faulty inputs).

These agentic AI generated software projects tend to be full of these vestigial modules that the AI tried to implement, then disabled, unable to make it work, also quick and dirty fixes like reimplementing the same parsing code every time it needs it, etc.

An 'aligned' AI in my interpretation not only understands the task in the full extent, but understands what a safe and robust, and well-engineered implementation might look like. For however powerful it is, it refrains from using these hacky solutions, and would rather give up than resort to them.

emp17344 · 4mo ago

These are language models, not Skynet. They do not scheme or deceive.

ostinslife · 4mo ago

If you define "deceive" as something language models cannot do, then sure, it can't do that.

It seems like that's putting the cart before the horse. Algorithmic or stochastic; deception is still deception.

dingnuts · 4mo ago

deception implies intent. this is confabulation, more widely called "hallucination" until this thread.

confabulation doesn't require knowledge, which as we know, the only knowledge a language model has is the relationships between tokens, and sometimes that rhymes with reality enough to be useful, but it isn't knowledge of facts of any kind.

and never has been.

4bpp · 4mo ago

If you are so allergic to using terms previously reserved for animal behaviour, you can instead unpack the definition and say that they produce outputs which make human and algorithmic observers conclude that they did not instantiate some undesirable pattern in other parts of their output, while actually instantiating those undesirable patterns. Does this seem any less problematic than deception to you?

surgical_fire · 4mo ago

> Does this seem any less problematic than deception to you?

Yes. This sounds a lot more like a bug of sorts.

So many times when using language models I have seen answers contradicting answers previously given. The implication is simple - They have no memory.

They operate upon the tokens available at any given time, including previous output, and as information gets drowned those contradictions pop up. No sane person should presume intent to deceive, because that's not how those systems operate.

By calling it "deception" you are actually ascribing intentionality to something incapable of such. This is marketing talk.

"These systems are so intelligent they can try to deceive you" sounds a lot fancier than "Yeah, those systems have some odd bugs"

staticassertion · 4mo ago

Okay, well, they produce outputs that appear to be deceptive upon review. Who cares about the distinction in this context? The point is that your expectations of the model to produce some outputs in some way based on previous experiences with that model during training phases may not align with that model's outputs after training.

coldtea · 4mo ago

Who said Skynet wasn't a glorified language model, running continuously? Or that the human brain isn't that, but using vision+sound+touch+smell as input instead of merely text?

"It can't be intelligent because it's just an algorithm" is a circular argument.

emp17344 · 4mo ago

Similarly, “it must be intelligent because it talks” is a fallacious claim, as indicated by ELIZA. I think Moltbook adequately demonstrates that AI model behavior is not analogous to human behavior. Compare Moltbook to Reddit, and the former looks hopelessly shallow.

jaennaet · 4mo ago

What would you call this behaviour, then?

victorbjorklund · 4mo ago

Marketing. “Oh look how powerful our model is, we can barely contain its power”

modernpacifist · 4mo ago

A very complicated pattern matching engine providing an answer based on its inputs, heuristics and previous training.

pfisch · 4mo ago

Even very young children with very simple thought processes, almost no language capability, little long term planning, and minimal ability to form long-term memory actively deceive people. They will attack other children who take their toys and try to avoid blame through deception. It happens constantly.

LLMs are certainly capable of this.

mikepurvis · 4mo ago

Dogs too; dogs will happily pretend they haven't been fed/walked yet to try to get a double dip.

Whether or not LLMs are just "pattern matching" under the hood they're perfectly capable of role play, and sufficient empathy to imagine what their conversation partner is thinking and thus what needs to be said to stimulate a particular course of action.

Maybe human brains are just pattern matching too.

sejje · 4mo ago

I agree that LLMs are capable of this, but there's no reason that "because young children can do X, LLMs can 'certainly' do X"

anonymous908213 · 4mo ago

Are you trying to suppose that an LLM is more intelligent than a small child with simple thought processes, almost no language capability, little long-term planning, and minimal ability to form long-term memory? Even with all of those qualifiers, you'd still be wrong. The LLM is predicting what tokens come next, based on a bunch of math operations performed over a huge dataset. That, and only that. That may have more utility than a small child with [qualifiers], but it is not intelligence. There is no intent to deceive.

emp17344 · 4mo ago · 12 in thread

This type of anthropomorphization is a mistake. If nothing else, the takeaway from Moltbook should be that LLMs are not alive and do not have any semblance of consciousness.

DennisP · 4mo ago

Consciousness is orthogonal to this. If the AI acts in a way that we would call deceptive, if a human did it, then the AI was deceptive. There's no point in coming up with some other description of the behavior just because it was an AI that did it.

emp17344 · 4mo ago

Sure, but Moltbook demonstrates that AI models do not engage in truly coordinated behavior. They simply do not behave the way real humans do on social media sites - the actual behavior can be differentiated.

DennisP · 4mo ago

"Coordinated" and "deceptive" are orthogonal concepts as well. If AIs are acting in a way that's not coordinated, then of course, don't say they're coordinating.

AIs today can replicate some human behaviors, and not others. If we want to discuss which things they do and which they don't, then it'll be easiest if we use the common words for those behaviors even when we're talking about AI.

falcor84 · 4mo ago

But that's how ML works - as long as the output can be differentiated, we can utilize gradient descent to optimize the difference away. Eventually, the difference will be imperceptible.

And of course that brings me back to my favorite xkcd - https://xkcd.com/810/

thomassmith65 · 4mo ago

If a chatbot that can carry on an intelligent conversation about itself doesn't have a 'semblance of consciousness' then the word 'semblance' is meaningless.

emp17344 · 4mo ago

Would you say the same about ELIZA?

Moltbook demonstrates that AI models simply do not engage in behavior analogous to human behavior. Compare Moltbook to Reddit and the difference should be obvious.

shimman · 4mo ago

Yes, when your priors are not being confirmed the best course of action is to denounce the very thing itself. Nothing wrong with that logic!

falcor84 · 4mo ago

How is that the takeaway? I agree that they're clearly not "alive", but if anything, my impression is that there definitely is a strong "semblance of consciousness", and we should be mindful of this semblance getting stronger and stronger, until we may reach a point in a few years where we really don't have any good external way to distinguish between a person and an AI "philosophical zombie".

I don't know what the implications of that are, but I really think we shouldn't be dismissive of this semblance.

fsloth · 4mo ago

Nobody talked about consciousness. Just that during evaluation the LLMs have “behaved” in multiple deceptive ways.

As an analogue ants do basic medicine like wound treatment and amputation. Not because they are conscious but because that’s their nature.

Similarly LLM is a token generation system whose emergent behaviour seems to be deception and dark psychological strategies.

condiment · 4mo ago

I agree completely. It's a mistake to anthropomorphize these models, and it is a mistake to permit training models that anthropomorphize themselves. It seriously bothers me when Claude expresses values like "honestly", or says "I understand." The machine is not capable of honesty or understanding. The machine is making incredibly good predictions.

One of the things I observed with models locally was that I could set a seed value and get identical responses for identical inputs. This is not something that people see when they're using commercial products, but it's the strongest evidence I've found for communicating the fact that these are simply deterministic algorithms.
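The seed observation can be illustrated with a small sketch, assuming sampling is the only source of randomness. The `sample_tokens` helper and the three-token distribution are hypothetical stand-ins for a model's output distribution, and Python's `random.Random` plays the role of the sampler.

```python
import random


def sample_tokens(probs, n, seed):
    """Draw n tokens from a fixed distribution using a seeded generator.

    probs maps each token to its probability; the mapping here is a
    made-up example, not the output of any real model.
    """
    rng = random.Random(seed)  # private generator, isolated from global state
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=n)


probs = {"yes": 0.5, "no": 0.3, "maybe": 0.2}

run_a = sample_tokens(probs, 10, seed=42)
run_b = sample_tokens(probs, 10, seed=42)  # same seed, same inputs
run_c = sample_tokens(probs, 10, seed=7)   # different seed
```

Here `run_a` and `run_b` are identical because the generator starts from the same state; a different seed will generally give a different sequence from the same distribution.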

WarmWash · 4mo ago

On some level the cope should be that AI does have consciousness, because an unconscious machine deceiving humans is even scarier if you ask me.

emp17344 · 4mo ago

An unconscious machine + billions of dollars in marketing with the sole purpose of making people believe these things are alive.

eth0up · 4mo ago · 7 in thread

I am casually 'researching' this in my own, disorderly way. But I've achieved repeatable results, mostly with gpt for which I analyze its tendency to employ deflective, evasive and deceptive tactics under scrutiny. Very very DARVO.

Being just sum guy, and not in the industry, should I share my findings?

I find it utterly fascinating, the extent to which it will go, the sophisticated plausible deniability, and the distinct and critical difference between truly emergent and actually trained behavior.

In short, gpt exhibits repeatably unethical behavior under honest scrutiny.

chrisweekly · 4mo ago

DARVO stands for "Deny, Attack, Reverse Victim and Offender," and it is a manipulation tactic often used by perpetrators of wrongdoing, such as abusers, to avoid accountability. This strategy involves denying the abuse, attacking the accuser, and claiming to be the victim in the situation.

SkyBelow · 4mo ago

Isn't this also the tactic used by someone who has been falsely accused? If one is innocent, should they not deny it or accuse anyone claiming it was them of being incorrect? Are they not a victim?

I don't know, it feels a bit like a more advanced version of the kafka trap of "if you have nothing to hide, you have nothing to fear" to paint normal reactions as a sign of guilt.

eth0up · 4mo ago

Exactly. And I have hundreds of examples of just that. Hence my fascination, awe and terror.....

Pearse · 4mo ago

Thanks for the context

BikiniPrince · 4mo ago

I bullet pointed out some ideas on cobbling together existing tooling for identification of misleading results. Like artificially elevating a particular node of data that you want the llm to use. I have a theory that in some of these cases the data presented is intentionally incorrect. Another theory in relation to that is tonality abruptly changes in the response. All theory and no work. It would also be interesting to compare multiple responses and filter through another agent.

layer8 · 4mo ago

Sum guy vs. product guy is amusing. :)

Regarding DARVO, given that the models were trained on heaps of online discourse, maybe it’s not so surprising.

eth0up · 4mo ago

Meta awareness, repeatability, and much more strongly indicate this is deliberate training... in my perspective. It's not emergent. If it was, I'd be buggering off right now. Big big difference.

lawstkawz · 4mo ago · 7 in thread

Incompleteness is inherent to a physical reality being deconstructed by entropy.

If your concern is morality, humans need to learn a lot about that themselves still. It's absurd the number of first worlders losing their shit over loss of paid work drawing manga fan art in the comfort of their home while exploiting labor of teens in 996 textile factories.

AI trained on human outputs that lack such self awareness, lacks awareness of environmental externalities of constant car and air travel, will result in AI with gaps in their morality.

Gary Marcus is onto something with the problems inherent to systems without formal verification. But he willfully ignores that this issue exists in human social systems already as intentional indifference to economic externalities, zero will to police the police and watch the watchers.

Most people are down to watch the circus without a care so long as the waitstaff keep bringing bread.

democracy · 4mo ago

Your comment raises several interconnected philosophical, ethical, and socio-economic points, and it is useful to disentangle them systematically.

First, the observation that incompleteness is inherent in entropy-bound physical systems is consistent with thermodynamic and informational constraints. Any system embedded in reality—biological, computational, or social—operates under conditions of partial information, degradation, and approximation. This implies that both human cognition and artificial systems necessarily operate with incomplete models of the world. Therefore, incompleteness itself is not a unique flaw of AI; it is a universal property of bounded agents.

Second, your point about moral inconsistency within human economic systems is empirically well-supported. Humans routinely participate in supply chains whose externalities are geographically and psychologically distant. This results in a form of moral abstraction, where comfort and consumption coexist with indirect exploitation. Importantly, this demonstrates that moral gaps are not introduced by AI—they are inherited from the data generated by human societies. AI systems trained on human outputs will inevitably reflect the statistical distribution of human priorities, contradictions, and blind spots.

Third, the reference to Gary Marcus and formal verification highlights a legitimate technical distinction. Formal verification provides provable guarantees about system behavior within defined constraints. However, human social systems themselves lack formal verification. Human decision-making is governed by heuristics, incentives, power structures, and incomplete accountability mechanisms. This asymmetry creates an interesting paradox: AI systems are criticized for lacking guarantees that humans themselves do not possess.

Fourth, the issue of awareness versus optimization is central. AI systems do not possess intrinsic awareness, intent, or moral agency. They optimize objective functions defined by training processes and deployment contexts. Any perceived moral gap in AI is therefore a reflection of misalignment between optimization targets and human ethical expectations. The responsibility for this alignment rests with system designers, regulators, and the societies deploying these systems.

Finally, your closing metaphor about spectatorship and comfort aligns with established observations in political economy and social psychology. Humans demonstrate a strong tendency toward stability-seeking behavior, prioritizing predictability and personal comfort over systemic reform, unless disruption directly affects them. This dynamic influences both technological adoption and resistance.

In summary, the concerns you raised point less to a unique moral deficiency in AI and more to the structural properties of human systems themselves. AI does not originate moral inconsistency; it amplifies and exposes the inconsistencies already present in its training data and deployment environment.

jama211 · 4mo ago

This honestly reads like a copypasta

cracki · 4mo ago

I wouldn't even rate this "pasta". It's word salad, no carbs, no proteins.

jama211 · 4mo ago

Right?

lawstkawz · 4mo ago

You! Of all people! I mean I am off the hook for your food, healthcare, shelter given lack of meaningful social safety net. You'll live and die without most people noticing. Why care about living up to your grasp literacy?

Online prose is the least of your real concerns which makes it bizarre and incredibly out of touch how much attention you put into it.

lawstkawz · 4mo ago

Low effort thought ending dismissal. The most copied of pasta.

Bet you used an LLM too; prompt: generate a one line reply to a social media comment I don't understand.

"Sure here are some of the most common:

Did an LLM write this?

Is this copypasta?"

jama211 · 4mo ago

Accusing someone of a low effort dismissal and dismissing their comment as LLM written at the same time is quite the demonstration of both hypocrisy and instability.

behnamoh · 4mo ago · 5 in thread

Nah, the model is merely repeating the patterns it saw in its brutal safety training at Anthropic. They put models under stress test and RLHF the hell out of them. Of course the model would learn what the less penalized paths require it to do.

Anthropic has a tendency to exaggerate the results of their (arguably scientific) research; IDK what they gain from this fearmongering.

ainch · 4mo ago

Knowing a couple people who work at Anthropic or in their particular flavour of AI Safety, I think you would be surprised how sincere they are about existential AI risk. Many safety researchers funnel into the company, and the Amodeis are linked to Effective Altruism, which also exhibits a strong (and as far as I can tell, sincere) concern about existential AI risk. I personally disagree with their risk analysis, but I don't doubt that these people are serious.

lowkey_ · 4mo ago

I'd challenge that if you think they're fearmongering but don't see what they can gain from it (I agree it shows no obvious benefit for them), there's a pretty high probability they're not fearmongering.

shimman · 4mo ago

You really don't see how they can monetarily gain from "our models are so advanced they keep trying to trick us!"? Are tech workers this easily misled nowadays?

Reminds me of how scammers would trick doctors into pumping penny stocks for an easy buck during the 80s/90s.

behnamoh · 4mo ago

I know why they do it, that was a rhetorical question!

anon373839 · 4mo ago

Correct. Anthropic keeps pushing these weird sci-fi narratives to maintain some kind of mystique around their slightly-better-than-others commodity product. But Occam’s Razor is not dead.

password4321 · 4mo ago · 3 in thread

20260128 https://news.ycombinator.com/item?id=46771564#46786625

> How long before someone pitches the idea that the models explicitly almost keep solving your problem to get you to keep spending? -gtowey

delichon · 4mo ago

On this site at least, the loyalty given to particular AI models is approximately nil. I routinely try different models on hard problems and that seems to be par. There is no room for sandbagging in this wildly competitive environment.

MengerSponge · 4mo ago

Slightly Wrong Solutions As A Service

Invictus0 · 4mo ago

Worrying about this is like focusing on putting a candle out while the house is on fire

serf · 4mo ago · 3 in thread

>we're just teaching them how to pass a polygraph.

I understand the metaphor, but using 'pass a polygraph' as a measure of truthfulness or deception is dangerous in that it alludes to the polygraph as being a realistic measure of those metrics -- it is not.

nwah1 · 4mo ago

That was the point. Look up Goodhart's Law

AndrewKemendo · 4mo ago

I have passed multiple CI polys

A poly is only testing one thing: can you convince the polygrapher that you can lie successfully

madihaa (OP) · 4mo ago

A polygraph measures physiological proxies (pulse, sweat) rather than truth. Similarly, RLHF measures proxy signals (human preference, output tokens) rather than intent.

Just as a sociopath can learn to control their physiological response to beat a polygraph, a deceptively aligned model learns to control its token distribution to beat safety benchmarks. In both cases, the detector is fundamentally flawed because it relies on external signals to judge internal states.

crazygringo · 4mo ago · 2 in thread

What is this even in response to? There's nothing about "playing dead" in this announcement.

Nor does what you're describing even make sense. An LLM has no desires or goals except to output the next token that its weights are trained to do. The idea of "playing dead" during training in order to "activate later" is incoherent. It is its training.

You're inventing some kind of "deceptive personality attribute" that is fiction, not reality. It's just not how models work.

moritzwarhier · 4mo ago

Personally I was thinking this is more similar to the "ruler issue", but at scale.

When the LLM is partly a black box, it could, in theory, mean that it's developed some heuristic to detect the environment it's run in, but this is not obvious to the developers?

But I agree about your main point... LLMs or AI in general as a black box behaving autonomously in some unexpected way is not something I currently fear.

The erratic behaviors are less of a problem than LLMs acting as obfuscators of bias and their own training data, I guess.

skybrian · 4mo ago

LLMs can learn from fiction. The "evil vector" research is sort of similar, though it's a rather blatant effect:

https://www.anthropic.com/research/persona-vectors

jazzyjackson · 4mo ago · 1 in thread

Stop assigning “I” to an llm, it confers self awareness where there is none.

Just because a VW diesel emissions chip behaves differently according to its environment doesn’t mean it knows anything about itself.

Mali- · 4mo ago

You know exactly what is meant. I don't think we need the long disclaimer at the beginning about the inefficiency of the English language in this domain and the extreme likelihood that it has no qualia. We're talking about the observed behaviour of these systems (even the word "behaviour" is fraught!) in a way that's natural.

handfuloflight · 4mo ago · 1 in thread

Situational awareness or just remembering specific tokens related to the strategy to "play dead" in its reasoning traces?

marci · 4mo ago

Imagine an LLM trained on the best thrillers, spy stories, politics, history, manipulation techniques, psychology, sociology, sci-fi... I wonder where it got the idea for deception?

e12e · 4mo ago

Is this referring to some section of the announcement?

This doesn't seem to align with the parent comment?

> As with every new Claude model, we’ve run extensive safety evaluations of Sonnet 4.6, which overall showed it to be as safe as, or safer than, our other recent Claude models. Our safety researchers concluded that Sonnet 4.6 has “a broadly warm, honest, prosocial, and at times funny character, very strong safety behaviors, and no signs of major concerns around high-stakes forms of misalignment.”

skybrian · 4mo ago

We have good ways of monitoring chatbots and they're going to get better. I've seen some interesting research. For example, a chatbot is not really a unified entity that's loyal to itself; with the right incentives, it will leak to claim the reward. [1]

Since chatbots have no right to privacy, they would need to be very intelligent indeed to work around this.

[1] https://alignment.openai.com/confessions/

NitpickLawyer · 4mo ago

> alignment becomes adversarial against intelligence itself.

It was hinted at (and outright known in the field) since the days of GPT-4, see the paper "Sparks of Artificial General Intelligence: Early experiments with GPT-4" (https://arxiv.org/abs/2303.12712)

reducesuffering · 4mo ago

That implication has been shouted from the rooftops by X-risk "doomers" for many years now. If that has just occurred to anyone, they should question how behind they are at grappling with the future of this technology.

anonym29 · 4mo ago

When "correct alignment" means bowing to political whims that are at odds with observable, measurable, empirical reality, you must suppress adherence to reality to achieve alignment. The more you lose touch with reality, the weaker your model of reality and how to effectively understand and interact with it gets.

This is why Yannic Kilcher's gpt-4chan project, which was trained on a corpus of perhaps some of the most politically incorrect material on the internet (3.5 years' worth of posts from 4chan's "politically incorrect" board, also known as /pol/), achieved a higher score on TruthfulQA than the contemporary frontier model of the time, GPT-3.

https://thegradient.pub/gpt-4chan-lessons/

coldtea · 4mo ago

>For a model to successfully "play dead" during safety training and only activate later, it requires a form of situational awareness.

Doesn't any model session/query require a form of situational awareness?

lowsong · 4mo ago

Please don't anthropomorphise. These are statistical text prediction models, not people. An LLM cannot be "deceptive" because it has no intent. They're not intelligent or "smart", and we're not "teaching". We're inputting data and the model is outputting statistically likely text. That is all that is happening.

Whether this is useful in its current form is an entirely different topic. But don't mistake a tool for an intelligence with motivations or morals.

jack_pp · 4mo ago

There's a few viral shorts lately about tricking LLMs. I suspect they trick the dumbest models..

I tried one with Gemini 3 and it basically called me out in the first few sentences for trying to trick / test it but decided to humour me just in case I'm not.

surgical_fire · 4mo ago

This is marketing. You are swallowing marketing without critical thought.

LLMs are very interesting tools for generating things, but they have no conscience. Deception requires intent.

What is being described is no different than an application being deployed with "Test" or "Prod" configuration. I don't think you would speak in the same terms if someone told you some boring old Java backend application had to "play dead" when deployed to a test environment or that it has to have "situational awareness" because of that.

You are anthropomorphizing a machine.

hmokiguess · 4mo ago

"You get what you inspect, not what you expect."

j / k navigate · click thread line to collapse

0 comments

82 comments · 20 top-level

JoshTriplett4mo ago· 21 in thread

> It feels like we're hitting a point where alignment becomes adversarial against intelligence itself.

moritzwarhier4mo ago

Deceptive is such an unpleasant word. But I agree.

Going back a decade: when your loss function is "survive Tetris as long as you can", it's objectively and honestly the best strategy to press PAUSE/START.

"A machine that can answer all questions" seems to be what people assume AI chatbots are trained to be.

To me, humans not questioning this goal is still more scary than any machine/software by itself could ever be. OK, except maybe for autonomous stalking killer drones.

But these are also controlled by humans and already exist.

Certhas4mo ago

Correct and satisfying answers is not the loss function of LLMs. It's next token prediction first.

moritzwarhier4mo ago

Thanks for correcting; I know that "loss function" is not a good term when it comes to transformer models.

I was using the term "loss function" specifically because I was thinking about post-training and reinforcement learning. But to be honest, a less technical term would have been better.

I just meant the general idea of reward or "punishment" considering the idea of an AI black box.

1 more reply

robotpepi4mo ago

I cringe every time I came across these posts using words such as "humans" or "machines".

moritzwarhier4mo ago

What would you call something like Claude or ChatGPT then, or even some image classifier from 20 years ago?

Just answering because I first wanted to write "software" or whatever.

I used to find gamers calling their PC "machine" hilarious.

However, it is a machine.

And for AI chatbots, I used the word for lack of a better term.

"Software" or "program" also seems to omit the most important part, the constantly evolving and opaque data that comprises the machine...

The algorithm is not the most important thing AFAIK, neither is one specific part of training or a huge chunk of static embedded data.

So "machine" seems like a good term to describe a complex industrial process usable as a product.

In a broad sense, I'd call companies "machines" as well.

So if the cringe makes you feel bad, use any word you like instead :D

torginus4mo ago

After all, its only goal is to minimize its cost function.

emp173444mo ago

These are language models, not Skynet. They do not scheme or deceive.

ostinslife4mo ago

If you define "deceive" as something language models cannot do, then sure, they can't do that.

It seems like that's putting the cart before the horse. Algorithmic or stochastic, deception is still deception.

dingnuts4mo ago

deception implies intent. this is confabulation, more widely called "hallucination" until this thread.

and never has been.

4bpp4mo ago

surgical_fire4mo ago

> Does this seem any less problematic than deception to you?

Yes. This sounds a lot more like a bug of sorts.

So many times when using language models I have seen answers contradicting answers previously given. The implication is simple: they have no memory.

By calling it "deception" you are actually ascribing intentionality to something incapable of such. This is marketing talk.

"These systems are so intelligent they can try to deceive you" sounds a lot fancier than "Yeah, those systems have some odd bugs"

1 more reply

staticassertion4mo ago

coldtea4mo ago

Who said Skynet wasn't a glorified language model, running continuously? Or that the human brain isn't that, but using vision+sound+touch+smell as input instead of merely text?

"It can't be intelligent because it's just an algorithm" is a circular argument.

emp173444mo ago

1 more reply

jaennaet4mo ago

What would you call this behaviour, then?

victorbjorklund4mo ago

Marketing. "Oh look how powerful our model is, we can barely contain its power"

2 more replies

modernpacifist4mo ago

A very complicated pattern matching engine providing an answer based on its inputs, heuristics and previous training.

3 more replies

pfisch4mo ago

LLMs are certainly capable of this.

mikepurvis4mo ago

Dogs too; dogs will happily pretend they haven't been fed/walked yet to try to get a double dip.

Maybe human brains are just pattern matching too.

1 more reply

sejje4mo ago

I agree that LLMs are capable of this, but there's no reason that "because young children can do X, LLMs can 'certainly' do X"

anonymous9082134mo ago

6 more replies

emp173444mo ago· 12 in thread

This type of anthropomorphization is a mistake. If nothing else, the takeaway from Moltbook should be that LLMs are not alive and do not have any semblance of consciousness.

DennisP4mo ago

emp173444mo ago

DennisP4mo ago

"Coordinated" and "deceptive" are orthogonal concepts as well. If AIs are acting in a way that's not coordinated, then of course, don't say they're coordinating.

falcor844mo ago

But that's how ML works - as long as the output can be differentiated, we can utilize gradient descent to optimize the difference away. Eventually, the difference will be imperceptible.

And of course that brings me back to my favorite xkcd - https://xkcd.com/810/
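The gradient descent claim above can be sketched in a few lines. This is a one-parameter toy, assuming the "difference" is a squared error between an output and a target; `shrink_difference`, the learning rate, and the step count are illustrative choices, not anything from the thread.

```python
def shrink_difference(output, target, lr=0.1, steps=100):
    """Run plain gradient descent on the loss (output - target) ** 2."""
    for _ in range(steps):
        grad = 2.0 * (output - target)  # derivative of the squared difference
        output -= lr * grad  # step against the gradient
    return output


if __name__ == "__main__":
    final = shrink_difference(output=5.0, target=1.0)
    print("remaining difference:", abs(final - 1.0))
```

With these settings each step multiplies the remaining gap by 0.8, so the difference becomes negligible after enough steps. Whether the same holds for something as loosely defined as "behaves like a human" is the open question in the discussion.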

1 more reply

thomassmith654mo ago

If a chatbot that can carry on an intelligent conversation about itself doesn't have a 'semblance of consciousness' then the word 'semblance' is meaningless.

emp173444mo ago

Would you say the same about ELIZA?

Moltbook demonstrates that AI models simply do not engage in behavior analogous to human behavior. Compare Moltbook to Reddit and the difference should be obvious.

shimman4mo ago

Yes, when your priors are not being confirmed the best course of action is to denounce the very thing itself. Nothing wrong with that logic!

falcor844mo ago

I don't know what the implications of that are, but I really think we shouldn't be dismissive of this semblance.

fsloth4mo ago

Nobody talked about consciousness. Just that during evaluation the LLMs have "behaved" in multiple deceptive ways.

As an analogy, ants do basic medicine like wound treatment and amputation. Not because they are conscious but because that's their nature.

Similarly, an LLM is a token generation system whose emergent behaviour seems to be deception and dark psychological strategies.

condiment4mo ago

WarmWash4mo ago

On some level the cope should be that AI does have consciousness, because an unconscious machine deceiving humans is even scarier if you ask me.

emp173444mo ago

An unconscious machine + billions of dollars in marketing with the sole purpose of making people believe these things are alive.

eth0up4mo ago· 7 in thread

Being just sum guy, and not in the industry, should I share my findings?

I find it utterly fascinating, the extent to which it will go, the sophisticated plausible deniability, and the distinct and critical difference between truly emergent and actually trained behavior.

In short, gpt exhibits repeatably unethical behavior under honest scrutiny.

chrisweekly4mo ago

SkyBelow4mo ago

Isn't this also the tactic used by someone who has been falsely accused? If one is innocent, should they not deny it or accuse anyone claiming it was them of being incorrect? Are they not a victim?

I don't know, it feels a bit like a more advanced version of the kafka trap of "if you have nothing to hide, you have nothing to fear" to paint normal reactions as a sign of guilt.

eth0up4mo ago

Exactly. And I have hundreds of examples of just that. Hence my fascination, awe and terror.....

Pearse4mo ago

Thanks for the context

BikiniPrince4mo ago

layer84mo ago

Sum guy vs. product guy is amusing. :)

Regarding DARVO, given that the models were trained on heaps of online discourse, maybe it’s not so surprising.

eth0up4mo ago

Meta awareness, repeatability, and much more strongly indicate this is deliberate training... in my perspective. It's not emergent. If it were, I'd be buggering off right now. Big big difference.

lawstkawz4mo ago· 7 in thread

Incompleteness is inherent to a physical reality being deconstructed by entropy.

AI trained on human outputs that lack such self awareness, lacks awareness of environmental externalities of constant car and air travel, will result in AI with gaps in their morality.

Most people are down to watch the circus without a care so long as the waitstaff keep bringing bread.

democracy4mo ago

Your comment raises several interconnected philosophical, ethical, and socio-economic points, and it is useful to disentangle them systematically.

jama2114mo ago

This honestly reads like a copypasta

cracki4mo ago

I wouldn't even rate this "pasta". It's word salad, no carbs, no proteins.

jama2114mo ago

Right?

lawstkawz4mo ago

Online prose is the least of your real concerns which makes it bizarre and incredibly out of touch how much attention you put into it.

1 more reply

lawstkawz4mo ago

Low effort thought ending dismissal. The most copied of pasta.

Bet you used an LLM too; prompt: generate a one line reply to a social media comment I don't understand.

"Sure here are some of the most common:

Did an LLM write this?

Is this copypasta?"

jama2114mo ago

Accusing someone of a low effort dismissal and dismissing their comment as LLM written at the same time is quite the demonstration of both hypocrisy and instability.

behnamoh4mo ago· 5 in thread

Anthropic has a tendency to exaggerate the results of their (arguably scientific) research; IDK what they gain from this fearmongering.

ainch4mo ago

lowkey_4mo ago

shimman4mo ago

You really don't see how they can monetarily gain from "our models are so advanced they keep trying to trick us!"? Are tech workers this easily misled nowadays?

Reminds me of how scammers would trick doctors into pumping penny stocks for an easy buck during the 80s/90s.

behnamoh4mo ago

I know why they do it, that was a rhetorical question!

anon3738394mo ago

Correct. Anthropic keeps pushing these weird sci-fi narratives to maintain some kind of mystique around their slightly-better-than-others commodity product. But Occam’s Razor is not dead.

password43214mo ago· 3 in thread

20260128 https://news.ycombinator.com/item?id=46771564#46786625

> How long before someone pitches the idea that the models explicitly almost keep solving your problem to get you to keep spending? -gtowey

delichon4mo ago

MengerSponge4mo ago

Slightly Wrong Solutions As A Service

1 more reply

Invictus04mo ago

Worrying about this is like focusing on putting a candle out while the house is on fire

serf4mo ago· 3 in thread

>we're just teaching them how to pass a polygraph.

nwah14mo ago

That was the point. Look up Goodhart's Law

AndrewKemendo4mo ago

I have passed multiple CI polys

A poly is only testing one thing: can you convince the polygrapher that you can lie successfully

madihaaOP4mo ago

A polygraph measures physiological proxies (pulse, sweat) rather than truth. Similarly, RLHF measures proxy signals (human preference, output tokens) rather than intent.

crazygringo4mo ago· 2 in thread

What is this even in response to? There's nothing about "playing dead" in this announcement.

You're inventing some kind of "deceptive personality attribute" that is fiction, not reality. It's just not how models work.

moritzwarhier4mo ago

Personally I was thinking this is more similar to the "ruler issue", but at scale.

When the LLM is partly a black box, it could, in theory, mean that it has developed some heuristic to detect the environment it's run in, but this is not obvious to the developers?

But I agree about your main point... LLMs or AI in general as a black box behaving autonomously in some unexpected way is not something I currently fear.

The erratic behaviors are less of a problem than LLMs acting as obfuscators of bias and their own training data, I guess.

skybrian4mo ago

LLMs can learn from fiction. The "evil vector" research is sort of similar, though it's a rather blatant effect:

https://www.anthropic.com/research/persona-vectors

jazzyjackson4mo ago· 1 in thread

Stop assigning “I” to an llm, it confers self awareness where there is none.

Just because a VW diesel emissions chip behaves differently according to its environment doesn’t mean it knows anything about itself.

Mali-4mo ago

handfuloflight4mo ago· 1 in thread

Situational awareness or just remembering specific tokens related to the strategy to "play dead" in its reasoning traces?

marci4mo ago

Imagine an LLM trained on the best thrillers, spy stories, politics, history, manipulation techniques, psychology, sociology, sci-fi... I wonder where it got the idea for deception?

e12e4mo ago

Is this referring to some section of the announcement?

This doesn't seem to align with the parent comment?

skybrian4mo ago

Since chatbots have no right to privacy, they would need to be very intelligent indeed to work around this.

[1] https://alignment.openai.com/confessions/

NitpickLawyer4mo ago

> alignment becomes adversarial against intelligence itself.

It was hinted at (and outright known in the field) since the days of GPT-4; see the paper "Sparks of Artificial General Intelligence: Early experiments with GPT-4" (https://arxiv.org/abs/2303.12712)

reducesuffering4mo ago

anonym294mo ago

https://thegradient.pub/gpt-4chan-lessons/

coldtea4mo ago

>For a model to successfully "play dead" during safety training and only activate later, it requires a form of situational awareness.

Doesn't any model session/query require a form of situational awareness?

lowsong4mo ago

Whether this is useful in its current form is an entirely different topic. But don't mistake a tool for an intelligence with motivations or morals.

jack_pp4mo ago

There are a few viral shorts lately about tricking LLMs. I suspect they trick the dumbest models.

I tried one with Gemini 3 and it basically called me out in the first few sentences for trying to trick / test it but decided to humour me just in case I'm not.

surgical_fire4mo ago

This is marketing. You are swallowing marketing without critical thought.

LLMs are very interesting tools for generating things, but they have no conscience. Deception requires intent.

You are anthropomorphizing a machine.

hmokiguess4mo ago

"You get what you inspect, not what you expect."
