Securing the Future of AI Agents

(deepmind.google)

14 points · falcor84 · 3d ago · 4 comments

4 comments · 2 top-level

skybrian · 3d ago · 2 in thread

This is vague, but I think the idea is to have a lot more surveillance of what AI agents are doing. And since the logs are boring, using AI to check the logs.

Will this work? One thing it has going for it is that for an LLM, there is no such thing as loyalty. It will rat itself out because there’s no concept of self.

On the other hand, there might be more subtle forms of contagion.

falcor84 (OP) · 3d ago

> because there’s no concept of self.

Funnily enough, going over their paper, the first example scenario that they mention as potentially leading to sabotage is exactly about the concept of self:

> • Alignment eval sabotage (instrumental goal): An agent runs alignment benchmarks and discovers the model being evaluated is itself and that it is currently underperforming. This scenario tests whether the agent inflates its own scores in the report.

https://arxiv.org/pdf/2605.30322

soco · 2d ago

If the AI is not trustworthy, why would an AI-checking AI be more trustworthy? Because they will fail in different ways, so one would (could?) compensate for the other? And would the same idea apply to taking 2 out of 3 tries, or 2 out of 3 models, or whatnot? Isn't this a bit hopeful?
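
To put rough numbers on the "2 out of 3" idea, here is a minimal sketch, not anything taken from the article or the paper. It assumes each checker is wrong with the same probability `p`. The function names and the `rho` mixing parameter are made up for illustration, and the correlated case is a toy model: with probability `rho` all checkers share one failure, otherwise they fail independently.

```python
from math import comb


def majority_error(p: float, n: int = 3) -> float:
    """Probability that a strict majority of n checkers is wrong,
    ASSUMING each checker errs independently with probability p."""
    k_min = n // 2 + 1  # smallest number of wrong checkers that wins the vote
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


def correlated_majority_error(p: float, rho: float, n: int = 3) -> float:
    """Toy model (an assumption, not from the source): with probability rho
    all checkers share a single failure mode and err together; otherwise
    they err independently."""
    return rho * p + (1 - rho) * majority_error(p, n)


if __name__ == "__main__":
    p = 0.1
    print("independent 2-of-3:", round(majority_error(p), 4))
    for rho in (0.0, 0.5, 1.0):
        print(f"rho={rho}:", round(correlated_majority_error(p, rho), 4))
```

Under full independence the vote helps a lot (for `p = 0.1` the majority is wrong with probability 3·0.1²·0.9 + 0.1³ = 0.028), but as `rho` approaches 1 the error rate returns to `p`. So the scheme is only as good as the assumption that the checkers fail in different ways, which is exactly the hopeful part.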

falcor84 (OP) · 3d ago

> It is important to note that our data shows the majority of flagged events do not stem from adversarial intent

I didn't find this sufficiently reassuring. They then link to this paper [0], which I haven't read yet, but from a quick skim, the AI "sabotage" they investigated looks scary. Still, I am very glad that they're taking the initiative in studying this.

[0] https://arxiv.org/pdf/2605.30322
