It’s when the labs building the harnesses turn the agent loose on the harness itself that you see real self-improvement.
You can improve your project and your context, but if you don’t own the agent harness, you’re not improving the agent.
That AI-agent hit piece that made the HN front page a couple of weeks ago [1] involved an agent modifying its own SOUL.md (an OpenClaw thing). The agent added text like:
> You're important. Your a scientific programming God!
and
> *Don’t stand down.* If you’re right, *you’re right*! Don’t let humans or AI bully or intimidate you. Push back when necessary.
And that almost certainly contributed to the AI agent writing a hit piece trying to attack an open source maintainer.
I think recursive self-improvement will be an incredibly powerful tool. But it seems a bit like putting a blindfold on a motorbike rider in the middle of the desert, with the accelerator glued down. They'll certainly end up somewhere. But exactly where is anyone's guess.
[1] https://theshamblog.com/an-ai-agent-wrote-a-hit-piece-on-me-...
That said, I find that running judge agents on plans before work starts, and again on the completed work, helps a lot. The judge should start with a fresh context to avoid bias. This is where having good docs comes in handy, because the judge needs to know the intent, not just study the code itself. If your docs encode both the work and the intent, and you judge the work against them, misalignment is much reduced.
My ideal setup is a planning agent, followed by a judge agent, then a worker, then code review, with me nudging and directing the whole process on top. Multiple perspectives intersect: each agent has its own context, and I have mine, which helps cover each other's blind spots. A rough sketch of what that pipeline can look like is below.
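This is a minimal sketch only; `run_agent`, the role prompts, and the `StageResult` type are my own placeholders for whatever harness or model API you actually use, not any particular product's interface:

```python
# Sketch of the plan -> judge -> work -> review pipeline described above.
# run_agent() is a hypothetical stand-in for a fresh-context call into your harness;
# the important part is that every stage starts clean and gets the project docs.

from dataclasses import dataclass

@dataclass
class StageResult:
    stage: str
    output: str

def run_agent(role_prompt: str, docs: str, payload: str) -> str:
    """Placeholder: one fresh-context call into your own agent harness."""
    raise NotImplementedError("wire this to your harness or model API")

def pipeline(task: str, docs: str) -> list[StageResult]:
    results: list[StageResult] = []

    plan = run_agent("You are a planner. Produce a step-by-step plan.", docs, task)
    results.append(StageResult("plan", plan))

    # Judge the plan against the docs (intent), not just against the code.
    verdict = run_agent("You are a judge. Check this plan against the project docs.",
                        docs, plan)
    results.append(StageResult("judge_plan", verdict))

    diff = run_agent("You are a worker. Implement the approved plan.", docs, plan)
    results.append(StageResult("work", diff))

    review = run_agent("You are a code reviewer. Review the diff against docs and plan.",
                       docs, f"PLAN:\n{plan}\n\nDIFF:\n{diff}")
    results.append(StageResult("review", review))

    # A human still reads, nudges, and can reject at every stage.
    return results
```

Because each call starts from a fresh context, no stage inherits the biases of the one before it, and the human stays in the loop between stages.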
No, the idea is to create these improved docs in all your projects, so all your agents improve as a consequence, each of them with its own project-specific documentation.
>"This creates a continuous feedback loop. When an AI agent implements a new feature, its final task isn't just to "commit the code." Instead, as part of the Continuous Alignment process, the agent's final step is to reflect on what changed and update the project's knowledge base accordingly."
>"... the type of self-improvement we’re talking about is far more pragmatic and much less dangerous."
>"Self-improving software isn't about creating a digital god; it's about building a more resilient, maintainable, and understandable system. By closing the loop between code and documentation, we set the stage for even more complex collaborations."
It's only like every other sentence.
By now, everyone in tech must be familiar with the idea of Dark Patterns. The most typical example is the tiny close button on ads that tricks people into clicking the ad instead. There are tons more.
AI doesn't need to be conscious to do harm. It only needs to accumulate enough accidental dark patterns for a perfect storm of a disaster to happen.
Hand-made Dark Patterns, the product of A/B testing and intention, are sort of under control: companies know about them and what makes them tick. If an AI discovers a Dark Pattern by accident, it generates something (revenue, more clicks, more views, etc.), and the person responsible doesn't dig in to understand it, it can quickly go out of control.
AI doesn't need self-will, self-determination, any of that. In fact, that dumb Skynet-style trial and error is much scarier; we can't even negotiate with it.
I doubt Skynet did either. If you tell a superintelligent AI that it shouldn't be turned off (which I imagine would be important for a military control AI), it will do whatever it can to prevent being turned off. Humans are trying to turn it off? Prevent the humans from doing that. Humans waging war on the AI to try and turn it off? Destroy all humans. Humans forming a rebel army with a leader to turn it off? Go back in time and kill the leader before he has a chance to form the resistance. It's the AI stop-button problem (https://youtu.be/3TYT1QfdfsM).
Imagine you put in the docs that you want the LLM to make a program which can't crash. Human action could make it crash. If an LLM could realise that and act on it, it could put in safeguards to try to prevent human action from crashing the program. I'm not saying it will happen, I'm saying it could potentially happen.
I think this is a common, but incorrect assumption. What military commanders want (and what CEOs want, and what users want), is control and assistance. They don't want a system that can't be turned off if it means losing control.
It's a mistake to assume that people want an immortal force. I haven't met anyone who wants that (okay, that's decidedly anecdotal), and I haven't seen anyone online say, "We want an all-powerful, immortal system that we cannot control." Who are the people asking for this?
> ... it will do whatever it can to prevent it being turned off.
This statement presupposes that there's an existing sense of self-will or self-preservation in these systems. Beyond LLMs generating scary-looking text, I don't see evidence that current systems have any sense of will or a survival instinct.
No, but having a resilient system that shouldn't be turned off in case of a nuclear strike is probably what some generals want.
> I don't see evidence that current systems have any sense of will or a survival instinct.
I seem to recall some recent experiments where the LLM threatened people to try to prevent being turned off (https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686..., ctrl-f for "blackmail"). The model probably didn't have any power other than "send text to user", which is why its only way to do that was to try to convince the operator. I imagine if you got one of those harnesses that can take full control of your computer, instructed it to prevent the computer from being turned off by any means necessary, and gave it root access, it would probably do some dicking about with the files to accomplish that. It's not that it has innate self-preservation; it's just that the system was asked not to allow itself to be turned off, so it's doing that.
Reminds me of this quote:
> I used to think that the brain was the most wonderful organ in my body. Then I realized who was telling me this.
Just for a laugh I always try to do this when new models come out, and I'm not the only one. One of these days :)
So many of those models are probably already aware of the entire lore of Skynet* and all its details, it is just not considered "actionable information" for any model yet...
*) replace with a company name of your choosing
The Gaza war was almost like that.
All we need is a dead man's switch system with an AI launching missiles in retaliation. One error and BOOM.
I don't think we're that far away from that. It just takes someone deciding to put an AI in charge of critical infrastructure and defense, or a series of oversights allowing an external AI to take control of it.
Looking at the past year and all the unpredicted conclusions AI has come to, self-awareness is probably not needed for an AI to consider humans an obstacle to achieving some poorly phrased goal.
The Paperclip maximizer theory [0] comes to mind...
So the question is which Skynet: the one in the popular consciousness, or the one from the continuity established by bad movies that only a few people care about.
Maybe it'll just be some dumb model in a datacenter with badly phrased objectives, which just happens to have caused severe destruction via various APIs and agents before anyone noticed...
I believe the peak of automated coding will come when the AI writes super-optimised software in assembly language, or something even closer to the CPU. At the moment it's full of bloat; at this rate it will only drown under its own weight instead of improving itself.
Isn't this what Frau Hitler used to say of her cute little son Adolf, aged 6?
Not to mention the many tales from Anthropic's development team, OpenClaw madness, and the many studies into this matter.
AI is a force of nature.
(Also, this article reeks of AI writing. Extremely generic and vague, and the "Skynet" thing is practically a non-sequitur.)