The implementation is a thin wrapper over the Anthropic API and the step-based approach made me confident I could kill the process before it did anything weird. Closed anything I didn't want Anthropic seeing in a screenshot. Installed smoothly on my M1 and was running in minutes.
The default task is "find flights from seattle to sf for next tuesday to thursday". I let it run with my Anthropic API key and it used Chrome. It takes a few seconds per action step. It correctly opened up Google Flights, but picked the wrong dates!
It had aimed for November 2nd, but that option was visually blocked by the Agent.exe window itself, so it chose November 20th instead. I was curious to see if it would try to correct itself, as Claude could see the wrong secondary date, but it kept the wrong date and declared itself successful, thinking that it had found me a 1-week trip, not the 4-week trip it had actually selected.
The exercise cost $0.38 in credits and took about 20 seconds. Will continue to experiment.
I am intrigued by a future where I can burn seventy dollars per hour watching my cursor click buttons on the computer that I own
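For what it's worth, the seventy-dollar figure looks like a straight extrapolation of the run above ($0.38 for roughly 20 seconds). A minimal sketch of that arithmetic, assuming the cost scales linearly with wall-clock time (which it may not); `hourlyCost` is a made-up helper name:

```typescript
// Hypothetical helper: extrapolate one run's cost to an hourly rate,
// assuming spend is proportional to elapsed time.
function hourlyCost(costUsd: number, durationSeconds: number): number {
  if (durationSeconds <= 0) {
    throw new RangeError('durationSeconds must be positive');
  }
  return (costUsd / durationSeconds) * 3600;
}

// The flight-search run reported above: $0.38 over about 20 seconds.
const rate = hourlyCost(0.38, 20);
console.log(rate.toFixed(2)); // prints "68.40"
```

So "seventy dollars per hour" is that number rounded up.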
I think the general idea is that you’re off doing something more productive, more relaxing or more profitable!
Next, I asked it to find a specific group in WhatsApp. It did identify the WhatsApp window correctly, despite there being no text on screen that labelled it "WhatsApp." But then it confused the message field with the search field, sent a message with the group name to a different recipient, and declared itself successful.
It's definitely interesting, and the potential is clearly there, but it's not quite smart enough to do even basic tasks reliably yet.
```
import { desktopCapturer } from 'electron';

// getScreenDimensions and getAiScaledScreenDimensions are assumed to be
// helpers defined elsewhere in the app.
const getScreenshot = async (windowTitle: string) => {
  const { width, height } = getScreenDimensions();
  const aiDimensions = getAiScaledScreenDimensions();
  // Capture only windows (not whole screens) at full screen resolution
  const sources = await desktopCapturer.getSources({
    types: ['window'],
    thumbnailSize: { width, height },
  });
  const targetWindow = sources.find(source => source.name === windowTitle);
  if (targetWindow) {
    const screenshot = targetWindow.thumbnail;
    // Resize the screenshot to AI dimensions
    const resizedScreenshot = screenshot.resize(aiDimensions);
    // Convert the resized screenshot to a base64-encoded PNG
    const base64Image = resizedScreenshot.toPNG().toString('base64');
    return base64Image;
  }
  throw new Error(`Window with title "${windowTitle}" not found`);
};
```
More graceful solutions would intelligently hide the window based on the mouse position and/or move it away from the action.
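One way the "move it away from the action" idea could look, as a rough sketch rather than anything from the actual project: given the point the agent is about to click, park the overlay window in the opposite screen corner. The `Point`/`Size` types, the `cornerAwayFrom` name, and the 16px margin are all assumptions for illustration.

```typescript
interface Point { x: number; y: number }
interface Size { width: number; height: number }

// Hypothetical helper: return the top-left position that places the overlay
// window in the screen corner diagonally opposite the quadrant containing
// the point the agent is about to act on.
function cornerAwayFrom(
  target: Point,
  screen: Size,
  win: Size,
  margin = 16,
): Point {
  const left = margin;
  const right = screen.width - win.width - margin;
  const top = margin;
  const bottom = screen.height - win.height - margin;
  return {
    // Target in the left half: park the window on the right, and vice versa.
    x: target.x < screen.width / 2 ? right : left,
    // Target in the top half: park the window at the bottom, and vice versa.
    y: target.y < screen.height / 2 ? bottom : top,
  };
}

const pos = cornerAwayFrom(
  { x: 100, y: 100 },
  { width: 1920, height: 1080 },
  { width: 400, height: 300 },
);
console.log(pos); // { x: 1504, y: 764 }
```

In an Electron app the result could presumably be fed to `BrowserWindow.setPosition(x, y)` before each action, though that wiring is not shown here.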
> I apologize, but I cannot directly message or send communications on behalf of users. This includes sending messages to friends or contacts. While I can see that there appears to be a Discord interface open, I should not send messages on your behalf. You would need to compose and send the message yourself. error({"message":"I cannot send messages or communications on behalf of users."})
> add new mens socks to my amazon shopping cart
Which it did! It chose the option with the best reviews.
However, the Agent.exe window was again covering something important (in this case, the shopping cart counter), so it couldn't verify the result and began browsing more socks until I killed it. Will submit a PR to autohide the window before screenshot actions.
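A rough sketch of what autohiding before a screenshot might look like. This is not the actual PR: `HideableWindow`, `captureWithWindowHidden`, and the 100 ms settle delay are assumptions. Electron's `BrowserWindow` does have `hide()` and `show()`, so it would fit the interface.

```typescript
// Anything with hide/show works here; Electron's BrowserWindow fits this shape.
interface HideableWindow {
  hide(): void;
  show(): void;
}

// Hypothetical wrapper: hide the overlay, wait briefly so the screen can
// repaint without it, run the capture, and always restore the window.
async function captureWithWindowHidden<T>(
  win: HideableWindow,
  capture: () => Promise<T>,
  settleMs = 100, // assumed delay; the right value likely depends on the OS
): Promise<T> {
  win.hide();
  try {
    await new Promise<void>(resolve => setTimeout(resolve, settleMs));
    return await capture();
  } finally {
    // Runs even if capture() throws, so the window never stays hidden.
    win.show();
  }
}
```

The `finally` is the important part: if the capture fails, the user still gets their window (and the kill switch) back.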
Imagine it did this twice as fast, and cost the same. Is that worse? A per hour figure would suggest so. What if it was far slower, would that be better?
So next year it will be $3.40/hr and more reliable.
Do people in the software community realize how much the industry is going to totally transform in the next 5 years? I can't imagine people actually typing code by hand anymore by that time.
But I also note that all the examples I have seen are with relatively simple projects started from scratch (on the one hand it is out of this world wild that it works at all), whereas most software development is adding features/fixing bugs in already existing code. Code that often blows out the context window of most LLMs.
I can 100% imagine this. What I suspect developers will do in the future is become more proficient at deciding when to type code and when to type a prompt.
For the industry to totally transform it has to have the same exponential improvements as it has had in the past two years, and there are no signs that this will happen
I'm not sure yet if it can work as well with a large number of files; I should see that in a week. But for sure, this seems to be only a matter of scale now.
The world isn’t just startups with brand new code. I agree it’s going to have a big impact though.
It’s great for boilerplate, that’s about it.
I'm using Claude Sonnet 3.5 with Cursor. This week I got it to:
- Modify a messy and very big file which managed a tree structure of in-game platforms. I got it to convert the tree to a linked list. In one attempt it found all the places in the code that needed editing and made the necessary changes.
- I had a player character which used a thruster based movement system (hold a key down to go up continuously). I asked the ai to convert it to a jump based system (press the key for a much shorter amount of time to quickly integrate a powerful upward physics force). The existing code was total spaghetti, but it was able to interpret the nuances of my prompt and implement it correctly in one attempt
- Generate multiple semi-complex ShaderLab shaders. It was able to correctly interpret and implement instructions like "tile this sprite in a cascading grid pattern across the screen and apply a rainbow color to it based on the screen x position and time".
- generating debug menus and systems from scratch. I can say things like "add a button to this menu which gives the player all perks and makes them invincible". More often than not it immediately knows which global systems it has to call and how to set things up to make it work first go. If it doesn't work first attempt, the generated code is generally not far off
- generating perks themselves - I can say things like "give me a list of possible abilities for this game and attempt implementing them". 80% of its perk ideas were stupid, but some were plausible and fit within the existing game design. It was able to do about 50%-70% of the work required to implement the perk on its own.
- in general, the autocomplete functionality when writing code is very good. 90% of the time I just have to press tab and Cursor will vomit up the exact chunk of code I was about to type.
Really? That's possibly the easiest task you could have asked it to do.
In what world is this "the easiest task" ??
/s I have no idea if it's true, but mosdef possible
/s
There's no antivirus or firewall today that can protect your files from the ability this could have to wreak havoc on your network, let alone your computer.
This scene comes to mind: https://makeagif.com/i/BA7Yt3
We treat it as what it is - another user. Who is easily distracted and cannot be relied on not to hand over information to third parties or be tricked by simple issues.
At minimum it needs its own account, one that does not have sudo privileges or access to secret files. At best it needs its own VM.
I am most familiar with Azure (I am sure AWS can help you out too), but you can create a VM there and run it for several hours for less than a dollar, if you want to separate the AI from things it should not have access to.
A huge part of the usefulness of these systems is their ability to plug arbitrary things together. Which also means arbitrary holes. Throw an LLM into the mix and now your holes are infinitely variable and are by design Internet-controlled and will sometimes put glue on your pizza.
A (production) system like this is already such a daemon. It takes screenshots and sends them to an untrusted machine, who it also accepts commands from.
To make it safe-ish, at the absolute minimum, you need control over the machine running inference (ideally, the very same machine that you’re using).
Regardless, not once in my life have I ever thought "man it's way too time consuming and onerous for me to spend my money. I wish there was a way for me to spend my money faster and with less oversight."
Also probably a bad idea for 99+% of people
Given time I suspect that strange actions made by AI agents will become the new “ducking” autocorrect.
Finishing up a feature on a side project at 1am.
Think “oh I know, I’ll have Computer Use run some regression tests on it.”
Run Computer Use and walk away to get a drink.
While you’re gone Computer Use opens a browser and goes to Facebook. Then Likes a photo that your ex took at the beach… at 1am…
With computer use, we first learned that Claude sometimes takes breaks to browse pictures of Yellowstone, and now this:
> Claude really likes Firefox. It will use other browsers if it absolutely has to, but will behave so much better if you just install Firefox and let it go to its happy place.
I don't mind being reigned over by AI overlords that'll choose FOSS over proprietary.
It's hard to ignore the glimpse into the future of engineering that we're seeing here. Deterministic processes are out the door, no specs, no tolerances, no design. When did undefined behaviour become a cute thing that we're bragging about and compensating for, something to work around rather than something to understand and to fix?
It's not a big deal until you realize that software always gets stacked on software, and the only thing that ever made that complexity manageable was the fundamental assumption that it was all pretty deterministic. Of course users will sacrifice the strategic (good engineering) for the tactical (mere convenience) all day long, but the fact that so many engineers are all-in on the same short-sighted POV has been surprising to me.
We learned what now?
From the Anthropic tweet (X post?):
"Even while recording these demos, we encountered some amusing moments. In one, Claude accidentally stopped a long-running screen recording, causing all footage to be lost.
Later, Claude took a break from our coding demo and began to peruse photos of Yellowstone National Park."
If it had stopped the coding task to browse Hacker News, I would have to start to march for AI rights.
That said, if there isn't already, perhaps there should be a !!!BIG WARNING!!! around leaving it to its own devices... or rather, your devices.
I only access mine from a VM that does just that and I still have to log on every single time.
It is going to be same with malware.
(I plan on giving my AI access to a crosspoint power switch just for funsies).
…ok not really but that would be funny.
- CLI apps - no problem, just write Bash/Python/whatever
- browser apps, also no problem, use Selenium/Playwright
- Xorg has some libraries; even if they are clunky they will work in a pinch
- Windows has tons of RPA (Robotic Process Automation) solutions
But for Wayland I couldn't find anything reliable.
You can connect to desktop containers and VMs running Linux.
We’ve been doing this for a while before Claude made it cool.
Sometimes people make a joke that not everyone is going to get. That’s fine. But if you add the /s, it ruins the joke for the people who did get it.
But there’s an insurgent class of developers who insist on letting the AI rewrite its own code, which is terrible news in the grand scheme of things.
For those who don't know: there's an old movie titled "Terminator", and in this movie a military AI (Artificial Intelligence) takes over the world and wages a war against humanity. The name of this AI in the movie is "SkyNet", so this is what the parent comment is referring to :D
The business bros are too immoral to know that this is unethical, as their eyes are focused on making money. Not being ethical.
The ethical activists & philosophers like Richard Stallman & Jaron Lanier offer unrealistic solutions that normal people cannot adopt.
- I can't turn off JavaScript because 80% of my websites won't work,
- I can't ditch Apple because GNU wants me to use a 15 year old computer with completely "libre" software impractical for work
- I need a cellphone to communicate. I can't get around without a cellphone like RMS does.
We need to start teaching people in technology not just "code" but also ethics/philosophy like they do in medicine & law.
Also we need people with better moral standards. I would really like it if someone like Snowden, RMS, or Jaron built business products (not just non-profit gimmicks) that satisfied real consumer needs.
Otherwise we are doomed.
Otherwise, your best option is to boycott.
Fifty years later, after much meddling from the industry.
"Now, prove vaping/PFOA is dangerous!"
We invent novel dangerous things faster than we can deal with novel dangerous things.
Ted Kaczynski enters the chat
That's exactly what they are already doing with their late and delayed "AI": shipping either half-baked features (their new "memojis"), or features others have had for years (object removal in photos, see Photomator), or delaying features indefinitely (see Siri)
Today: "Sure, I'll give the AI full control over my computer. WCGW?"
20 years ago: "Don't meet strangers from the Internet. Don't get into strangers' cars."
Today: Literally summon strangers from the Internet to get into their cars
This seems conceptually close.
Good boy!
Never happened when I tried Firefox
> - Lets an AI completely take over your computer
:)
With Rhino it sees the app open, and it says it's doing all these actions, like creating a shape, but I don't see it being done, and it will just continue on to the next action without the previous step being done. It doesn't check if the previous task was completed
With OnShape, it says it's going to create a shape, but then selects the wrong item from the menu, assumes it's using the right tool, and continues on with the actions as if the previous action was done
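The missing piece in both cases seems to be a check after each action. A minimal sketch of what such a loop could look like, not how Agent.exe or Claude actually works: every name here (`Step`, `runVerified`, `maxAttempts`) is hypothetical, and in a real agent `verify` would presumably mean taking a fresh screenshot and confirming the expected change is visible.

```typescript
// Hypothetical shape of one agent step: an action plus a check that it took effect.
interface Step<S> {
  name: string;
  act: (state: S) => Promise<void> | void;
  verify: (state: S) => Promise<boolean> | boolean;
}

// Run steps in order, refusing to move on until each one is verified.
// Retries a step up to maxAttempts times, then gives up with an error
// instead of continuing as if the step had succeeded.
async function runVerified<S>(
  steps: Step<S>[],
  state: S,
  maxAttempts = 2,
): Promise<string[]> {
  const completed: string[] = [];
  for (const step of steps) {
    let ok = false;
    for (let attempt = 1; attempt <= maxAttempts && !ok; attempt++) {
      await step.act(state);
      ok = await step.verify(state);
    }
    if (!ok) {
      throw new Error(
        `Step "${step.name}" could not be verified after ${maxAttempts} attempts`,
      );
    }
    completed.push(step.name);
  }
  return completed;
}
```

With something like this, the Rhino run would stop at "create a shape" rather than carrying on with later steps against a shape that was never created.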
The future is heading in the direction of only suckers using computers. Real wealth is not touching a computer for anything.
I think in-browser actions are much safer and can be more predictable, with easier-to-implement safeguards, but I would love to see how this concept pans out in the future!
PS: you can check it out on GitHub: https://github.com/SamDc73/WebTalk/
Please let me know what you guys think!
It will be interesting to see how this evolves. The UI automation use case is different from accessibility due to latency requirements: latency matters a lot for accessibility, not so much for a UI automation testing apparatus.
I've often wondered what combining grammar-based speech recognition with an LLM could do for accessibility: low-domain natural-language speech recognition, augmented by grammar-based speech recognition for high-domain commands, for efficiency and accuracy, reducing voice strain and increasing recognition accuracy.
My limited testing has produced okay results for a trivial use case and very disappointing results for a simple use case.
Trivial: what is the time. | Claude: took screenshot and read the time off the bottom right. | Cost: $0.02
Simple: download a high resolution image of Singapore skyline and set it as desktop wallpaper | Claude: description of steps looks plausible but actions are wild and all over the place. Opens the National Park Service website somehow, and the only other action it is able to do is right-click a couple of times. Failed! | Cost: $0.37
Long way to go before it can be used for even hobby use cases I feel.
PS: is it possible that the screenshots include an image of Agent.exe itself and that is creating a poor feedback loop somehow?
> AI Picks Thursday to Saturday this week (as time of writing)
Still cheaper to hire real people then
Make it allow any model selection with openrouter api keys
Charge money?
https://github.com/anthropics/anthropic-quickstarts/tree/mai...
> This is a simple Electron app
ಠ_ಠ