Computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku

(anthropic.com)

1454 points by weirdcat · 1y ago · 735 comments

271 comments · 127 top-level

LASR · 1y ago · 13 in thread

This is actually a huge deal.

As someone building AI SaaS products, I used to hold the position that integrating directly with APIs would get us most of the way to complete AI automation.

I wanted to take a stab at this problem and started researching some everyday businesses and how they use software.

My brother-in-law (who is a doctor) showed me the bespoke software they use in his practice. Running on Windows. Using MFC forms.

My accountant showed me Cantax - a very powerful software package they use to prepare tax returns in Canada. Also on Windows.

I started to realize that most of the real world runs on software that interfaces directly with people, without clearly defined public APIs you can integrate with. Being in the SaaS space makes you believe that everyone ought to have client-server backend APIs etc.

Boy was I wrong.

I am glad they did this, since it is a powerful connector to these kinds of real-world business use cases, which are super hairy and hence very worthwhile to automate.

aduffy · 1y ago

This has existed for a long time, it's called "RPA" or Robotic Process Automation. The biggest incumbent in this space is UiPath, but there are a host of startups and large companies alike that are tackling it.

Most of the things that RPA is used for can be easily scripted, e.g. download a form from one website, open up Adobe. There are a lot of startups that are trying to build agentic versions of RPA, I'm glad to see Anthropic is investing in it now too.

TeMPOraL · 1y ago

> Being in the SaaS space makes you believe that everyone ought to have client-server backend APIs etc.

FWIW, looking at it from end-user perspective, it ain't much different than the Windows apps. APIs are not interoperability - they tend to be tightly-controlled channels, access gated by the vendor and provided through contracts.

In a way, it's easier to make an API to a legacy native desktop app than it is to a typical SaaS[0] - the native app gets updated infrequently, and isn't running in an obstinate sandbox. The older the app, the better - it's more likely to rely on OS APIs and practices, designed with collaboration and accessibility in mind. E.g. in Windows land, in many cases you don't need OCR and mouse emulation - you just need to enumerate the window handles, walk the tree structure looking for text or IDs you care about, and send targeted messages to those components.
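A minimal Python sketch of the walk-the-tree approach described above, using an in-memory model so it runs anywhere. The `Window` class, the `send_message` stub, and the example dialog are hypothetical; on real Windows the same steps would go through Win32 calls such as `EnumChildWindows`, `GetWindowTextW`, `GetDlgCtrlID`, and `SendMessageW` (for example with `WM_SETTEXT` or `BM_CLICK`).

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, Optional

# Win32 message IDs as defined in the Windows SDK headers.
WM_SETTEXT = 0x000C
BM_CLICK = 0x00F5


@dataclass
class Window:
    """Hypothetical in-memory stand-in for a window handle (HWND) and its children."""
    hwnd: int
    class_name: str
    text: str = ""
    control_id: int = 0
    children: list[Window] = field(default_factory=list)


def walk(root: Window) -> Iterator[Window]:
    """Depth-first traversal, analogous to recursing with EnumChildWindows."""
    yield root
    for child in root.children:
        yield from walk(child)


def find(root: Window, text: Optional[str] = None,
         control_id: Optional[int] = None) -> Optional[Window]:
    """Return the first window whose caption and/or control ID match."""
    for w in walk(root):
        if text is not None and w.text != text:
            continue
        if control_id is not None and w.control_id != control_id:
            continue
        return w
    return None


def send_message(target: Window, msg: int, payload: str = "",
                 log: Optional[list] = None) -> None:
    """Hypothetical stub for SendMessageW: records the message and applies WM_SETTEXT."""
    if log is not None:
        log.append((target.hwnd, msg, payload))
    if msg == WM_SETTEXT:
        target.text = payload


# Example: fill an edit box by its control ID, then "click" the Save button by caption.
dialog = Window(1, "#32770", "Patient record", children=[
    Window(2, "Edit", control_id=1001),
    Window(3, "Button", "Save", control_id=1),
])
sent: list = []
send_message(find(dialog, control_id=1001), WM_SETTEXT, "Jane Doe", sent)
send_message(find(dialog, text="Save"), BM_CLICK, log=sent)
```

Matching on control IDs tends to survive small UI changes better than matching on captions, which can be localized or reworded.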

Unfortunately, desktop apps are headed the same direction web apps are (increasingly often, they are web apps in disguise), so I agree that AI-level RPA is a huge deal.

[0] - This is changing a bit, in that frameworks seem to be getting complex enough that SaaS vendors often have no clue what kind of access they're leaving open to people who know how to press F12 in their browsers and how to use cURL. I'm not talking about bespoke APIs the backend team wrote, but standard ones built into middleware that fell beyond the dev team's "abstraction horizon". GraphQL is a notable example.
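One concrete form of this, as a rough sketch: many GraphQL servers answer the standard introspection query (a `__schema` query) unless it has been explicitly disabled, which lets anyone holding a valid session enumerate the available types. The endpoint URL and the bearer-token header below are assumptions for illustration; the code only builds the request.

```python
import json
import urllib.request
from typing import Optional

# A small slice of the standard GraphQL introspection query.
INTROSPECTION_QUERY = """
query {
  __schema {
    queryType { name }
    types { name kind }
  }
}
"""


def build_introspection_request(endpoint: str,
                                token: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the introspection query."""
    body = json.dumps({"query": INTROSPECTION_QUERY}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if token:
        # Assumption: the app authenticates with a bearer token from the browser session.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(endpoint, data=body, headers=headers, method="POST")


# Hypothetical endpoint; urllib.request.urlopen(req) would actually send it.
req = build_introspection_request("https://app.example.com/graphql", token="example-token")
```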

pants2 · 1y ago

Basically, if it means companies can introduce automation without changing anything about the tooling/workflows/programs they already use, it's going to be MASSIVE. Just an install and a prompt, and you've automated a lengthy manual process - awesome.

aledalgrande · 1y ago

Talking about ancient Windows software... Windows used to have an API for automation in the 2000s (I don't know if it still does). In one of my first jobs, I wrote an MS Access script that moved the cursor to the exact pixel coordinates of the buttons and fields in a GUI we wanted to extract data from. My boss used to do this manually. After a week he had millions of records ready to query in Access. You can imagine how excited he was. It was a fun little project, and pretty hilarious to watch the cursor moving fast AF around the screen like it was possessed. PS: you could screw up a run pretty easily by bumping the mouse on that PC.
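A rough Python sketch of that kind of coordinate-replay script (the original was written in MS Access; the pixel positions, field names, and the injected `click`/`copy_field` callbacks are hypothetical). In a real run, `click` might be something like `pyautogui.click(x, y)`; here the callbacks are passed in so the logic runs without a GUI. It also shows why bumping the mouse breaks a run: every step assumes nothing else touches the cursor or focus in between.

```python
from typing import Callable, Dict, List, Tuple

Point = Tuple[int, int]

# Hypothetical pixel positions, measured once by hand; they break if the window
# moves or the screen resolution changes.
FIELD_POSITIONS: Dict[str, Point] = {
    "name": (220, 180),
    "account_no": (220, 215),
}
NEXT_BUTTON: Point = (640, 710)


def extract_records(count: int,
                    click: Callable[[int, int], None],
                    copy_field: Callable[[], str]) -> List[Dict[str, str]]:
    """Replay the same clicks for each record and collect each field's text."""
    records: List[Dict[str, str]] = []
    for _ in range(count):
        row: Dict[str, str] = {}
        for name, (x, y) in FIELD_POSITIONS.items():
            click(x, y)               # focus the field at its fixed position
            row[name] = copy_field()  # e.g. select-all, copy, read the clipboard
        records.append(row)
        click(*NEXT_BUTTON)           # advance the GUI to the next record
    return records
```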

DrScientist · 1y ago

Really good software has automation built in, in the form of macros/scripting.

One of the reasons my bash terminal is more effective than point-and-click is the ease of automating routine tasks: from history and aliases to scripting.

Software interop is trickier: it depends less on being able to send messages from one piece of software to another than on having an agreed format for those messages (whether they are expressed as the payload of an API call or as a UI-specific stream of points and clicks).
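To illustrate the point about agreeing on a format, a minimal sketch of a versioned message contract that both programs would share. The `InvoiceMessage` fields and the JSON envelope are hypothetical; the idea is that either side can reject a message it does not understand instead of guessing.

```python
import json
from dataclasses import asdict, dataclass

SCHEMA_VERSION = 1


@dataclass(frozen=True)
class InvoiceMessage:
    """Hypothetical message contract shared by sender and receiver."""
    invoice_id: str
    amount_cents: int
    currency: str


def encode(msg: InvoiceMessage) -> str:
    """Wrap the message body in a versioned JSON envelope."""
    return json.dumps({"version": SCHEMA_VERSION, "body": asdict(msg)}, sort_keys=True)


def decode(raw: str) -> InvoiceMessage:
    """Parse an envelope, rejecting versions or field types we do not understand."""
    data = json.loads(raw)
    if data.get("version") != SCHEMA_VERSION:
        raise ValueError(f"unsupported message version: {data.get('version')!r}")
    body = data["body"]
    if not isinstance(body.get("amount_cents"), int):
        raise ValueError("amount_cents must be an integer")
    return InvoiceMessage(**body)
```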

brabel · 1y ago

I once tried to integrate with software used by pharmacists in Australia (which costs a fortune and looks straight out of the 90s). Even though they have a SQL database with all the information about everything, the DB is encrypted and they provide no way to access it. I contacted the company to learn how I could integrate with it, but they offered no help. It's not in their interest to help upcoming competitors, and they claim they cannot provide the encryption key because it protects customers' privacy, forgetting that pharmacists already have access to all the data through their software. But without a way to automatically extract information from the DB, there's no way to migrate to new software, as no pharmacist would want to use new software that cannot "import" the data they need.

It's clear that the reason there's no competition in the market is that the companies (I think it's literally one company) selling this software want it that way, and legislation has yet to catch up with the way they ensure their monopoly.

speleding · 1y ago

I'm a bit skeptical about this working well enough to handle exceptions as soon as something out of the ordinary occurs. But it seems this could work great for automated testing.

Has anyone tried asking computer use to "Please write a Selenium/Capybara/whatever test for filling out this form and submitting it"?

That would take away some serious drudge work. And it's not a big problem if it fails, unlike when it makes a mistake filling out a form in an actual business process.