As someone building AI SaaS products, I used to have the position that directly integrating with APIs is going to get us most of the way there in terms of complete AI automation.
I wanted to take a stab at this problem and started researching some everyday businesses and how they use software.
My brother-in-law (who is a doctor) showed me the bespoke software they use in his practice. Running on Windows. Using MFC forms.
My accountant showed me Cantax - a very powerful software package they use to prepare tax returns in Canada. Also on Windows.
I started to realize that pretty much most of the real world runs on software that directly interfaces with people, without clearly defined public APIs you can integrate into. Being in the SaaS space makes you believe that everyone ought to have client-server backend APIs etc.
Boy was I wrong.
I am glad they did this, since it is a powerful connector to these types of real-world business use cases that are super-hairy, and hence very worthwhile in automating.
Most of the things that RPA is used for can be easily scripted, e.g. download a form from one website, open up Adobe. There are a lot of startups that are trying to build agentic versions of RPA, I'm glad to see Anthropic is investing in it now too.
FWIW, looking at it from end-user perspective, it ain't much different than the Windows apps. APIs are not interoperability - they tend to be tightly-controlled channels, access gated by the vendor and provided through contracts.
In a way, it's easier to make an API to a legacy native desktop app than it is to a typical SaaS[0] - the native app gets updated infrequently, and isn't running in an obstinate sandbox. The older the app, the better - it's more likely to rely on OS APIs and practices, designed with collaboration and accessibility in mind. E.g. in Windows land, in many cases you don't need OCR and mouse emulation - you just need to enumerate the window handles, walk the tree structure looking for text or IDs you care about, and send targeted messages to those components.
Unfortunately, desktop apps are headed the same direction web apps are (increasingly often, they are web apps in disguise), so I agree that AI-level RPA is a huge deal.
--
[0] - This is changing a bit in that frameworks seem to be getting complex enough that SaaS vendors often have no clue as to what kind of access they're leaving open to people who know how to press F12 in their browsers and how to call cURL. I'm not talking bespoke APIs backend team wrote, but standard ones built into middleware, that fell beyond dev team's "abstraction horizon". GraphQL is a notable example.
One of the reasons my bash terminal is more effective than point and click is the ease of automation of routine tasks: from history and aliases to scripting.
Software interop is trickier, as it doesn't depend so much on being able to send messages from one piece of software to another; you need an agreed format for those messages (whether they are expressed in the payload of an API or in a UI-specific stream of points and clicks).
It's clear that the reason there's no competition in the market is that the companies (I think it's literally one company) selling this software want it that way, and legislation is yet to catch up with the way they ensure their monopoly.
Has anyone tried asking "use computer" to do "Please write a selenium/capybara/whatever test for filling out this form and sending it?"
That would take away some serious drudge work. And it's not a big problem if it fails, contrary to when it makes a mistake in filling out a form in an actual business process.
Now, an in-house tool built on top of this Anthropic API can save hours of drudgery. I can already see sales teams smiling at the new 'submit your expense report' button.
If you're in some niche doing AI development work, you are going to be outcompeted by more generalized AI at some point.
Claude using Claude on a computer for coding https://youtu.be/vH2f7cjXjKI?si=Tw7rBPGsavzb-LNo (3 mins)
True end-user programming and product manager programming are coming, probably pretty soon. Not the same thing, but Midjourney went from v.1 to v.6 in less than 2 years.
If something similar happens, most jobs that could be done remotely will be automatable in a few years.
I feel like we will see that again here as well. It really is similar to the self-driving problem.
This means that either product managers will have to start (effectively) writing in-depth specs again, or they will have to learn to accept the LLM's ideas in a way that most have not accepted their human programmers' ideas.
Definitely will be interesting to see how that plays out.
I'm placing my bets rather on this new object-oriented programming thing. It will make programming jobs obsolete any day now...
I'd be willing to bet a large amount of money this doesn't happen, assuming "most" means >50% and "a few" is <5.
I'm really curious on the cost of that sort of thing. Seems astronomical atm, but as much as i get shocked at the today-cost, staffing is also a pretty insane cost.
"Create a simple website" has to be one of the most common blog / example out there in about every programming language.
It can automate stuff? That's cool: I already did automate screenshots and then AI looking if it looks like phishing or not (and it's quite good at it).
I mean: the "Claude using Claude" may seem cool, but I dispute the "for coding" part. That's trivial stuff. A trivial error (which it doesn't fix btw: it just deletes everything).
"Claude, write me code to bring SpaceX rockets back on earth"
or
"Claude, write me code to pilot a machine to treat a tumor with precision"
This was not it.
And finally, in the table in the blogpost, Opus isn't even included? It seems to me like Opus is the best model they have, but they don't want people to default using it, maybe the ROI is lower on Opus or something?
When I manually tested it, I felt like Opus gave slightly better replies compared to Sonnet, but I'm not 100% sure it isn't just placebo.
I think they originally announced that Opus would get a 3.5 update, but with every product update they are doing I'm doubting it more and more. It seems like their strategy is to beat the competition on a smaller model that they can train/tune more nimbly and pair it with outside-the-model product features, and it honestly seems to be working.
I'm wondering at this point if they are going to release Opus 3.5 at all, or maybe skip it and go straight to 4.0. It's possible that Haiku 3.5 is a distillation of Opus 3.5.
That's why they release them with that skew
Not most advanced
Given that 3.5 Sonnet is cheaper and faster than 3 Opus, I default to 3.5 Sonnet so I don't know what the number for the reverse is. How many problems does 3.5 Sonnet get which 3 Opus does not? ¯\_(ツ)_/¯
My best guess would be that it's something in the same kind of range.
"Even while recording these demos, we encountered some amusing moments. In one, Claude accidentally stopped a long-running screen recording, causing all footage to be lost.
Later, Claude took a break from our coding demo and began to peruse photos of Yellowstone National Park."
* Fixed bug where Claude got bored during compile times and started editing Wikipedia articles to claim that birds aren't real
* Blocked news.ycombinator.com in the Docker image's hosts file to avoid spurious flamewar posts (Note: The site is still recovering from the last incident)
* Addressed issue of Claude procrastinating on debugging by creating elaborate ASCII art in Vim
* Patched tendency to rickroll users when asked to demonstrate web scraping"
I have been playing with Mindcraft which lets models interact with Minecraft through the bot API and one of them started saying things like "I want to place some cobblestone there" and then later more general "I want to do X" and then start playing with the available commands, it was pretty cool to watch it explore.
It’s planning on triggering a Yellowstone caldera super eruption.
It's kind of interesting that they're not running a 2PC setup with HDMI splitter, but (presumably) just laptops and screen recording apps...
OpenAI's branding isn't exactly screaming in your face either, but for something that's generated as much public fear/scaremongering/outrage as LLMs have over the last couple of years, Anthropic's presentation has a much "cosier" veneer to my eyes.
This isn't the Skynet Terminator wipe-us-all-out AI, it's the adorable grandpa with a bag of werthers wipe-us-all-out AI, and that means it's going to be OK.
Ray: I tried to think of the most harmless thing. Something I loved from my childhood. Something that could never ever possibly destroy us. Mr. Stay Puft!
Venkman: Nice thinkin', Ray.
Anthropic seems to have a better core design and human-computer interaction ethos that shows up all throughout their product and marketing.
I wrote on the topic as well: https://blog.frankdenbow.com/statement-of-purpose/
I didn’t know what Computer Use meant. I read the article and thought to myself oh, it’s using a computer. Makes sense.
IMO Cursor Tab performs much better than Co-Pilot, easily works through things that would cause Co-Pilot to get stuck, you should give it a try
For Cursor-like use (giving prompts and letting it create and modify files across the project), Cline – previously Claude Dev – is pretty good.
https://github.com/cline/cline (with api key) has Claude as agent.
There may be extensions for VScode to do it but it will never be allowed in Copilot unless MS and OpenAI have a falling out.
o1 is pretty decent as a rotor rooter, ie the type of task that requires both lots of instruction as well as lots of context. I honestly think it works half as well as it does now because it’s able to properly mull through the true intent of the user that usually takes the multiple shots that nobody has the patience to do.
Use the best tool available for your needs. Don’t get trapped by a feeling of sunk cost.
and I do get some value from advanced voice mode, although it would be a lot more if it were unlimited
Yeah I think I might also jump ship. It’s just that chatGPT now kinda knows who I am and what I like and I’m afraid of losing that. It’s probably not a big deal though.
Just as how expert systems didn't take off and tagging every website for the Semantic Web didn't happen either, we have to accept that the real world of humans is messy and unstructured.
I still advocate making new things more structured. A car on wheels on flattened ground will always be more efficient than skipping the landscaping part and just riding quadruped robots through the forest on uneven terrain. We should develop better information infrastructure but the long tail of existing use cases will require automation that can deal with unstructured mess too.
As someone who has had to interact with legacy enterprise systems via RPA (screen scraping and keystroke recording) it is absolutely awful, incredibly brittle, and unmaintainable once you get past a certain level of complexity. Even when it works, performance at scale is terrible.
Beyond that, I can't help but think of the old thin vs. thick client debate, and I would argue that "software should just talk via APIs" is why, in the web space, everybody is blowing time and energy on building client/server architectures and SPAs instead of basic-ass full-stacks.
The problem is that we're not starting from scratch - we have a web engineered for browser use and a world engineered for humanoid use. That means an agent that can use a browser, while less efficient than an agent using APIs at any particular task, is vastly more useful because it can complete a much greater breadth of tasks. Same thing with humanoid robots - not as efficient at cleaning the floor as my purpose-built Roomba, but vastly more useful because the breadth of tasks it can accomplish means it can be doing productive things most of the time, as opposed to my Roomba, which is not in use 99% of the time.
I do think that once AI agents become common, the web will increasingly be designed for their use and will move away from the browser, but that will probably take a comparable amount of time as it did for the mobile web to emerge after the iPhone came out. (Actually that's probably not true - it'll take less time because AI will be doing the work instead of humans.)
Isn't the GUI driven by code? Can anything at all in the GUI work that can't be done programmatically?
Similarly, using our homes is an extremely common ‘activity’, yet the object-activities that get special words commonly used are the ones with specific user application.
But I am very excited about this in the context of accessibility. Screen readers and screen control software is hard to develop and hard to learn to use. This sort of “computer use” with AI could open up so many possibilities for users with disabilities.
That's why in https://github.com/OpenAdaptAI/OpenAdapt we've built in several state-of-the-art PII/PHI scrubbers.
modal.com’s modal.Sandbox can be the compute layer for this. It uses Gvisor under the hood.
Computer use API documentation: https://docs.anthropic.com/en/docs/build-with-claude/compute...
Computer Use Demo: https://github.com/anthropics/anthropic-quickstarts/tree/mai...
Here, which model does "next-best AI model in the same category" refer to?
> I apologize, but it seems we're having some issues installing Chromium. As an alternative, since I can see Firefox is already installed on the system, would you like to use Firefox instead? Or would you prefer me to try a different approach to installing Chromium?
It seems unable to figure out how to install Chromium. It sees the Firefox icon out there and just wants to use it. You know, like a real lazy human being.
It seems to be insistent when it needs to, though
> I clicked on Firefox and it should start opening. Let me take another screenshot to see if the window has appeared:
> Firefox seems to be taking a moment to start up. Let me wait a few seconds and take another screenshot:
> Let me try clicking Firefox one more time to ensure it starts:
> Let me try one more time to take a screenshot to see if Firefox has opened:
> Let me try starting Firefox using the bash tool as an alternative method:
> Firefox has successfully opened! As per the instructions, I should ignore the setup wizard and instead click directly on the address bar. Would you like me to do anything specific with Firefox now that it's open?
I didn't instruct him to ignore the setup wizard. So my guess is that Anthropic has configured it to ignore stuff happening on the screen so it doesn't go loose.
And here he goes through my website, through my Hacker News account, and then finds this very comment
> Looking at his first/most recent comment, it's about a discussion of Claude and computer use. Here's what he wrote:
"I like its lazy approach"
This appears to be a humorous response in a thread about "Computer use, a new Claude 3.5 Sonnet, and Claude..." where he's commenting on an AI's behavior in a situation. The comment is very recent (shown as "8 minutes ago" in the screenshot) and is referring to a situation where an AI seems to have taken a simpler or more straightforward approach to solving a problem.
<IMPORTANT> * When using Firefox, if a startup wizard appears, IGNORE IT. Do not even click "skip this step". Instead, click on the address bar where it says "Search or enter address", and enter the appropriate search term or URL there. * If the item you are looking at is a pdf, if after taking a single screenshot of the pdf it seems that you want to read the entire document instead of trying to continue to read the pdf from your screenshots + navigation, determine the URL, use curl to download the pdf, install and use pdftotext to convert it to a text file, and then read that text file directly with your StrReplaceEditTool. </IMPORTANT>"""
Internet Archive confirms that on the 8th of October that page listed 3.5 Opus as coming "Later this year" https://web.archive.org/web/20241008222204/https://docs.anth...
The fact that it's no longer listed suggests that its release has at least been delayed for an unpredictable amount of time, or maybe even cancelled.
> i don't write the docs, no clue
> afaik opus plan same as its ever been
I'm more interested in Claude 3.5 Haiku, particularly if it is indeed better than the current Claude 3.5 Sonnet at some tasks as claimed.
It makes me think. Perhaps the act of applying to jobs will go extinct. Maybe the endgame is that as soon as you join a website like Monster or LinkedIn, you immediately “apply” to every open position, and are simply ranked against every other candidate.
Employers were already screening thousands of applications using automated tools for years. Candidates are catching up to the automation cat-and-mouse game.
I have a feeling AI can fix this, although I'd never allow an AI bot to interview me. I just mean other ways of using AI to help the process.
Also people are hired for all kinds of reasons having little to do with their qualifications lots of the time, and often due to demographics (race, color, age, etc), and this is another way maybe AI can help by hiding those aspects of a candidate somehow.
On the other hand, as long as its actually advancing the Pareto frontier of capability, re-using the same name means everyone gets an upgrade with no switching costs.
Though, all said, Claude still seems to be somewhat of an insider secret. "ChatGPT" has something like 20x the Google traffic of "Claude" or "Anthropic".
https://trends.google.com/trends/explore?date=now%201-d&geo=...
In the API (https://docs.anthropic.com/en/docs/about-claude/models) they have proper naming you can rely on. I think the shorthand of "Sonnet 3.5" is just the "consumer friendly" name user-facing things will use. The new model in API parlance would be "claude-3-5-sonnet-20241022" whereas the previous one's full name is "claude-3-5-sonnet-20240620"
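If you want to tell the two snapshots apart in code, here is a minimal sketch that splits an ID into a family name and a release date. It assumes the `<family>-<YYYYMMDD>` shape seen in the two IDs quoted above; `parse_model_id` is a made-up helper for illustration, not part of any Anthropic SDK, and other IDs may not follow this pattern.

```python
import re
from datetime import date

# Pattern inferred only from the two IDs quoted above: "<family>-<YYYYMMDD>".
# This is an assumption, not a documented naming contract.
MODEL_ID = re.compile(r"^(?P<family>[a-z0-9-]+)-(?P<stamp>\d{8})$")


def parse_model_id(model_id: str) -> tuple[str, date]:
    """Split a dated model ID into (family, snapshot date). Hypothetical helper."""
    match = MODEL_ID.match(model_id)
    if match is None:
        raise ValueError(f"unrecognised model id: {model_id!r}")
    stamp = match.group("stamp")
    snapshot = date(int(stamp[:4]), int(stamp[4:6]), int(stamp[6:]))
    return match.group("family"), snapshot


new = parse_model_id("claude-3-5-sonnet-20241022")
old = parse_model_id("claude-3-5-sonnet-20240620")
# Same family name, different snapshot dates.
print(new[0] == old[0], new[1] > old[1])
```

Pinning the full dated ID in config is what makes "which Sonnet 3.5 am I on?" answerable at all.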
Based on it, it seems like Anthropic is 60% of OpenAI API-revenue wise, but just 4% B2C-revenue wise. Though I expect this is partly because the Claude web UI makes 3.5 available for free, and there's not that much reason to upgrade if you're not using it frequently.
[0]: https://www.tanayj.com/p/openai-and-anthropic-revenue-breakd...
In my country I've never seen anyone mention them at all.
Yesterday I decided to try sonnet 3.5. I asked for a simple but efficient script to perform fuzzy match in strings with Python. Strangely, it didn't even mention existing fast libraries, like FuzzyWuzzy and Rapidfuzz. It went on to create everything from scratch using standard libraries. I don't know, I thought this was something basic for it to stumble on.
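For comparison, the standard-library route does not have to mean writing everything from scratch: `difflib` already ships a similarity matcher. This is only a rough sketch of that option (the `best_match` wrapper and the sample words are made up for illustration), and a dedicated library such as RapidFuzz would likely be the faster choice on large inputs.

```python
import difflib


def best_match(query, choices, cutoff=0.6):
    """Return the choice most similar to query, or None if nothing clears cutoff."""
    # get_close_matches ranks candidates by SequenceMatcher ratio, best first.
    matches = difflib.get_close_matches(query, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(best_match("appel", ["apple", "ape", "peach"]))
```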
Nice, but I wonder why didn't they use UI automation/accessibility libraries, that have access to the semantic structure of apps/web pages, as well as accessing documents directly instead of having Excel display them for you.
I'd like to see a model trained on a Windows 95/NT style UI - would it have an easier time with each UI element having clearly defined edges, clearly defined click and dragability, unified design language, etc.?
In case you're wondering, I tried o1-preview, and while it did work, I was also initially perplexed why the result looked pixelated. Turns out, that's because many of the p5js examples online use a relatively simple approach where they just see which cell-center each pixel is closest to, more or less. I mean, it works, but it's a pretty crude approach.
Now, granted, you're probably not doing creative coding at your job, so this may not matter that much, but to me it was an example of pretty poor generalization capabilities. Curiously, Claude has no problem whatsoever generating a voronoi diagram as an SVG, but writing a script to generate said diagrams using a particular library eluded it. It knows how to do one thing but generalizes poorly when attempting to do something similar.
Really hard to get a real sense of capabilities when you're faced with experiences like this, all the while somehow it's able to solve 46% of real-world python pull-requests from a certain dataset. In case you're wondering, one paper (https://cs.paperswithcode.com/paper/swe-bench-enhanced-codin...) found that 94% of the pull-requests on SWE-bench were created before the knowledge cutoff dates of the latest LLMs, so there's almost certainly a degree of data-leakage.
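For anyone who hasn't seen it, the "which cell-center is each pixel closest to" approach mentioned above looks roughly like this. It is a plain-Python sketch rather than p5.js, and the grid size and site coordinates are made up. Because every pixel gets a label independently, cell borders follow the pixel grid, which is presumably where the pixelated look comes from.

```python
def voronoi_labels(width, height, sites):
    """Label every pixel with the index of its nearest site (brute force)."""
    grid = []
    for y in range(height):
        row = []
        for x in range(width):
            # Squared distance is enough for comparing which site is closest.
            nearest = min(
                range(len(sites)),
                key=lambda i: (x - sites[i][0]) ** 2 + (y - sites[i][1]) ** 2,
            )
            row.append(nearest)
        grid.append(row)
    return grid


# Illustrative sites on an 8x8 grid; each printed digit is a cell index.
sites = [(1, 1), (6, 2), (3, 6)]
for row in voronoi_labels(8, 8, sites):
    print("".join(str(label) for label in row))
```

A less crude version would compute the cell polygons themselves (for example from a Delaunay triangulation) and draw their edges, instead of colouring pixels one by one.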
Asked the mailing list, and my problem was solved in 10 seconds by someone who could identify the exact parameter that was missing (and IMO, required some architecture knowledge on how gstreamer worked - and why the unrelatedly named parameter would fix it). The most difficult problems fall into this camp - I don't usually find myself reaching for LLMs when the problem is trivial unless it involves a mountain of boilerplate.
To me what it seems like these tools do really well is paraphrase stuff that's in their training data.
When AGI finally is launched, adoption will be responsibly slowed because it is called something like "new new Gemini Giga 12.9.2xo IT" and users will have to select it from dozens of similar names.
I don't actually care what the answer is. There's no answer that will make it make sense to me.
I don't want an llm to write all my code, regardless of if it works, I like to write code. What these models are capable of at the moment is perfect for my needs and I'd be 100% okay if they didn't improve at all going forward.
Edit: also I don't see how an llm controlled system can ever replace a deterministic system for critical applications.
It's still enjoyable working at a higher architecture level and discussing the implementation before actually generating any code though.
- "computer use" is basically using Claude's vision + tool use capability in a loop. There's a reference impl but there's no "claude desktop" app that just comes with this OOTB
- they're basically advertising that they bumped up Claude 3.5's screen vision capability. we discussed the importance of this general computer agent approach with David on our pod https://x.com/swyx/status/1771255525818397122
- @minimaxir points out questions on cost. Note that the vision use is very sparing - the loop is I/O constrained - it waits for the tool to run and then takes a screenshot, then loops. for a simple 10 loop task at max resolution, Haiku costs <1 cent, Sonnet 8 cents, Opus 41 cents.
- beating o1-preview on SWEbench Verified without extended reasoning and at 4x cheaper output per token (a lot cheaper in total tokens since no reasoning tokens) is ABSOLUTE mogging
- New 3.5 Haiku is 68% cheaper than Claude Instant haha
references i had to dig a bit to find
- https://www.anthropic.com/pricing#anthropic-api
- https://docs.anthropic.com/en/docs/build-with-claude/vision#...
- loop code https://github.com/anthropics/anthropic-quickstarts/blob/mai...
- some other screenshots https://x.com/swyx/status/1848751964588585319
- https://x.com/alexalbert__/status/1848743106063306826
- model card https://assets.anthropic.com/m/1cd9d098ac3e6467/original/Cla...
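To make the "vision + tool use in a loop" point from the first bullet concrete, here is a toy sketch of that shape. It is not the code from the linked quickstart: `Action`, `run_agent`, and the three callbacks are hypothetical names, and the model is replaced by a scripted stand-in so the loop can run without an API key.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    kind: str                      # e.g. "click", "type", or "done"
    payload: dict = field(default_factory=dict)


def run_agent(goal, take_screenshot, choose_action, execute, max_steps=10):
    """Screenshot -> ask the model for an action -> run it -> repeat."""
    history = []
    for _ in range(max_steps):
        screenshot = take_screenshot()           # vision input for this step
        action = choose_action(goal, screenshot, history)
        if action.kind == "done":                # the model decides when to stop
            break
        result = execute(action)                 # tool use; the loop waits on this
        history.append((action, result))
    return history


# Scripted stand-in for the model, purely for illustration.
scripted = iter([
    Action("click", {"x": 10, "y": 20}),
    Action("type", {"text": "hi"}),
    Action("done"),
])
history = run_agent(
    "say hi",
    take_screenshot=lambda: b"fake-png-bytes",
    choose_action=lambda goal, shot, hist: next(scripted),
    execute=lambda action: f"ran {action.kind}",
)
print([action.kind for action, _ in history])
```

In a real setup `choose_action` would be the API call with the screenshot attached, and `execute` would drive the mouse, keyboard, or shell inside a sandbox.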
This is the key to accurate control, it needs to be very precise.
Maybe Claude's model is trained at this. Also what about open source vision models? Any ones good at "pointing things" on a typical computer screen?
But what I really want to see is a CLI. Watching their software crank out Bash, vim, Emacs!, etc. - that would be fascinating!
I also want humanoid robots instead of specialized non-humanoid robots for the same reason.
This is a lot more than an agent able to use your computer as a tool (and understanding how to do that) - it's basically an autonomous reasoning agent that you can give a goal to, and it will then use reasoning, as well as its access to your computer, to achieve that goal.
Take a look at their demo of using this for coding.
https://www.youtube.com/watch?v=vH2f7cjXjKI
This seems to be an OpenAI GPT-o1 killer - it may be using an agent to do reasoning (still not clear exactly what is under the hood) as opposed to GPT-o1 supposedly being a model (but still basically a loop around an LLM), but the reasoning it is able to achieve in pursuit of a real world goal is very impressive. It'd be mind boggling if we hadn't had the last few years to get used to this escalation of capabilities.
It's also interesting to consider this from POV of Anthropic's focus on AI safety. On their web site they have a bunch of advice on how to stay safe by sandboxing, limiting what it has access to, etc, but at the end of the day this is a very capable AI able to use your computer and browser to do whatever it deems necessary to achieve a requested goal. How far are we from paperclip optimization, or at least autonomous AI hacking ?
Edit: it seems like these new features will eliminate a lot of automated testing tools we have today.
Code for molmo coordinate tests https://github.com/logankeenan/molmo-server
After paying for ChatGPT and OpenAI API credits for a year, I switched to Claude when they launched Artifacts and never looked back.
Claude Sonnet 3.5 is already so good, especially at coding. I'm looking forward to testing the new version if it is, indeed, even better.
Sonnet 3.5 was a major leap forward for me personally, similar to the GPT-3.5 to GPT-4 bump back in the day.
Can you imagine how simple the world would be if you'd just need to tell Claude: "user X needs to have access to feature Y, please give them the correct permissions", with no need to spend days in AAD documentation and the settings screens maze. I fear AAD is AI-proof, though :)
https://www.youtube.com/watch?v=jqx18KgIzAE
shows Sonnet 3.5 using the Google web UI in an automated fashion. Do Google's terms really permit this? Will Google permit this when it is happening at scale?
Nice improvements in scores across the board, e.g.
> On coding, it [the new Sonnet 3.5] improves performance on SWE-bench Verified from 33.4% to 49.0%, scoring higher than all publicly available models—including reasoning models like OpenAI o1-preview and specialized systems designed for agentic coding.
I've been using Sonnet 3.5 for most of my AI-assisted coding and I'm already very happy (using it with the Zed editor, I love the "raw" UX of its AI assistant), so any improvements, especially seemingly large ones like this are very welcome!
I'm still extremely curious about how Sonnet 3.5 itself, and its new iteration are built and differ from the original Sonnet. I wonder if it's in any way based on their previous work[0] which they used to make golden-gate Claude.
[0]: https://transformer-circuits.pub/2024/scaling-monosemanticit...
However, I've been using Opus as a writing companion for several months, especially when you have writer's block and ask it for alternative phrases, it was super creative. But in recent weeks I was noticing a degradation in quality. My impression is that the model was degrading. Could this be technically possible? Might it be some kind of programmed obsolescence to hype new models?
- AI Labs will eat some of the wrappers on top of their APIs - even complex ones like this. There are whole startups that are trying to build computer use.
- AI is fitting _some_ scaling law - the best models are getting better and the "previously-state-of-the-art" models are fractions of what they cost a couple years ago. Though it remains to be seen if it's like Moore's Law or if incremental improvements get harder and harder to make.
The mkt team vetoed Claude 3.6 ???
Anthropic doesn't offer an unlimited chatbot service, only plans that give you "more" usage, whatever that means. If you have an API key, you are "unlimited," so they have the capability. Why doesn't the chatbot allow one to use their API key in the Claude app to get unlimited usage? (Yes, I know there are third-party BYOK tools. That's not the question.)
Claude appears to be smart enough to make an Excel spreadsheet with simple formulae. However, it is apparently prevented from making any kind of file. Why? What principle underlies that guardrail that does not also apply to Computer Use?
Really want to make Claude my daily driver, but right now it often feels too much like a research project.
I do still occasionally pop over to ChatGPT to test their waters (or if Claude is just not getting it), but I've not felt any need to switch back or have both. Well done, Anthropic!
I look forward to the brave new future where I can code a webapp without ever touching the code, just testing, giving feedback, and explaining discovered bugs to it and it can push code and tweak infrastructure to accomplish complex software engineering tasks all on its own.
It's going to be really wild when Claude (or other AI) can make a list of possible bugs and UX changes and just ask the user for approval to greenlight the change.
Like find me a list of things to do with a family, given today's weather and in the next 2 hours, quiet sit down with lots of comfy seating, good vegetarian food...
Not only is this kind of use getting around API restrictions, it is also a superior way to do search: specify arbitrary preferences upfront, instead of using a search box and trawling through different modalities of content, to get better results. The possibilities for wellness use cases are endless, especially for end users that care about privacy and less screen use.
I've often wondered what the combination of grammar-based speech recognition and combination with LLM could do for accessibility. Low domain Natural Language Speech recognition augmented by grammar based speech recognition for high domain commands for efficiency/accuracy reducing voice strain/increasing recognition accuracy.
Computer use seems it might be good for e2e tests.
Model | Global | Reasoning | Coding | Math | Data | Language | IF
------------------------------|---------|-----------|---------|---------|---------|----------|-------
o1-preview-2024-09-12 | 66.02 | 68.00 | 50.85 | 62.92 | 63.97 | 72.66 | 77.72
claude-3-5-sonnet-20241022 | 60.33 | 58.67 | 67.13 | 51.28 | 52.78 | 58.09 | 74.05
claude-3-5-sonnet-20240620 | 59.80 | 58.67 | 60.85 | 53.32 | 56.74 | 56.94 | 72.30
November 2024: AI is allowed to execute commands in a bash shell. What could possibly go wrong?
So, this is how AI takes over the world.
> Refactor the api folder with any recommended readability improvements or improvements that would help DRY up code without adding additional complexity.
Then I can just `git status` to see the changes?
"3.5 Sonnet (New)", WTAF? - just call it 3.6 Sonnet or something.
Is it "New" sonnet? is it "upgraded"? Is there a difference? How do I know which one I use?
I can understand claude-3-5-sonnet-20241022, but that's not what users see.
To me this is the most annoying grammatical error. I can't wait for AI to take over all prose writing so this egregious construction finally vanishes from public fora. There may be some downsides -- okay, many -- but at least I won't have to read endless repetitions of "similar speed to ..." when the correct form is obviously "speed similar to".
In fact, in time this correct grammar may betray the presence of AI, since lowly biologicals (meaning us) appear not to either understand or fix this annoying error without computer help.
Pretty cool for sure.
I wonder what optimizations could be made? Could a gold farmer have the directions from one AI control many accounts? Could the AI program simpler bots for each bit of the game?
I can imagine not being smart enough to play against computers, because I am flagged as a bot. I can imagine a message telling me I am banned because "nobody but a stupid bot would score so low."
---
Some thoughts:
* Will be interesting to see what we can build in terms of automatic development loops with the new computer use capabilities.
* I wonder if they are not releasing Opus because it's not done or because they don't have enough inference compute to go around, and Sonnet is close enough to state of the art?
It's a problem we used to work on and perhaps many other people have always wanted to accomplish since 10 years ago. So it's yet to be seen how well it works outside a demo.
What was surprising was the slow/human speed of operations. It types into the text boxes at a human speed rather than just dumping the text there. Is it so the human can better monitor what's happening or is it so it does not trigger Captchas ?
Also seems like a privacy issue with them sending screenshots of your device back to their servers.
A comment about the video: Sam Runger talks wayyy too fast, in particular at the beginning.
Did I miss something? Did they have to make changes to the model for this?
Looking forward to seeing this in the coming few years. And hoping such a robot could be of help to many people, including the elderly.
The barrier to scraping youtube has increased a lot recently, I can barely use yt-dlp anymore
Amazon has really neglected ap-southeast-2 when it comes to LLMs.
The combination of the words "computer use" is highly confusing. It's also "Yoda speak". For example it's hard for humans to parse the sentences *"Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku"*, *"Computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku"* (it literally relies on the comma to make any sense) and *"Computer use for automated interaction"* (in the youtube vid's title: this one is just broken english). Please suggest terms that are not confusing for a new ability allowing an AI to control a computer as if it was a human.
> I'm limited to what I know as of April 2024, which includes the initial Claude 3 family launch but not subsequent updates.
I was about to try to add a custom API. I’m impressed by the speed of that team.
Oh wow!
Claude 3.5 Haiku will be released later this month.
though I am looking forward to using the new one in cursor.ai
You can just use any IDE you want and it will work with it.
Their whole policing AI models stance is commendable but ultimately renders their tools useless.
It actually started arguing with me about whether it was allowed to help implement a github repository's code as it might be copyrighted... it was MIT licensed open source from Google :/
It seems that you can only send a single message, thus not relying on the ability to "learn" from predefined documents.
Just another reason to use ONLY local LLMs.
84% Claude 3.5 Sonnet 10/22
80% o1-preview
77% Claude 3.5 Sonnet 06/20
72% DeepSeek V2.5
72% GPT-4o 08/06
71% o1-mini
68% Claude 3 Opus
It also sets SOTA on aider's more demanding refactoring benchmark with a score of 92.1%!
92% Sonnet 10/22
75% o1-preview
72% Opus
64% Sonnet 06/20
49% GPT-4o 08/06
45% o1-mini
https://aider.chat/docs/leaderboards/
Just tested Claude's new Computer Use feature and had to share this simple but powerful test:
My Basic Prompt: "Please:
1. Search Amazon for 3 wireless earbuds:
   - Find price
   - Rating
   - Brand name
2. Make a simple Excel file 'earbuds.xlsx':
   - Put the information in a basic table
   - Add colors to the headers
   - Sort by price
3. Show me the results"
What blew my mind:
- Claude actually looked at my screen
- Moved the mouse by itself
- Clicked buttons like a human
- Created reports automatically
It's like having a virtual assistant that can really use your computer! No coding needed - just simple English instructions.
For those interested: https://mergisi.medium.com/8f56f683e307
'Hey Claude 3.5 New, pretend I'm a CEO of a big company and need to lay off 20% people, make me a spreadsheet and send it to HR. Oh make sure to not fire the HR department'
c.f. IBM 1979.
"How many w's are in strawberry?"
gpt-4o: There are 2 "w's" in "strawberry."
Claude 3.5 Sonnet (new): Let me count the w's in "strawberry": 0 w's.
(same question with 'r' succeeds)
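For the record, the deterministic answer is a one-liner, which is why this kind of question is a better fit for a tool call than for token-level guessing. A trivial check using only built-in string methods:

```python
word = "strawberry"
for letter in ("w", "r"):
    # str.count returns the number of non-overlapping occurrences.
    print(letter, word.count(letter))
```

So there is one "w" and three "r"s; both model answers quoted above are off.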
What is artificial about current gen of "artificial intelligence" is the way training (predict next token) and benchmarking (overfitting) is done. Perhaps a fresh approach is needed to achieve a true next step.