Claude Computer Use – Is Vision the Ultimate API?

(thariq.io)

113 points · trq_ · 1y ago · 90 comments

62 comments · 15 top-level

CharlieDigital · 1y ago · 25 in thread

Vision is the ultimate API.

The historical progression from text to still images to audio to moving images will hold true for AI as well.

Just look at OpenAI's progression as well from LLM to multi-modal to the realtime API.

A co-worker almost 20 years ago said something interesting to me as we were discussing Al Gore's CurrentTV project: the history of information is constrained by "bandwidth". He mentioned how broadcast television went from 72 hours of "bandwidth" (3 channels x 24h) per day to now having so much bandwidth that we could have a channel with citizen journalists. Of course, this was also the same time that YouTube was taking off.

The pattern holds true for AI.

AI is going to create "infinite bandwidth".

swatcoder · 1y ago

> The historical progression from text to still images to audio to moving images will hold true for AI as well.

You'll have to explain what you mean by this. Direct speech, text, illustrations, photos, abstract sounds, music, recordings, videos, circuits, programs, cells... these are all just different mediums with different characteristics. There is no "progression" apparent among them. Why should there be? They each fulfill different ends and have different occasions for which they best suit.

We seem to have discovered a new family of tools that help lossily transform content or intent from one of these mediums to some others, which is sure to be useful in its own ways. But it's not a medium like the above in the first place, and with none of them representing a progression, it certainly doesn't either.

CharlieDigital · 1y ago

> You'll have to explain what you mean by this

The progression of distribution. Printing press, photos, radio, movies, television. The early web was text, then came images, then audio (Napster age), and then video (remember that Netflix used to ship DVDs?).

The flip side of that is production and the ratio of producers to consumers. As the bandwidth for distribution increases, there is a decrease in the cost and complexity for producers and naturally, we see the same progression with producers on each new platform and distribution technology: text, still images, audio, moving images.

rhdunn · 1y ago

I'd argue that multimodal analysis can improve uni/bimodal models.

There is overlap between text to image and text to video -- image would help video animating interesting or complex prompts; video would help image learn how to differentiate features as there are additional clues in terms of how the image changes and remains the same.

There's overlap with audio, text transcripts, and video around learning to animate speech, e.g. by learning how faces move with the corresponding audio/text.

There's overlap with sound and video -- e.g. being able to associate sounds like dog barking without direct labelling of either.

ogogmad · 1y ago

> We seem to have discovered a new family of tools that help lossily transform content or intent from one of these mediums to some others

That's not what LLMs do. More like AI art.

skydhash · 1y ago

It is not. Text is very dense information-wise, it is recursive, and you can formalize it. It is easily coupled with interaction methods and more apt for automation. You can easily see this with software like AutoCAD, which has both. There's a reason all protocols are text.

Vision and audio play a nice role, but that's because of humans and reality. real world <-> vision|audio <-> processing pipeline makes sense. But a processing <-> data <-> vision|audio <-> data <-> processing cycle is just nonsense and a waste of resources.

ricardo81 · 1y ago

> Text is very dense information wise and recursive and you can formalize it.

There's been a lot of attempts over the years and varying degrees of accuracy, but I don't know if you can go as far as to "formalize" it. Beyond the syntax, (tokenising, syntactic chunking and ...beyond) there is the intent, and that is super hard to measure. And possibly, the problem with these prompts is they get things right a lot of the time but wrong say 5% of the time. Purely because they couldn't formalize it. My web hosting has 99.99% uptime which is a bit more reassuring than 95%.
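
The gap between those two percentages is easier to feel as time. A minimal sketch, assuming the figures are read as availability over a non-leap year (the comment itself compares uptime with accuracy only loosely, and the function name here is made up for illustration):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes(availability: float) -> float:
    """Minutes per year spent 'wrong' or 'down' at a given availability (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


four_nines = downtime_minutes(0.9999)                 # under an hour per year
ninety_five = downtime_minutes(0.95) / (24 * 60)      # expressed in days per year

print(f"99.99%: about {four_nines:.1f} minutes per year")
print(f"95%:    about {ninety_five:.2f} days per year")
```

Under that reading, 95% is roughly 500 times more failure time than 99.99%.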

cooper_ganglia · 1y ago

Not a waste of resources, just an increase in use. This is why we need more resources.

ricardo81 · 1y ago

You could call it bandwidth, or call it entropy. I'd lean towards the more physical definition.

I think of how the USA had cable TV and hundreds of channels projecting all kinds of whatever in the 80s while here in the UK we were limited to our finite channels. To be fair those finite channels gave people something to talk about the next day, because millions of people saw the same thing. Surely a lot of what mankind has done is to tame entropy, like steam engines etc.

With AI and everyone having a prompt, it's surely a game changer. How it works out, we'll see.

ToDougie · 1y ago

So long as the spectrum is open for infinity, yes.

Was listening to a Seth Godin interview where he pointed out that there was a time when you had to purchase a slice of spectrum to share your voice on the radio. Nowadays you can put your thoughts on a platform, but that platform is owned by corporations who can and will put their thumb on thoughtcrime or challenges.

I really do love your comment. Cheers.

CharlieDigital · 1y ago

Thanks!

There's a related concept as well, which is that as "bandwidth" increases, the ratio of producers to consumers pushes upwards towards 1. My take is that generative AI will accelerate this.

I write a bit more in depth about it here: https://chrlschn.dev/blog/2024/10/im-a-gen-ai-maximalist-and...

wwweston · 1y ago

The information coursing through the world around us already exceeds our ability to grasp it by high orders of magnitude.

Three channels of television over 8 hours was already more than anyone had time to take in.

AI might be able to create summarizing layers and relays that help manage that.

AI isn't going to create infinite bandwidth. It's as likely to increase entropy and introduce noise.

CharlieDigital · 1y ago

It's not that it makes more channels for all of us, it creates a channel for each of us.

layer8 · 1y ago

Text is still a predominant medium of communication and information processing, and I don’t see that changing substantially. TFA was an article, not a video, and you wouldn’t want the HN comment section to be composed of videos or images. Similarly, video calls haven’t replaced texting.

CharlieDigital · 1y ago

It's not that it will be replaced, but there's a natural progression of the types of media that are available on a given platform of distribution.

RF: text (telegram), audio, still images (fax), moving images

Web had the same progression: text, still images (inverted here), audio age (MP3s, Napster), video (Netflix, YouTube)

AI: text, images, audio (realtime API), ...?

Vision is the obvious next medium.

diffeomorphism · 1y ago

Vision fails the discoverability test: Buttons that don't look like buttons. Fake download buttons. Long press to do something. Swipe in the shape of a hexagon. Grey text on light grey background.

Also: thanks for tuning in, raid shadow legends, many people ask, but how... Anyway, you need these two lines of text (20 minutes YouTube videos could have been a half page of text)

Finally: Huge output of bad quality and very, very limited input capacity. So "infinite bandwidth in" and then horrible traffic jam out.

IAmGraydon · 1y ago

The 'bandwidth' analogy breaks down because it assumes AI's value lies in processing more information, rather than processing information more intelligently. Increasing broadcast bandwidth added linear value, but AI's advancements come from complexity, nuance, and understanding - not just sheer volume of data. 'Infinite bandwidth' doesn't guarantee better insights or decision-making; it may even lead to information overload and decreased relevance.

CharlieDigital · 1y ago

Gen AI's value lies in producing more variants of information. Infinitely many.

If you and I prompt OpenAI to generate an image of a woman holding a candle, we'll get two totally novel instances.

slowmovintarget · 1y ago

> AI is going to create "infinite bandwidth".

For whom?

If you mean infinite outpouring, then yes, but it will drown us in a sea of noise. We've constructed a Chinese Room for the mind. The computer was a bicycle, but this is something different.

Bandwidth is carrying ability, and the current incarnation of "AI" does not increase signal. It takes vastly more resources to produce something close enough, but not quite... it.

CharlieDigital · 1y ago

Just as there are thousands of videos on YouTube for making pancakes, there will one day be infinite videos for making pancakes.

That visual interface will watch as you prep your pancakes and give you tips, suggest a substitute if you are missing an ingredient. Your experience with that recipe will be one of "infinitely" many.

pixl97 · 1y ago

Eh even current AI where it makes a summary of your preferences is ever so slightly increasing signal. I don't expect this ability to diminish over time, but instead increase, which would lead to more personalized signaling.

croes · 1y ago

Vision, especially GUIs, is a pretty limited API.

abirch · 1y ago

It reminded me of these old Unix lessons of Master Foo

https://prirai.github.io/books/unix-koans/#master-foo-discou...

CharlieDigital · 1y ago

I mean vision in the most general sense, not just a GUI.

Imagine OpenAI can not only read the inflection in your voice, but also nuances in your facial expressions and how you're using your hands to understand your state of mind.

And instead of merely responding as an audio stream, a real-time avatar.

robotresearcher · 1y ago

> The pattern holds true for AI. AI is going to create "infinite bandwidth".

I’ve worked in AI for more than 30 years and I have no idea what you mean by this. Can you explain?

CharlieDigital · 1y ago

There is a ratio of producers to consumers in all media.

There's two ways to think about bandwidth. One is the physical capacity. The other is the content that can be produced and distributed.

We once had 3 channels equating to a maximum of 72h of content in a 24h period. Now we have YouTube which is orders of magnitude more content and bandwidth. The constraint now is the ratio of producers to consumers. Some creator had to create the exact content that you want.

What if gen AI can create the exact content and media experience that you want? Effectively pushing the ratio of producers and consumers towards 1 so that every experience is unique? It is effectively as if there was infinite bandwidth to create and distribute content. You are no longer constrained by physical bandwidth and no longer constrained by production bandwidth (actual creators making the content).

You want your AI generated news reel delivered by Walter Cronkite. I want mine delivered by Barbara Walters wearing a fake mustache while standing on one hand on the moon. It is as if there are infinite producers.

I write a bit more on this topic here: https://chrlschn.dev/blog/2024/10/im-a-gen-ai-maximalist-and...

echoangle · 1y ago · 6 in thread

Am I the only one thinking this is an awful way for AI to do useful stuff for you? Why would I train an AI to use a GUI? Wouldn’t it be better to just have the AI learn API docs and use that? I don’t want the AI to open my browser, open google maps and search for Shawarma, I want the AI to call a google api and give me the result.

famouswaffles · 1y ago

The vast majority of Applications cannot be used by anything other than a GUI.

We built computers to be used by humans and humans overwhelmingly operate computers with GUIs. So if you want a machine that can potentially operate computers as well as humans then you're going to have to stick to GUIs.

It's the same reason we're trying to build general purpose robots in a human form factor.

The fact that a car is about as wide as a two horse drawn carriage is also no coincidence. You can't ignore existing infrastructure.

echoangle · 1y ago

But I don't want an AI to "operate a computer"… maybe I'm missing the point of this, but I just can't imagine a use case where this is a good solution. For everything browser-based, the burden of making an API is probably relatively small, and if the page is simple enough, you could maybe even get away with training the AI on the page HTML and generating a response to send. And for everything that's not browser-based, I would either want the AI embedded in the software (image editors, IDEs…) or not there at all.

layer8 · 1y ago

A general-purpose assistant should be able to perform general-purpose operations, meaning the same things people do on their computers, and without having to supply a special-purpose AI-compatible interface for every single function the AI might need to operate. The AI should be able to operate any interface a human can operate.

Workaccount2 · 1y ago

Anthropic is selling a product to people, not software engineers.

bubaumba · 1y ago

How about GUI and API are just interfaces, the rest is the same. They can coexist in the same model or setup. Same functionality can be implemented usually by one or another. In advanced products by both. But model's thinking is probably the same, it operates with concepts, arranged in sort of graph. Just my guess, have no proof and likely it's not always true.

voiper1 · 1y ago

Sure, it's more efficient to have it use an API. And people have been integrating those for the last while.

But there are tons of applications that are locked behind a website or desktop GUI with no API that are accessible via vision.

simonw · 1y ago · 4 in thread

If you want to try out Computer Use (awful name) in a relatively safe environment, the Docker container Anthropic provide here is very easy to start running (provided you have Docker set up; I used it with Docker Desktop for Mac): https://github.com/anthropics/anthropic-quickstarts/tree/mai...

trq_ · OP · 1y ago

Yes that's a good point! To be honest, I felt that I wanted to try it on the machine I used every day, but it's definitely a bit risky. Let me link that in the article.

danielbln · 1y ago

I for one appreciate the name Computer Use: no flashy marketing name, it just describes what it does. LLM using a computer.

croes · 1y ago

Hard to ask questions about Computer Use

swyx · 1y ago

also it contrasts nicely with Tool Use, which is about calling apis rather than clicking on things

PreInternet01 · 1y ago · 4 in thread

Counterpoint: no, it's just more hype.

Doing real-time OCR on 1280x1024 bitmaps has been possible for... the last decade or so? Sure, you can now do it on 4K or 8K bitmaps, but that's just an incremental improvement.

Fact is, full-screen OCR coupled with innovations like "Google" has not led to "ultimate" productivity improvements, and as impressive as OpenAI et al may appear right now, the impact of these technologies will end up roughly similar.

(Which is to say: the landscape will change, but not in a truly fundamental way. What you're seeing demonstrated right now is, roughly speaking, the next Clippy, which, believe it or not, was hyped to a similar extent around the time it was introduced...)

simonw · 1y ago

The way these new LLM vision models work is very different from OCR.

I saw a demo this morning of someone getting Claude to play FreeCiv (admittedly extremely badly): https://twitter.com/greggyb/status/1849198544445432229

Try doing that with Tesseract.

croes · 1y ago

I bet Tesseract plays pretty badly too.

KoolKat23 · 1y ago

Existing OCR is extremely limited and requires custom narrow development.

acchow · 1y ago

"OCR : Computer Use" is as "voice-to-text : ChatGPT Voice"

pabe · 1y ago · 3 in thread

I don't think vision is the ultimate API. It wasn't with "traditional" RPA and it won't be with more advanced AI-RPA. It's inefficient. If you want something to be used by a bot, write an interface for a bot. I'd make an exception for end-to-end testing.

Veen · 1y ago

You're looking at it from a developer's perspective. For non-developers, vision opens up all sorts of new capabilities. And they won't have to rely on the software creator's view of what should be automated and what should not.

skydhash · 1y ago

Most non-developers won't bother. You have Shortcuts on iOS and macOS, which is like Scratch for automation, and still only power users use it. Others just download the shortcut they want.

croes · 1y ago

If a GUI is confusing for humans, AI will have problems too.

So you still rely on developers to make reasonable GUIs.

sharpshadow · 1y ago · 2 in thread

In this context Windows Recall makes total sense now from an AI learning perspective for them.

It's actually a super cool development and I'm very excited already to let my computer use any software like a pro in front of me. Paint me a canvas of a savanna sunset with animal silhouettes, produce me a track of UK garage house, etc. Everything with all the layers and elements in the software, not just a finished output.

croes · 1y ago

Lots of energy consumption just to create a remix of something that already exists.

sharpshadow · 1y ago

Absolutely, we need much, much more energy and many, many more powerful chips. Energy is a resource and we need to harvest more of it.

I don't understand why people make a point about energy consumption as if it were something bad.

unglaublich · 1y ago · 1 in thread

Vision here means "2d pixel space".

The ultimate API is "all the raw data you can acquire from your environment".

layer8 · 1y ago

For a typical GUI, the “mental model” actually needs to be 2.5D, due to stacked windows, popups, menus, modals, and so on. The article mentions that the model has difficulties with those.
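
The "2.5D" point can be illustrated with a tiny hit test: the same pixel can belong to several windows, and only the stacking order decides which one receives a click. This is a minimal sketch with a made-up `Window` type, not any real windowing API, and a flat screenshot does not carry the `z` field at all.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Window:
    """Hypothetical window record; `z` is the stacking order (larger is on top)."""
    title: str
    x: int
    y: int
    width: int
    height: int
    z: int

    def contains(self, px: int, py: int) -> bool:
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def window_at(windows: list[Window], px: int, py: int) -> Optional[Window]:
    """Return the topmost window under the point, or None if nothing is hit."""
    hits = [w for w in windows if w.contains(px, py)]
    return max(hits, key=lambda w: w.z, default=None)


# Example data: a modal dialog stacked on top of an editor window.
desktop = [
    Window("Editor", 0, 0, 800, 600, z=0),
    Window("Save changes?", 300, 200, 200, 100, z=1),
]

target = window_at(desktop, 350, 250)
print(target.title if target else "desktop")
```

A click at (350, 250) lands on the dialog even though the editor also covers that pixel, which is the information a model has to infer when it only sees pixels.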

throwup238 · 1y ago · 1 in thread

Vision plus accessibility metadata is the ultimate API. I see little reason that poorly designed flat UIs are going to confuse LLMs any less than humans, especially when they’re missing from the training data like most internal apps or the documentation on the web is out of date. Even a basic dump of ARIA attributes or the hierarchy from OS accessibility APIs can help a lot.
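
A "basic dump of ARIA attributes" can be very small. The sketch below uses only Python's standard `html.parser`; the `AriaDump` class name and the sample markup are made up for illustration, and a real page would normally be read from the live DOM rather than from static HTML.

```python
from html.parser import HTMLParser


class AriaDump(HTMLParser):
    """Collect (tag, {role / aria-* attributes}) for every element that has any."""

    def __init__(self) -> None:
        super().__init__()
        self.found: list[tuple[str, dict[str, str]]] = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        aria = {name: (value or "") for name, value in attrs
                if name == "role" or name.startswith("aria-")}
        if aria:
            self.found.append((tag, aria))


# Made-up markup: a div styled as a button, plus an element with no metadata.
page = '<div role="button" aria-label="Download report">v</div><span>decorative</span>'

parser = AriaDump()
parser.feed(page)
for tag, aria in parser.found:
    print(tag, aria)
```

The `div` that does not look like a button in pixels is still reported with its role and label, while the unlabeled `span` is skipped.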

dbish · 1y ago

The problem is accessibility data and apis are very bad across the board.

cheevly · 1y ago · 1 in thread

No, language is the ultimate API.

ukuina · 1y ago

On the instruction-provision end, sure.

viraptor · 1y ago

Some time ago I made a prediction that accessibility is the ultimate API for UI agents, but unfortunately multimodal capabilities went the other way. But we can still change course:

This is a great place for people to start caring about accessibility annotations. All serious UI toolkits allow you to tell the computer what's on the screen. This allows things like Windows Automation https://learn.microsoft.com/en-us/windows/win32/winauto/entr... to see a tree of controls with labels and descriptions without any vision/OCR. It can be inspected by apps like FlaUInspect https://github.com/FlaUI/FlaUInspect?tab=readme-ov-file#main... But see how the example shows a statusbar with (Text "UIA3" "")? It could've been (Text "UIA3" "Current automation interface") instead for both a good tooltip and an accessibility label.

Now we can kill two birds with one stone - actually improve the accessibility of everything and make sure custom controls adhere to the framework as well, and provide the same data to the coming automation agents. The text description will be much cheaper than a screenshot to process. Also it will help my work with manually coded app automation, so that's a win-win-win.

As a side effect, it would also solve issues with UI weirdness. Have you ever had windows open something on a screen which is not connected anymore? Or under another window? Or minimised? Screenshots won't give enough information here to progress.
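
To make the "text description is cheaper than a screenshot" idea concrete, here is a minimal sketch that renders a control tree in the same (Kind "name" "description") shape used in the example above. The `Control` type and the tree contents are hypothetical stand-ins, not the output of Windows UI Automation or FlaUI.

```python
from dataclasses import dataclass, field


@dataclass
class Control:
    """One node of a hypothetical accessibility tree (not a real UIA type)."""
    kind: str                 # e.g. "Window", "StatusBar", "Text"
    name: str                 # visible label
    description: str = ""     # accessibility description or tooltip
    children: list["Control"] = field(default_factory=list)


def dump(node: Control, depth: int = 0) -> str:
    """Render the tree as indented (Kind "name" "description") lines."""
    line = f'{"  " * depth}({node.kind} "{node.name}" "{node.description}")'
    return "\n".join([line] + [dump(child, depth + 1) for child in node.children])


# Illustrative tree only; the labels echo the status bar example in the comment.
tree = Control("Window", "FlaUInspect", "", [
    Control("StatusBar", "", "", [
        Control("Text", "UIA3", "Current automation interface"),
    ]),
])

print(dump(tree))
```

A dump like this is a few dozen bytes per control, and it can also carry state such as "minimised" or "off-screen" if the toolkit exposes it, which a screenshot cannot show.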

downWidOutaFite · 1y ago

Vision is a crappy interface for computers but I think it could be a useful weapon against all the extremely "secure" platforms that refuse to give you access to your own data and refuse to interoperate with anything outside their militarized walled gardens.

tomatohs · 1y ago

> It is very helpful to give it things like:

- A list of applications that are open
- Which application has active focus
- What is focused inside the application
- Function calls to specifically navigate those applications, as many as possible

We’ve found the same thing while building the client for testdriver.ai. This info is in every request.
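
One way to picture "this info is in every request" is a small context record attached to each instruction. The sketch below is an assumption for illustration only: the `DesktopContext` fields, the function names, and the JSON shape are invented here and do not describe testdriver.ai's client or Anthropic's API.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DesktopContext:
    """Hypothetical snapshot of desktop state sent alongside each instruction."""
    open_applications: list[str]
    focused_application: str
    focused_element: str
    available_functions: list[str]


def build_request(instruction: str, context: DesktopContext) -> str:
    """Bundle the instruction with the desktop state as a JSON string."""
    return json.dumps({"instruction": instruction, "context": asdict(context)}, indent=2)


# Example values are made up.
context = DesktopContext(
    open_applications=["Terminal", "Browser"],
    focused_application="Browser",
    focused_element="address bar",
    available_functions=["focus_application", "press_key"],
)

request = build_request("Open the quarterly report", context)
print(request)
```

Sending the focus and window list as text means the model does not have to guess them from pixels on every step.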

m3kw9 · 1y ago

No, Vision in this case is a brute force way for the AI to interact with our current world because we designed the interface for human vision. In the future, AI creates the UI and their control will be low level most likely at the model level as even business logic+UI will be generated live.

freediver · 1y ago

And text is the ultimate API to the human brain! ;)

https://www.youtube.com/watch?v=Zctp972y_Eg

throwaway19972 · 1y ago

I'd imagine you'd get higher quality leveraging accessibility integrations.
