UFO: A UI-Focused AI Agent for Windows OS Interaction (opens in new tab)

(github.com)

87 pointsDreamGen2y ago62 comments

62 comments

54 comments · 23 top-level

smashah2y ago· 5 in thread

If anyone else did this then some useless lawyer would be activated from their slumber to send out OSS devs cease & desists based on supposed violations of sacred terms and conditions.

This project is the future that we were promised but it is under threat by supposed legal challenges.

What if the demo asked the email to be sent as a message via whatsapp desktop instead? That would (according to their anti-freedom lawyers) constitute an offense worthy of legal threats.

The tech industry needs to reckon with ToS trolls before it's too late.

klabb32y ago

Rules for thee not for me. Scraping for instance is bad only when other people do it.

Technically speaking though, user facing websites are built with quite weak UI accessibility in mind. Partly to prevent others bots from using it. I worry less about the LLM/AI particularities, and more about adding yet another comprehensive layer to the stack. It’d not exactly be standing on solid foundation.

During the 2010s Web 2.0 there was a brief moment where open APIs and such were trending, which was a bit of rejuvenation of interoperability across companies, domains, applications etc. To simplify we can call it cross-app interactions, like auto hotkey, Automator, etc. An LLM controlled broker falls in that domain as well. But now that “closed binary fuck you” model came back industry wide, it’s almost impossible to build such things. (It’s hard enough to build integrations when people are cooperating. When they’re actively adversarial, things generally break quickly, if they work at all)

smashah2y ago

The promises of web 2.0 were stolen from us.

Adversarial interop should be a digital human right.

Put an API layer in front of UFO and all of a sudden we're one step closer to the unshittification of our digital lives.

1 more reply

dartos2y ago

I’ve worked on many sites over the past 10 years.

Never once was the conscious decision made to not be accessible to prevent bots from scraping.

1 more reply

tiborsaas2y ago

I don't understand why anything in the demo would count as T&C violation. Who would be offended on what basis? If I can automate my desktop with a preprogrammed script, then this is not different.

smashah2y ago

As I said, if the demo opened Whatsapp and sent a message then their lawyers would harass the OSS developer of the tool with C&Ds and they'd consider it a breach of their ToS and block your account.

This is not hyperbole, that's what they will do based on precedent. And it doesn't matter about the validity of their claims because the calculation as a victim to these legal threats is that this $800bn megacorp is able to ruin your life for what amounts to less than pocket change and these big law firms are incentivized to come after you.

Adversarial interop should be an inalienable digital human right. This way companies will be forced to give API access or risk interoperability being legally scraped against their will.

tflol2y ago· 5 in thread

have they fixed teams yet?

that app that like the entire united states uses for pc work every day?

i still cant copy paste a code block, or copy paste literally anything. i think microsoft should use AI to learn how to code code blocks in chat or they should ask chatgpt how to use the clipboard of their own OS

TeMPOraL2y ago

Yes of course. It works perfectly for what it was meant to be: good enough so people won't switch away from MS ecosystem to have a group chat. Nothing more.

Teams is actively developed along the lines that deliver greatest value to Microsoft at the expense of their customers (a business model increasingly popular these days):

1) implementing corporate-nerfed versions of vanity features introduced by competitors in group chat space (Slack, Discord);

2) broader integration with everything else in Microsoft's corporate ecosystem.

You can be excused for thinking Teams is just a crappy chat-based interface to SharePoint, because this is what it effectively is (Don't have SharePoint? Sucks to be you.).

Copy-paste? What are you? A corporate smartass? There's no budget left for smart-ass features - it's all in lock-in features, where the RoI is much greater.

zhivota2y ago

Copy paste? What are you trying to do, exfiltrate company secrets? Security turned off that feature to protect us all.

userbinator2y ago

MS Teams is exactly what I'd expect from an AI-generated app --- barely working, lots of little bugs and general oddness, and overall a mediocre experience.

dartos2y ago

Isn’t teams quite an old product?

nxobject2y ago

Every day, I thank myself for working in the one municipality that went in early and hard on GSuite.

ukuina2y ago· 3 in thread

Note this is from Microsoft!

They are on a roll lately, and seem to have beaten OpenAI to GPT-Agents with this release.

dbish2y ago

Hard to "beat" someone when you have a very close partnership and you're using their model (GPT-4V), likely this is developed quickly with input from OpenAI. I don't think it's right to think of the two as competing at this point.

ukuina2y ago

If I understand correctly, there is zero overlap between Microsoft AI Research efforts that use GPT4(V) and OpenAI's development.

OpenAI gave Microsoft the model weights, and Microsoft hosts it on Azure for MS Research. None of the Azure usage analytics goes back to OpenAI except via bug reports and publications.

1 more reply

capybara_20202y ago

Just my experience. Tried Copilot(the one for Office) but was left disappointed.

It said it could not do say add an image to a ppt/create one. When it clearly could. Chat was was overly simplified and fixed the wrong problem. Randomly changed say Excel sheet with no clear undo after multiple steps Plus how it works is hidden, so not sure if they are using 3.5, 4 or something else. So no idea if that is causing the problem.

MS is putting out a lot of things quickly. But the quality is just not there in my experience. They are doing way too many things too fast to make any one thing good.

dartharva2y ago· 3 in thread

Does it support other visual-input-accepting language models? GPT-V is paywalled.

Deverauxi2y ago

The code isn’t very complicated. You could edit in any vision model you want:

https://github.com/microsoft/UFO/blob/main/ufo/llm/llm_call....

dartharva2y ago

I don't have the resources currently but would love to test something like LLaVA out for this, but I won't keep my hopes up.

dartharva2y ago

Update: no, it doesn't as of now.

cuckatoo2y ago· 3 in thread

They can't migrate control panel to the settings app in a time measured in decades, but I'm supposed to believe that they're going to magically produce something like this that does what it claims? Go home Microsoft. You're drunk.

austinwade2y ago

The engineers who built this UFO project probably would have been able to move Control Panel to settings. Microsoft is a big company.

batch122y ago

It sounds like the software didn't work out when you tried it. What issues did you have? Or were you using this chance to dunk on M$ without actually giving it a shot? I think Microsoft has its issues, but this response was boring.

MaxikCZ2y ago

I like seeing the context, imo it is important.

1 more reply

gareth_untether2y ago· 2 in thread

The following section from the readme stands out:

The GPT-V accepts screenshots of your desktop and application GUI as input. Please ensure that no sensitive or confidential information is visible or captured during the execution process.

userbinator2y ago

That sounds like "don't use this" to everyone I know; or is this meant to be a disclaimer and at the same time normalise the lack of privacy?

ukuina2y ago

Doesn't Azure OpenAI support resolve the privacy concerns?

2 more replies

DonHopkins2y ago· 2 in thread

I had ChatGPT analyze some Factorio screen dumps, which was interesting:

https://docs.google.com/document/d/14p_iPhIKjDoTGa2Zr_5gPV9_...

g3ol4d02y ago

is it somewhat right? I don't know much about factorio

nicklecompte2y ago

Looking at it myself it's somewhat right, but it's an enormous pile of useless LLM blather - the response to "analyze the factory layout and activity" has nothing actionable, and most comments are generic ("well-organized and appears to be optimized for the efficient production of science packs"). Likewise the "analyze and optimize" requests don't seem useful enough for players to do anything with, though a lot of that is the prompt. As with most LLM stuff it just seems like someone threw a bunch of Factorio forum/Reddit posts in a blender and filtered out the stuff that was blatantly wrong.

Some specific errors I did notice:

- In the section with a processor unit subfactory, "The specific items being manufactured are not directly visible because the UI for the assembling machines is not expanded to show their recipes" is false, the player has pressed 'Alt' and the items being manufactured are shown. So this part of the response is plain wrong.

- " There are four distinct colors of science packs visible on the conveyor belts: red, green, blue, and purple, corresponding to the various levels of research complexity in the game." Only red/green/blue are shown. ChatGPT didn't make additional references to purple science but it was odd that it mentioned them at all.

- "In this Factorio image, we see a railway intersection that includes train signaling and a train crash" this is a deadlock, nothing has actually crashed. This is a minor nitpick but ChatGPT repeats the error throughout the analysis. And it also suggests that ChatGPT might not fully grok the "race condition" side of train scheduling, since deadlocks occur specifically because you're trying to avoid collisions.

I didn't want to spend too much brain calories reading all 20 pages in depth :) My general conclusion is that it's not useful enough for GPT-assisted Factorio play, and too flaky for any sort of automation of trivial Factorio tasks. I think it's plausible to make a FactorioGPT, but I doubt OpenAI's pretraining and RLHF resources covered this specific niche.

ThinkBeat2y ago· 2 in thread

Is this a way to automate /(create a UI macro) of windows applications? Sort of like AutoHotKey but with a nicer developing experience? Reading the page and watching the demo and I am still a bit confused about what it is.

dartharva2y ago

The paper (https://arxiv.org/pdf/2402.07939.pdf) says it is a universal tool to automate all kinds of work on Windows apps, like generating an email in Outlook or building slides based on some documents on Powerpoint. Cross-app working seems to be bad for now.

On a side note, it looks like this thing can be a terrific cheating tool in Strategy video games..

ametrau2y ago

This is a paper?

hyperhello2y ago· 1 in thread

So they built an AI that can use the windows environment. Maybe they could make a simple graphical shell to control the AI.

65102y ago

That sounds both practical and hilarious. Imagine wrapping the whole OS in a pdf file.

yonatan80702y ago· 1 in thread

With all the emoji in the README, this looks like a JS framework

Piisamirotta2y ago

That is super annoying. Or maybe I'm becoming old.

_sword2y ago· 1 in thread

Like a Windows-focused copy of the the Self Operating Computer Framework. There's clearly something here with this approach if it's popping up in multiple open source frameworks

anonzzzies2y ago

But they don’t work at all. They will get better but they all are pretty much unworkable for the moment.

nxobject2y ago· 1 in thread

If this idea's in the air, I wouldn't be surprised if Apple's working on a similar concept, but with an accessibility or a "Siri remake" bent to it.

baq2y ago

If Apple isn't working on a LLM Siri remake since December 2022 I'd be very, very surprised. Ditto for Amazon and Alexa, there's some public information at https://developer.amazon.com/en-US/alexa/alexa-ai but hope it isn't all they're working on.

francesco3142y ago· 1 in thread

It would be a lot better if we could use a local model. Doesn't seem anyone has done that fork yet?

fdondi2y ago

Also it seems this is a task for which a model could be strongly fine-tuned. We just need a dataset of typical interactions...

DonHopkins2y ago· 1 in thread

Anyone who uses the term "work smarter not harder" in a demo video, or at any point in their life in any situation, should not be taken seriously.

kQq9oHeAz6wLLS2y ago

That rules out every parent trying to teach their kids to think critically. I don't think you'll like where that leads us.

anonzzzies2y ago

This is from MS , so it works better than other attempts for other OSs? I tried a few Mac, browser and linux attempts and they were unusable; clicking in the wrong place, having no clue what to do etc…. And can this handle a browser or is that off limits (considering how many projects tried that and failed)?

DonHopkins2y ago

It's more useful to use AI (or simply your human brain) to design good user interfaces in the first place, instead of using it to paper over flabbergastingly terrible user interface designs.

rewgs2y ago

On one hand, I'm really glad to have come across this this morning, as I'm trying to automate a task that requires GUI automation due to a lack of an API. But I'm stuck even on that because some windows don't accept key input to select the "Enter" button (why? no idea), so now I'm having to drop into mouse automation, which of course is super imprecise, being DPI-dependent. So, taking screenshots and then feeding those into a LLM is a solid solution, I guess.

On the other hand, I shudder to think of the millions of man hours required to arrive at this solution, when simple UI guidelines, or better yet, an API, would have solved my problem far more simply and efficiently.

jpalomaki2y ago

Interesting approach. In some organizatons you have those ancient apps that are difficult to use/deploy and impossible to replace.

Sometimes these are automated with ”robotic procesd automation” tools. Something like UFO could streamline the process.

beefnugs2y ago

Interesting... but... you know this means to make this work well they are going to start recording EVERYONES screens against their will to get more training data right?

hubraumhugo2y ago

That's quite big news for many RPA startups I guess?

lloydatkinson2y ago

Python! Some of their teams have absolutely no faith in their own languages and frameworks.

DonHopkins2y ago

Not as impressive or interesting as CogAgent:

CogVLM: Visual Expert for Pretrained Language Models

CogAgent: A Visual Language Model for GUI Agents

https://arxiv.org/abs/2312.08914

https://github.com/THUDM/CogVLM

https://arxiv.org/pdf/2312.08914.pdf

CogAgent: A Visual Language Model for GUI Agents

Abstract

People are spending an enormous amount of time on digital devices through graphical user interfaces (GUIs), e.g., computer or smartphone screens. Large language models (LLMs) such as ChatGPT can assist people in tasks like writing emails, but struggle to understand and interact with GUIs, thus limiting their potential to increase automation levels. In this paper, we introduce CogAgent, an 18-billion-parameter visual language model (VLM) specializing in GUI understanding and navigation. By utilizing both low-resolution and high-resolution image encoders, CogAgent supports input at a resolution of 1120×1120, enabling it to recognize tiny page elements and text. As a generalist visual language model, CogAgent achieves the state of the art on five text-rich and four general VQA benchmarks, including VQAv2, OK-VQA, Text-VQA, ST-VQA, ChartQA, infoVQA, DocVQA, MM-Vet, and POPE. CogAgent, using only screenshots as input, outperforms LLM-based methods that consume extracted HTML text on both PC and Android GUI navigation tasks—Mind2Web and AITW, advancing the state of the art. The model and codes are available at https://github.com/THUDM/CogVLM .

1. Introduction

Autonomous agents in the digital world are ideal assistants that many modern people dream of. Picture this scenario: You type in a task description, then relax and enjoy a cup of coffee while watching tasks like booking tickets online, conducting web searches, managing files, and creating PowerPoint presentations get completed automatically.

Recently, the emergence of agents based on large language models (LLMs) is bringing us closer to this dream. For example, AutoGPT [33], a 150,000-star open-source project, leverages ChatGPT [29] to integrate language understanding with pre-defined actions like Google searches and local file operations. Researchers are also starting to develop agent-oriented LLMs [7, 42]. However, the potential of purely language-based agents is quite limited in realworld scenarios, as most applications interact with humans through Graphical User Interfaces (GUIs), which are characterized by the following perspectives:

• Standard APIs for interaction are often lacking.

• Important information including icons, images, diagrams, and spatial relations are difficult to directly convey in words.

• Even in text-rendered GUIs like web pages, elements like canvas and iframe cannot be parsed to grasp their functionality via HTML.

Agents based on visual language models (VLMs) have the potential to overcome these limitations. Instead of relying exclusively on textual inputs such as HTML [28] or OCR results [31], VLM-based agents directly perceive visual GUI signals. Since GUIs are designed for human users, VLM-based agents can perform as effectively as humans, as long as the VLMs match human-level vision understanding. In addition, VLMs are also capable of skills such as extremely fast reading and programming that are usually beyond the reach of most human users, extending the potential of VLM-based agents. A few prior studies utilized visual features merely as auxiliaries in specific scenarios. e.g. WebShop [39] which employs visual features primarily for object recognition purposes. With the rapid development of VLM, can we naturally achieve universality on GUIs by relying solely on visual inputs?

In this work, we present CogAgent, a visual language foundation model specializing in GUI understanding and planning while maintaining a strong ability for general cross-modality tasks. By building upon CogVLM [38]—a recent open-source VLM, CogAgent tackles the following challenges for building GUI agents: [...]

v3ss0n2y ago

Finally! Disclosure happened . The company which David Grush said holding UFO technology provided by AlIens are Microsoft! Who would have thought that. Truly non-Human intelligence assisted technology.

j / k navigate · click thread line to collapse

62 comments

54 comments · 23 top-level

smashah2y ago· 5 in thread

If anyone else did this then some useless lawyer would be activated from their slumber to send out OSS devs cease & desists based on supposed violations of sacred terms and conditions.

This project is the future that we were promised but it is under threat by supposed legal challenges.

What if the demo asked the email to be sent as a message via whatsapp desktop instead? That would (according to their anti-freedom lawyers) constitute an offense worthy of legal threats.

The tech industry needs to reckon with ToS trolls before it's too late.

klabb32y ago

Rules for thee not for me. Scraping for instance is bad only when other people do it.

smashah2y ago

The promises of web 2.0 were stolen from us.

Adversarial interop should be a digital human right.

Put an API layer in front of UFO and all of a sudden we're one step closer to the unshittification of our digital lives.

1 more reply

dartos2y ago

I’ve worked on many sites over the past 10 years.

Never once was the conscious decision made to not be accessible to prevent bots from scraping.

1 more reply

tiborsaas2y ago

I don't understand why anything in the demo would count as T&C violation. Who would be offended on what basis? If I can automate my desktop with a preprogrammed script, then this is not different.

smashah2y ago

As I said, if the demo opened Whatsapp and sent a message then their lawyers would harass the OSS developer of the tool with C&Ds and they'd consider it a breach of their ToS and block your account.

Adversarial interop should be an inalienable digital human right. This way companies will be forced to give API access or risk interoperability being legally scraped against their will.

tflol2y ago· 5 in thread

have they fixed teams yet?

that app that like the entire united states uses for pc work every day?

TeMPOraL2y ago

Yes of course. It works perfectly for what it was meant to be: good enough so people won't switch away from MS ecosystem to have a group chat. Nothing more.

Teams is actively developed along the lines that deliver greatest value to Microsoft at the expense of their customers (a business model increasingly popular these days):

1) implementing corporate-nerfed versions of vanity features introduced by competitors in group chat space (Slack, Discord);

2) broader integration with everything else in Microsoft's corporate ecosystem.

You can be excused for thinking Teams is just a crappy chat-based interface to SharePoint, because this is what it effectively is (Don't have SharePoint? Sucks to be you.).

Copy-paste? What are you? A corporate smartass? There's no budget left for smart-ass features - it's all in lock-in features, where the RoI is much greater.

zhivota2y ago

Copy paste? What are you trying to do, exfiltrate company secrets? Security turned off that feature to protect us all.

userbinator2y ago

MS Teams is exactly what I'd expect from an AI-generated app --- barely working, lots of little bugs and general oddness, and overall a mediocre experience.

dartos2y ago

Isn’t teams quite an old product?

nxobject2y ago

Every day, I thank myself for working in the one municipality that went in early and hard on GSuite.

ukuina2y ago· 3 in thread

Note this is from Microsoft!

They are on a roll lately, and seem to have beaten OpenAI to GPT-Agents with this release.

dbish2y ago

ukuina2y ago

If I understand correctly, there is zero overlap between Microsoft AI Research efforts that use GPT4(V) and OpenAI's development.

OpenAI gave Microsoft the model weights, and Microsoft hosts it on Azure for MS Research. None of the Azure usage analytics goes back to OpenAI except via bug reports and publications.

1 more reply

capybara_20202y ago

Just my experience. Tried Copilot(the one for Office) but was left disappointed.

MS is putting out a lot of things quickly. But the quality is just not there in my experience. They are doing way too many things too fast to make any one thing good.

dartharva2y ago· 3 in thread

Does it support other visual-input-accepting language models? GPT-V is paywalled.

Deverauxi2y ago

The code isn’t very complicated. You could edit in any vision model you want:

https://github.com/microsoft/UFO/blob/main/ufo/llm/llm_call....

dartharva2y ago

I don't have the resources currently but would love to test something like LLaVA out for this, but I won't keep my hopes up.

dartharva2y ago

Update: no, it doesn't as of now.

cuckatoo2y ago· 3 in thread

austinwade2y ago

The engineers who built this UFO project probably would have been able to move Control Panel to settings. Microsoft is a big company.

batch122y ago

MaxikCZ2y ago

I like seeing the context, imo it is important.

1 more reply

gareth_untether2y ago· 2 in thread

The following section from the readme stands out:

The GPT-V accepts screenshots of your desktop and application GUI as input. Please ensure that no sensitive or confidential information is visible or captured during the execution process.

userbinator2y ago

That sounds like "don't use this" to everyone I know; or is this meant to be a disclaimer and at the same time normalise the lack of privacy?

ukuina2y ago

Doesn't Azure OpenAI support resolve the privacy concerns?

2 more replies

DonHopkins2y ago· 2 in thread

I had ChatGPT analyze some Factorio screen dumps, which was interesting:

https://docs.google.com/document/d/14p_iPhIKjDoTGa2Zr_5gPV9_...

g3ol4d02y ago

is it somewhat right? I don't know much about factorio

nicklecompte2y ago

Some specific errors I did notice:

ThinkBeat2y ago· 2 in thread

dartharva2y ago

On a side note, it looks like this thing can be a terrific cheating tool in Strategy video games..

ametrau2y ago

This is a paper?

hyperhello2y ago· 1 in thread

So they built an AI that can use the windows environment. Maybe they could make a simple graphical shell to control the AI.

65102y ago

That sounds both practical and hilarious. Imagine wrapping the whole OS in a pdf file.

yonatan80702y ago· 1 in thread

With all the emoji in the README, this looks like a JS framework

Piisamirotta2y ago

That is super annoying. Or maybe I'm becoming old.

_sword2y ago· 1 in thread

Like a Windows-focused copy of the the Self Operating Computer Framework. There's clearly something here with this approach if it's popping up in multiple open source frameworks

anonzzzies2y ago

But they don’t work at all. They will get better but they all are pretty much unworkable for the moment.

nxobject2y ago· 1 in thread

If this idea's in the air, I wouldn't be surprised if Apple's working on a similar concept, but with an accessibility or a "Siri remake" bent to it.

baq2y ago

francesco3142y ago· 1 in thread

It would be a lot better if we could use a local model. Doesn't seem anyone has done that fork yet?

fdondi2y ago

Also it seems this is a task for which a model could be strongly fine-tuned. We just need a dataset of typical interactions...

DonHopkins2y ago· 1 in thread

Anyone who uses the term "work smarter not harder" in a demo video, or at any point in their life in any situation, should not be taken seriously.

kQq9oHeAz6wLLS2y ago

That rules out every parent trying to teach their kids to think critically. I don't think you'll like where that leads us.

anonzzzies2y ago

DonHopkins2y ago

It's more useful to use AI (or simply your human brain) to design good user interfaces in the first place, instead of using it to paper over flabbergastingly terrible user interface designs.

rewgs2y ago

jpalomaki2y ago

Interesting approach. In some organizatons you have those ancient apps that are difficult to use/deploy and impossible to replace.

Sometimes these are automated with ”robotic procesd automation” tools. Something like UFO could streamline the process.

beefnugs2y ago

Interesting... but... you know this means to make this work well they are going to start recording EVERYONES screens against their will to get more training data right?

hubraumhugo2y ago

That's quite big news for many RPA startups I guess?

lloydatkinson2y ago

Python! Some of their teams have absolutely no faith in their own languages and frameworks.

DonHopkins2y ago

Not as impressive or interesting as CogAgent:

CogVLM: Visual Expert for Pretrained Language Models

CogAgent: A Visual Language Model for GUI Agents

https://arxiv.org/abs/2312.08914

https://github.com/THUDM/CogVLM

https://arxiv.org/pdf/2312.08914.pdf

CogAgent: A Visual Language Model for GUI Agents