Show HN: GPT-V and OCR for Screen Control

(github.com)

22 points · rchaves · 2y ago · 10 comments

10 comments · 4 top-level

rchaves (OP) · 2y ago · 4 in thread

Hey there everyone, now that AI can "see" very well with GPT-V, I was wondering if it could interact with a computer like we do, just by looking at it. One of the shortcomings of GPT-V is that it cannot really pinpoint the x,y coordinates of something on the screen very well, but I solved that by combining it with simple OCR and annotating the screenshot so GPT-V can tell where it wants to click.
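A minimal sketch of the tagging idea described above, not the project's actual code. It assumes the OCR step has already produced text boxes with pixel coordinates; `OcrBox`, `make_labels`, `tag_boxes`, and `resolve_click` are hypothetical names invented for illustration. Each box gets a short Vimium-style tag, the model answers with a tag instead of raw coordinates, and the tag is mapped back to the center of the box.

```python
from __future__ import annotations

from dataclasses import dataclass
from itertools import product
from string import ascii_uppercase


@dataclass(frozen=True)
class OcrBox:
    """One piece of text found by OCR, with its bounding box in pixels (hypothetical type)."""
    text: str
    left: int
    top: int
    width: int
    height: int

    def center(self) -> tuple[int, int]:
        # Click target: the middle of the bounding box.
        return (self.left + self.width // 2, self.top + self.height // 2)


def make_labels(count: int) -> list[str]:
    """Generate short tags: A..Z, then AA, AB, ... until `count` tags exist."""
    labels: list[str] = []
    length = 1
    while len(labels) < count:
        for letters in product(ascii_uppercase, repeat=length):
            labels.append("".join(letters))
            if len(labels) == count:
                break
        length += 1
    return labels


def tag_boxes(boxes: list[OcrBox]) -> dict[str, OcrBox]:
    """Assign one tag per OCR box, in the order the boxes were given."""
    return dict(zip(make_labels(len(boxes)), boxes))


def resolve_click(tagged: dict[str, OcrBox], label: str) -> tuple[int, int]:
    """Turn the tag chosen by the model back into x,y screen coordinates."""
    return tagged[label.strip().upper()].center()
```

In a real loop the tags would be drawn onto the screenshot before it is sent to the model, and the resolved point would be passed to whatever library performs the click; both steps are left out here.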

Turns out that with very few lines of code the results are already impressive: GPT-V can really control my computer super well, and I can ask it to do whatever tasks by itself. It clicks around, types stuff, and presses buttons to navigate.

Would love to hear your thoughts on it!

wills_forward · 2y ago

I really like the elegant simplicity of tagging the screen elements like that and not obfuscating it away.

Nice work too!

rchaves (OP) · 2y ago

thanks! I took my inspiration from the Vimium browser plugin (https://chromewebstore.google.com/detail/vimium/dbepggeogbai...), which has a shortcut, F, that lets you pick any element on the page to navigate to

thanks vim!

hnuser123456 · 2y ago

Have you already seen OthersideAI self-operating-computer? It sounds like exactly what you're describing: https://www.youtube.com/watch?v=UKRti40U8IA

rchaves (OP) · 2y ago

yes actually, but I only saw it after I had implemented this. I had actually searched for something like that before, but I guess Google is getting worse and worse these days

however, I tried self-operating-computer, and it could not find the right x,y positions on the screen or execute the task as effectively

anonzzzies · 2y ago · 1 in thread

Nice work, I was looking for this for a while and had no time to do it myself. I would say it's probably a good idea to make it AI-assisted; many things you can do faster yourself by saying 'click h2', fill in text 'hello world', etc. instead of having the LLM figure it out. So a combination of things, basically. But a very good start!
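One way the hybrid suggested above could look, as a rough sketch: try to parse the input as a direct command first, and only fall back to the model when nothing matches. The command grammar and the name `parse_direct_command` are assumptions made up for this example, not part of the project.

```python
from __future__ import annotations

import re
from typing import Optional

# Hypothetical grammar: "click <tag>" and "fill in text '<text>'" (or "type '<text>'").
CLICK_RE = re.compile(r"^click\s+([A-Za-z0-9]+)$", re.IGNORECASE)
TYPE_RE = re.compile(r"^(?:fill in text|type)\s+'([^']*)'$", re.IGNORECASE)


def parse_direct_command(command: str) -> Optional[tuple[str, str]]:
    """Return (action, argument) for a direct command, or None to defer to the LLM."""
    command = command.strip()
    match = CLICK_RE.match(command)
    if match:
        # Tags are normalized to upper case so "h2" and "H2" mean the same element.
        return ("click", match.group(1).upper())
    match = TYPE_RE.match(command)
    if match:
        return ("type", match.group(1))
    # Anything else is treated as a free-form task for the model to figure out.
    return None
```

For example, `parse_direct_command("click h2")` yields a click on tag `H2` without a model call, while a request like "post a tweet about cats" returns `None` and would go to the LLM.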

Edit: it's also probably good, in case it is not sure, to open the browser and try there.

rchaves (OP) · 2y ago

indeed! Ideally I want it to have real-time human-machine feedback, so you can interrupt it in the middle, point at things, then ask for new things, and so on, kind of like having someone pairing with you while you tell them what to do and course-correct

need to figure out the right UX to do that, and I think the multi-modal models also need to get a bit faster

xelia · 2y ago · 1 in thread

I wonder if this can be optimized by letting GPT provide multiple instructions per screenshot instead of just one.

For example in the twitter screenshot, it could use just the one image.

rchaves (OP) · 2y ago

If you look closely, it actually does give multiple instructions per screenshot! However, it cannot get too far, because the screen changes under it. For example, when it starts typing a tweet, the tweet box expands and the send button moves, so it tries to click the button but it's no longer there. It needs to take another screenshot to see, because it's kind of executing those steps "in the dark"
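The "in the dark" problem above can be sketched as a staleness check: before each planned click, confirm the target is still where it was when the plan was made, and hand the remaining steps back for replanning if it moved. This is only an illustration under stated assumptions: `run_until_stale`, `locate`, and `perform` are hypothetical names, and it assumes that re-locating a piece of text (for example by re-running OCR) is cheaper than another model call.

```python
from typing import Callable, Optional

Point = tuple[int, int]
Action = tuple[str, str]  # ("click", target text) or ("type", text to type)


def run_until_stale(
    actions: list[Action],
    planned_positions: dict[str, Point],
    locate: Callable[[str], Optional[Point]],
    perform: Callable[[str, str, Optional[Point]], None],
    tolerance: int = 5,
) -> list[Action]:
    """Run planned actions in order; return the ones left over once the plan goes stale.

    `planned_positions` holds where each click target was when the plan was made.
    `locate` is a caller-supplied lookup of where a target is right now.
    `perform` is a caller-supplied function that actually clicks or types.
    """
    for index, (kind, argument) in enumerate(actions):
        if kind == "click":
            expected = planned_positions[argument]
            current = locate(argument)
            moved = current is None or (
                abs(current[0] - expected[0]) > tolerance
                or abs(current[1] - expected[1]) > tolerance
            )
            if moved:
                # Target vanished or shifted: stop and let the caller replan
                # from a fresh screenshot, starting at this action.
                return actions[index:]
            perform(kind, argument, current)
        else:
            perform(kind, argument, None)
    return []
```

With the tweet example, the click on the text box and the typing would run, and the click on the send button would be returned as leftover work because the button moved after the box expanded.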

we could try to patch in an "interpolation" kind of thing for changes, but I'm also curious to see whether the multi-modal models that are coming out with video support would be able to actually just "watch the video" in real time; this would be the ultimate solution

okish · 2y ago

Take a look at this related work https://arxiv.org/abs/2310.11441
