Turns out that with very few lines of code the results are already impressive: GPT-V can control my computer really well, and I can ask it to do whatever task by itself. It clicks around, types stuff, and presses buttons to navigate.
Would love to hear your thoughts on it!
Nice work too!
thanks vim!
However, I tried self-operating-computer, and it could not find the right x,y positions on the screen to execute the task as effectively.
Edit: it would also probably be good, in case it is not sure, to open the browser and try there.
need to figure out the right UX to do that, and I think the multi-modal models also need to get a bit faster
For example in the twitter screenshot, it could use just the one image.
we could try to patch an "interpolation" kind of thing for change, but I'm also curious to see if the multi-modal models coming out with video support would be able to actually just "watch the video" in real time. That would be the ultimate solution.
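For what it's worth, here is a rough sketch of one way the "change" idea could work until video models are fast enough: compare consecutive screenshots and only send a new one to the model when enough of the screen has changed. Everything here is an assumption for illustration, not anything from the project: the `Frame` type (a grayscale screenshot as rows of 0-255 values), the function names `changed_fraction` and `should_send`, and the thresholds are all made up.

```python
from typing import List

# Hypothetical representation: a grayscale screenshot as rows of 0-255 pixel values.
Frame = List[List[int]]


def changed_fraction(prev: Frame, curr: Frame, pixel_tol: int = 10) -> float:
    """Return the fraction of pixels whose brightness differs by more than pixel_tol."""
    if len(prev) != len(curr) or any(len(a) != len(b) for a, b in zip(prev, curr)):
        raise ValueError("frames must have the same dimensions")
    total = sum(len(row) for row in curr)
    if total == 0:
        return 0.0
    changed = sum(
        1
        for row_prev, row_curr in zip(prev, curr)
        for p, c in zip(row_prev, row_curr)
        if abs(p - c) > pixel_tol
    )
    return changed / total


def should_send(prev: Frame, curr: Frame, min_fraction: float = 0.02) -> bool:
    """Assumed policy: send a new screenshot only when enough of the screen changed."""
    return changed_fraction(prev, curr) >= min_fraction
```

With real screenshots you would first downscale and convert to grayscale, and the 2% threshold would need tuning; small cursor movements probably should not trigger a new model call.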