If the ultimate goal is having an LLM control a computer, round-tripping through a UX designed for bipedal bags of meat with weird jelly-filled optical sensors is wildly inefficient.
Just stay in the computer! You're already there! Vision-driven computer use is a dead end.