This is the key to accurate control; it needs to be very precise.
Maybe Claude's model is trained for this. Also, what about open-source vision models? Are any of them good at "pointing at things" on a typical computer screen?