Even "simple" tasks, such as how to composite a window, a video in that window, and a floating menu over that window, are not very well specified in any OS (try resizing that window and watch the fun). Or take your floating menu: should the parent window resize itself and paint transparent pixels in the extra area, or should that menu be a separate "window"? And what does being a "window" even mean?
Add to that the fact that we really should be making multithreaded GUI systems at this point, and it's very clear that GUIs are stuck in a local minimum that's really hard to get out of, because GUI systems require so many lines of code.