What happened is basically this:
- X11 was fairly complex, had a lot of dead-weight code, and a whole bunch of fundamental issues that were building up to warrant an "X12" and breaking the core protocols to get fixed.
- Alongside this, at some point in the relatively distant past, the XFree86 implementation (the most used one at the time, which later begat Xorg) was massively fat and did a huge amount of stuff, including driver-level work - think "PCI in userspace".
- Over the years, more and more of the work moved out of the X server and into e.g. the Linux kernel. Drivers, mode setting, input event handling, higher-level input event handling (libinput). Xorg also got a bit cleaner and modularized, making some of the remaining hard bits available in library form (e.g. xkb).
- With almost everything you need for a windowing system now cleanly factored out and no longer in X, trying out a wholly new windowing system suddenly became a tractable problem. This enabled the original Wayland author to attempt it and submit it for others to appraise.
- A lot of the active X and GUI developers liked what they saw, and it managed to catch on.
- A lot of the people involved were in fact not naive, and did not ignore the classic "should you iterate or start over" conundrum. Wayland had a fairly strong backward compat story very early on in the form of Xwayland, which was created almost immediately, and convinced a lot of people.
In the end this is a very classic software engineering story. Would it have been possible to iterate X11 toward what Wayland is now in-place? Maybe. Not sure. Would the end result look a lot like Wayland today? Probably, the Wayland design is still quite good and modern.
It's a lot like Python 2.x vs. 3.x in the end.
X11 was OK for its time, but fundamentally it's a really outdated design, solving 80s/90s problems over the network the way you solved them back then.
Why not use a much simpler command buffer model like on 3D APIs, and only for actually asynchronous commands (but not simple synchronous getters)?
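For what it's worth, the command-buffer idea is roughly: record async commands locally into one byte buffer, then ship the whole batch in a single write, rather than one transport round trip per call. A toy sketch of what that could look like (the opcodes, wire layout, and class name here are all made up for illustration, not Wayland's actual wire format):

```python
import struct

class CommandBuffer:
    """Hypothetical batched command stream in the spirit of 3D APIs:
    asynchronous commands are appended locally and sent in one flush."""

    # Illustrative opcodes, not from any real protocol.
    OPCODES = {"create_surface": 1, "attach_buffer": 2, "commit": 3}

    def __init__(self):
        self.buf = bytearray()

    def record(self, op, *args):
        # Encode args as little-endian u32s.
        payload = struct.pack(f"<{len(args)}I", *args)
        # Header: 16-bit opcode, 16-bit payload length.
        self.buf += struct.pack("<HH", self.OPCODES[op], len(payload))
        self.buf += payload

    def flush(self, write):
        # One transport write per batch instead of per command.
        data = bytes(self.buf)
        self.buf.clear()
        write(data)
        return data
```

Synchronous getters are exactly what this model avoids: any command that needs an answer forces a flush plus a round trip, which is why modern protocols try to make almost everything fire-and-forget.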
PS: does io_uring use sockets for its "shared memory ringbuffers"? If not, why not? Too much overhead? And if there's notable socket overhead, why does Wayland use sockets (since it has to solve a similar problem to io_uring or 3D APIs)?
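For the record, io_uring does not use sockets: its submission and completion queues are plain memory shared between process and kernel (set up via mmap on the io_uring fd), precisely to avoid per-operation syscall and copy overhead. The core data structure is a single-producer/single-consumer ring where head and tail counters only ever increase and the slot index is `counter & mask`. A minimal in-process sketch of that indexing scheme (not io_uring's actual struct layout):

```python
class Ring:
    """SPSC ring buffer, io_uring-style: monotonically increasing
    head/tail counters, power-of-two size, slot = counter & mask."""

    def __init__(self, size):
        assert size & (size - 1) == 0, "size must be a power of two"
        self.mask = size - 1
        self.slots = [None] * size
        self.head = 0  # consumer position
        self.tail = 0  # producer position

    def push(self, item):
        if self.tail - self.head > self.mask:
            return False  # ring full
        self.slots[self.tail & self.mask] = item
        self.tail += 1
        return True

    def pop(self):
        if self.head == self.tail:
            return None  # ring empty
        item = self.slots[self.head & self.mask]
        self.head += 1
        return item
```

In the real thing the counters live in the shared mapping and are read/written with atomic acquire/release semantics; a notable point is that the "is it full/empty" checks need no locks because each side only writes its own counter.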
We have far better ways to transport graphical buffers around (also, what about sound? That seems reasonable - but should the display server also be a sound server?), so building buffer transport into such a core protocol just doesn't make sense.
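To be fair to Wayland here: the pixel data never actually travels over the socket. The socket carries only small control messages plus file descriptors (shm or dmabuf handles) passed as SCM_RIGHTS ancillary data, so compositor and client end up sharing the same buffer memory. A minimal Python sketch of that fd-passing mechanism (helper names are mine; requires Python 3.9+ for `socket.send_fds`):

```python
import os
import socket
import tempfile

def send_buffer_fd(sock, fd):
    # One byte of real data carries the fd as SCM_RIGHTS ancillary payload.
    socket.send_fds(sock, [b"B"], [fd])

def recv_buffer_fd(sock):
    msg, fds, flags, addr = socket.recv_fds(sock, 1, 1)
    return fds[0]

# Demo: pretend the temp file is a shared pixel buffer.
parent, child = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
with tempfile.TemporaryFile() as shared_buf:
    shared_buf.write(b"pixels")
    shared_buf.flush()
    send_buffer_fd(parent, shared_buf.fileno())
    fd = recv_buffer_fd(child)          # a new fd to the same file
    os.lseek(fd, 0, os.SEEK_SET)
    data = os.read(fd, 6)               # reads what the "client" wrote
    os.close(fd)
parent.close()
child.close()
```

That's also part of the answer to "why sockets": AF_UNIX sockets are the standard Unix mechanism for passing fds between processes, which shared-memory rings alone can't do.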