Doesn't the CPU/GPU boundary, which is already assumed to be slow, actually provide the perfect opportunity for abstraction over a network protocol? You'd send "what to draw" and "how" (shaders) over the wire infrequently, then issue cheap draw commands on demand. I think GPUs make a network-first model far more viable than anything available when X was designed.
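To make the asymmetry concrete, here's a minimal sketch of what such a wire format might look like. The opcodes, layout, and sizes are entirely made up for illustration; the point is just that the heavy payload (shader source) crosses the wire once, while per-frame traffic is a stream of tiny fixed-size commands referencing it by id:

```python
import struct

# Hypothetical opcodes (invented for this sketch, not from any real protocol)
UPLOAD_SHADER = 1   # infrequent: cache shader source server-side under an id
DRAW = 2            # per frame: reference the cached shader id plus a position

def encode_upload(shader_id: int, source: bytes) -> bytes:
    # 7-byte header (opcode, id, length) followed by the shader source
    return struct.pack("!BHI", UPLOAD_SHADER, shader_id, len(source)) + source

def encode_draw(shader_id: int, x: int, y: int) -> bytes:
    # fixed 7 bytes per draw command: opcode, shader id, x, y
    return struct.pack("!BHhh", DRAW, shader_id, x, y)

source = b"void main() { /* expensive-to-send payload, sent once */ }"
upload = encode_upload(7, source)

# A frame of 60 draws costs 60 * 7 = 420 bytes, regardless of shader size
frame = b"".join(encode_draw(7, x, 0) for x in range(60))
```

One upload amortized over thousands of frames, with each draw costing a handful of bytes, is the trade the comment above is pointing at; resending pixels (what a dumb compositor does) scales with resolution instead.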
Only if everyone agrees on a central rendering model and feature set, which simply isn't the case. 2D rendering is not a solved problem; there are many different takes on it. Insisting that all GUIs on a given device use a single rendering system simply isn't realistic, which is why nobody actually uses any of the X rendering commands other than "draw pixmap" (i.e., the server ends up as just a dumb compositor).