X11 may be a dead end but Wayland sucks as a replacement, so for now, I see no other option than supporting them both.
It may be technically possible to do the equivalent of X11 forwarding with Wayland, that is, connecting to a server with an ssh terminal (no remote desktop, headless server), running a GUI app, and having it display its windows on my own desktop as if it were running locally. The problem is that Wayland is 17 years old and I still can't.
For any kind of decent remote desktop access you need good performance, specifically low latency. X11 just isn't there.
A headless server is a headless server - you can't have anything there with X11 in that case either. If you want to forward X11, you need an X server, which means it's already not headless.
Instead of an X server you can have any Wayland compositor (Wayland server) plus whatever part provides the streaming (FreeRDP or whatnot).
So I don't see how X11 is any better - it's just worse due to having abysmal performance. X11 was never designed for real world remote desktop usage - it just happens to have network transparency. So it's X11 that's a kludge for such scenario if anything.
To me this reads a bit confused, but perhaps I'm misreading it? In X11 terminology the server is sitting in front of you (the one that draws to the screen), so no, you don't need the remote host to be running an X11 server.
You do need the program that draws to the screen, but I think it's fair to say the remote host is headless if it doesn't have a GPU nor a program to interface with the GPU at all. All the remote host needs is code to interact with such a server over TCP or Unix domain sockets. And that code is tiny, even small computers without memory for frame buffer can do it.
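To give a feel for how little the remote side has to do, here is a rough Python sketch of the first thing an X11 client sends: the fixed 12-byte connection setup request, optionally followed by padded authorization fields. The function name `build_setup_request` is made up for illustration; a real client would normally let Xlib or XCB handle this, and this sketch only builds the bytes without opening a socket.

```python
import struct


def build_setup_request(auth_name: bytes = b"", auth_data: bytes = b"") -> bytes:
    """Build the X11 connection setup request (hypothetical helper, little-endian client)."""

    def pad4(blob: bytes) -> bytes:
        # X11 pads variable-length fields to a multiple of 4 bytes.
        return blob + b"\x00" * (-len(blob) % 4)

    header = struct.pack(
        "<BxHHHH2x",
        0x6C,            # 'l': the client declares little-endian byte order
        11,              # protocol major version
        0,               # protocol minor version
        len(auth_name),  # length of the authorization protocol name
        len(auth_data),  # length of the authorization data
    )
    return header + pad4(auth_name) + pad4(auth_data)


if __name__ == "__main__":
    request = build_setup_request()
    print(len(request), request.hex())
```

Sending it would just mean writing these bytes to a TCP connection on port 6000 plus the display number, or to the Unix domain socket under `/tmp/.X11-unix/`, and then parsing the server's reply. No frame buffer is involved on the client side at any point.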
> So I don't see how X11 is any better - it's just worse due to having abysmal performance. X11 was never designed for real world remote desktop usage - it just happens to have network transparency. So it's X11 that's a kludge for such scenario if anything.
I think X11 was actually pretty great at the time it was created: clients can create IDs and use them in their requests (no round trip to the server), and the server can hold large client bitmaps that the client can operate on. But sometimes poor client coding can kill the performance over the network. As the worst offender, I once noticed VirtualBox did a looooot of synchronous property requests during its startup instead of doing them concurrently, stretching the startup time from seconds to a minute or more. (Whether it truly needed those properties in the first place is another question.)
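A back-of-the-envelope model of why serialized requests hurt so much over a network, as a sketch only. The function `startup_time` and the numbers in the usage part are illustrative assumptions, not measurements of VirtualBox; the model ignores bandwidth and server processing time and only counts round trips.

```python
def startup_time(n_requests: int, rtt_s: float, serialized: bool) -> float:
    """Rough time (seconds) until all replies have arrived.

    serialized=True:  each request waits for the previous reply, so the
                      cost is one round trip per request.
    serialized=False: all requests are sent back to back and the replies
                      are collected afterwards, so roughly one round trip.
    """
    if n_requests == 0:
        return 0.0
    return n_requests * rtt_s if serialized else rtt_s


if __name__ == "__main__":
    # Hypothetical workload: 2000 property requests over a 30 ms link.
    n, rtt = 2000, 0.030
    print(f"one at a time: {startup_time(n, rtt, serialized=True):.2f} s")
    print(f"pipelined:     {startup_time(n, rtt, serialized=False):.2f} s")
```

On a local socket the round trip is so small that both variants look fine, which is probably why this kind of thing goes unnoticed until someone runs the program over a real network.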
Sending the complete interaction as a video stream? That's what I'd call a hack—though X11 should be modernized in various aspects, for example to support more advanced encodings for media, controlled by the client.
In some sense the web is the direction where I would have liked to see X11 going: still controlled by the client, but some light server-side code could be used to render and interact with the widgets. This way clicks would react immediately, but you would still be interacting with the actual service running on the remote host, not just a local program.
(Another reason why I consider X11 better is the separation of the server and the compositor.)
You can use software rendering for Wayland cases too. There are even OpenGL / Vulkan software implementations.
> All the remote host needs is code to interact with such a server over TCP or Unix domain sockets. And that code is tiny, even small computers without memory for frame buffer can do it
I don't really see much value in such a use case. A thin client (the reverse) makes more sense (i.e. where your side is a weak computer and the remote server is something more powerful).
But either way, running a compositor even with software rendering should be doable even on low end hardware.
> Sending the complete interaction as a video stream? That's what I'd call a hack
Why not? Video, by the very nature of modern codecs, is already heavily optimized to focus only on changes to the encoded image, so it's the best option. You render things where they run, then send the video.
It works even for such intense (changes wise) cases as gaming and actual video media. Surely it works for GUIs too.
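To illustrate the "only changes get sent" idea in its simplest form, here is a toy Python sketch that compares two frames tile by tile and reports which tiles differ. This is not how a real codec works (those use motion-compensated prediction and much more); `changed_tiles` and the tiny frames are assumptions for illustration only.

```python
from typing import List, Tuple

Frame = List[List[int]]  # rows of pixel values


def changed_tiles(prev: Frame, curr: Frame, tile: int = 2) -> List[Tuple[int, int]]:
    """Return the (row, col) origin of every tile x tile block that differs."""
    height, width = len(curr), len(curr[0])
    dirty = []
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            rows = range(ty, min(ty + tile, height))
            cols = slice(tx, tx + tile)  # slicing clamps at the right edge
            if any(prev[y][cols] != curr[y][cols] for y in rows):
                dirty.append((ty, tx))
    return dirty


if __name__ == "__main__":
    before = [[0] * 4 for _ in range(4)]
    after = [row[:] for row in before]
    after[3][2] = 1  # e.g. a cursor blinking in one corner
    print(changed_tiles(before, after))
```

For a mostly static GUI, only the few dirty tiles would need to go over the wire for each frame, which is the intuition behind the argument above.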