I'll describe what we've got, but fair warning that I don't know how the write-pixels-to-the-screen side of GPUs works. There are some instructions with weird names that I assume make sense in that context. Presumably one allocates memory and writes to it in some fashion.
LLVM libc is picking up capability over time, implemented similarly to the non-GPU architectures. The same tests run on x64 or the GPU, printing to stdout as they go. Hopefully standing up libc++ on top will work smoothly. It's encouraging that I sometimes struggle to remember whether a test is currently running on the host or the GPU.
The data structure that libc uses to have x64 call a function on amdgpu, or to have amdgpu call a function on x64, is mostly a blob of shared memory and careful atomic operations. That was originally general purpose and lived in a prototype-quality GitHub repository. It's currently specialised to libc. It should end up in an under-debate llvm/offload project, which will make it easily reusable again.
This isn't quite decoupled from vendor stuff. The GPU driver needs to be running in the kernel somewhere. On nvptx, we make a couple of calls into libcuda to launch main(). On amdgpu, it's a couple of calls into libhsa. I did have an OpenCL loader implementation as well, but that has probably rotted; Intel seems to be on that stack but isn't in LLVM upstream.
A few GPU projects have noticed that implementing a CUDA layer and a SPIR-V layer and an HSA or HIP layer and whatever others is quite annoying. Possibly all GPU projects have noticed that. We may get an llvm/offload library that successfully abstracts over those, which would let people allocate memory, launch kernels, use arbitrary libc stuff and so forth running against that library.
That's all from the compute perspective. It's possible I should look up what sending numbers over HDMI actually involves. I believe the GPU is happy interleaving compute and graphics kernels, and I suspect they're very similar things in the implementation.