> On one hand you don't want a second channel to send the graphics instructions, but on the other you don't want to use in-band escape sequences

correct. the in-band sequences are dangerous and unwieldy. they don't convey enough information. they are a hack to work within the limitations of historical terminals. that's what this whole thread is about.
a separate graphics channel creates a separate window. then you have two windows. not good either. it needs to be one window, and since that window should be able to support multiple remote connections, it needs to be local; otherwise i would get a new window for each server i connect to. that works for some people, but not for me. and it needs to work through a single channel like ssh/mosh or another similar protocol and be forwardable.
so i want a third option. one approach is sending semantic data, letting the terminal interpret it and display it graphically. this is interesting because shells are already exploring semantic data. (elvish, murex, nushell, others...)
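to make the semantic-data idea a bit more concrete, here is a rough sketch in python. everything in it is an assumption for illustration: the length-prefixed json framing, the "kind"/"payload" fields and the function names are made up, not an existing protocol or anything those shells actually emit. the remote side describes what the data is (here, a table), and the local terminal decides how to draw it.

```python
import json
import struct


def encode_frame(kind, payload):
    """Pack one semantic message as a 4-byte big-endian length plus a JSON body.

    The framing and the "kind"/"payload" fields are hypothetical.
    """
    body = json.dumps({"kind": kind, "payload": payload}).encode("utf-8")
    return struct.pack(">I", len(body)) + body


def decode_frames(buffer):
    """Split received bytes into complete messages.

    Returns (messages, leftover), where leftover holds any incomplete
    trailing frame to be retried once more bytes arrive.
    """
    messages = []
    offset = 0
    while len(buffer) - offset >= 4:
        (length,) = struct.unpack_from(">I", buffer, offset)
        if len(buffer) - offset - 4 < length:
            break  # body not fully received yet
        start = offset + 4
        messages.append(json.loads(buffer[start:start + length].decode("utf-8")))
        offset = start + length
    return messages, buffer[offset:]


def render(message):
    """Terminal-side presentation choice; plain text stands in for real graphics."""
    if message["kind"] == "table":
        columns = message["payload"]["columns"]
        rows = message["payload"]["rows"]
        lines = [" | ".join(columns)]
        lines += [" | ".join(str(cell) for cell in row) for row in rows]
        return "\n".join(lines)
    # Unknown kinds fall back to showing the raw payload.
    return json.dumps(message["payload"])


if __name__ == "__main__":
    frame = encode_frame("table", {"columns": ["name", "size"], "rows": [["a.txt", 12]]})
    messages, leftover = decode_frames(frame)
    for message in messages:
        print(render(message))
```

the point of the sketch is that the bytes travelling over the single ssh-like channel carry meaning rather than drawing instructions, so the same message could be rendered as a widget, as text, or forwarded unchanged.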
plan9 sounds interesting. i see several efforts to port aspects of it to linux. they all seem to have stalled. more work needs to be done here. that's what i am advocating.