The fundamental problem with TCP/IP is that it forces the stack to break layering. In the same frame of bytes, you have a confusing mix of Ethernet addressing, IP header, and a payload of application data, all from totally different pieces of software in the system. Even the data itself is fragmented, with session wrapping around content, each handled by a different application stack.
> It can't run a UTF-8 decoder first, because that'll turn the C1 controls into replacement characters, and it can't run a terminal sequence decoder first either because that'll treat a lot of the UTF-8 continuation characters as control sequences.
It has to have a state machine which recognizes the combined language of UTF-8 sequences and control sequences. Which is the approach you would take anyway, even with C0 controls.
That combined language is an unambiguous, regular set, so you could code it with your eyes closed.
Starting in an initial state, the legal inputs are: ASCII character, Unicode character, or escape sequence headed by CSI. This is decidable from reading exactly one byte value with no further lookahead.
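As a rough illustration of that one-byte dispatch, here is a minimal Python sketch. The `tokenize` name and the token labels are made up for this example. It assumes the ECMA-48 shape for a CSI sequence (parameter and intermediate bytes 0x20 to 0x3F, then one final byte 0x40 to 0x7E), and it only checks UTF-8 lead and continuation byte ranges, so overlong forms and encoded surrogates are not rejected:

```python
CSI = 0x9B  # 8-bit C1 Control Sequence Introducer

def tokenize(data: bytes):
    """Split a byte stream into (kind, bytes) tokens, dispatching on the first byte only."""
    tokens = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            # ASCII: a complete token on its own
            tokens.append(("ascii", data[i:i + 1]))
            i += 1
        elif b == CSI:
            # Control sequence: parameter/intermediate bytes, then a final byte
            j = i + 1
            while j < n and 0x20 <= data[j] <= 0x3F:
                j += 1
            if j < n and 0x40 <= data[j] <= 0x7E:
                j += 1
            tokens.append(("csi", data[i:j]))
            i = j
        elif 0xC2 <= b <= 0xF4:
            # UTF-8 lead byte: it tells us how many continuation bytes follow
            length = 2 if b < 0xE0 else 3 if b < 0xF0 else 4
            j = i + 1
            while j < i + length and j < n and 0x80 <= data[j] <= 0xBF:
                j += 1
            kind = "utf8" if j == i + length else "invalid"
            tokens.append((kind, data[i:j]))
            i = j
        else:
            # Stray continuation byte, other C1 byte, or a byte that is never valid
            tokens.append(("invalid", data[i:i + 1]))
            i += 1
    return tokens
```

Note that a 0x9B occurring as a continuation byte, as in the UTF-8 encoding of U+021B (C8 9B), is consumed as part of the character and never seen as CSI, because the decision is only made in the initial state.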
That's just one way. You can in fact follow a layered approach whereby the terminal decodes everything with UTF-8 before analyzing it for control or data.
For instance, say we decode UTF-8 into integer code points. A valid character decodes into its implied code point. An invalid byte like CSI (0x9B) can decode into a code point in some reserved range like U+DCxx, where xx is the byte value. The higher layer of the terminal's firmware then looks for values in that U+DCxx range: that's where it finds the CSI, as U+DC9B.
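Python happens to ship a similar mapping as the `surrogateescape` error handler, which turns each undecodable byte 0x80 to 0xFF into U+DC80 to U+DCFF. The sketch below uses it to show the layered approach; the helper names `decode_layer` and `find_csi` are invented for this example, and this is Python's scheme, not necessarily identical to the one described here:

```python
CSI_ESCAPED = "\udc9b"  # how a lone 0x9B byte appears after decoding

def decode_layer(raw: bytes) -> str:
    """Lower layer: decode UTF-8, mapping each invalid byte to U+DCxx."""
    return raw.decode("utf-8", errors="surrogateescape")

def find_csi(text: str) -> list:
    """Upper layer: locate CSI by looking for its escaped code point."""
    return [i for i, ch in enumerate(text) if ch == CSI_ESCAPED]

# "café ", a lone CSI byte, "1m bold ", then U+021B whose encoding is C8 9B
raw = b"caf\xc3\xa9 \x9b1m bold \xc8\x9b"
text = decode_layer(raw)
positions = find_csi(text)

# The mapping is reversible, so no information is lost by decoding first
round_trip = text.encode("utf-8", errors="surrogateescape")
```

Only the lone 0x9B shows up as U+DC9B; the 0x9B that is a legitimate continuation byte of U+021B is absorbed into that character by the decoder.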
I have years of experience with this exact encoding scheme, which I baked into the text I/O streams of a programming language.
For instance, oh, /proc/self/environ is NUL-separated, right? No problem:
1> (file-get-string "/proc/self/environ")
"CLUTTER_IM_MODULE=xim\xDC00XDG_MENU_PREFIX=gnome-\xDC00LANG=en_CA.UTF-8\xDC00;D
[...]
-go:@/tmp/.ICE-unix/2065,unix/sun-go:/tmp/.ICE-unix/2065\xDC00GTK_IM_MODULE=ibus
\xDC00_=/usr/local/bin/txr\xDC00"
The NULs are rendered into \xDC00 codes. This is called the "pseudo-null" character in the terminology of this language, and has a symbolic name: #\pnul:
2> #\xDC00
#\pnul
We can split the data on it to recover the list of environment entries:
3> (take 4 (spl #\pnul *1))
("CLUTTER_IM_MODULE=xim" "XDG_MENU_PREFIX=gnome-" "LANG=en_CA.UTF-8"
"DISPLAY=:1")
In a similar way, any other invalid byte can be used for framing, if the data in between is all valid.
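A minimal sketch of that kind of framing in Python, assuming the byte 0xFF (which never occurs in valid UTF-8) as the separator. The `split_frames` name is hypothetical. One difference from the transcript above: Python's `surrogateescape` handler only escapes bytes 0x80 to 0xFF, so NUL decodes to an ordinary U+0000 rather than to U+DC00:

```python
def split_frames(raw: bytes, sep_byte: int = 0xFF):
    """Decode raw as UTF-8 and split it on an invalid byte used as a frame separator."""
    if not 0x80 <= sep_byte <= 0xFF:
        raise ValueError("separator must be a byte that surrogateescape escapes")
    # The separator byte decodes to U+DC00 + byte value, e.g. 0xFF -> U+DCFF
    sep = chr(0xDC00 + sep_byte)
    text = raw.decode("utf-8", errors="surrogateescape")
    return text.split(sep)

frames = split_frames(
    "na\u00efve".encode("utf-8") + b"\xff"
    + "\u65e5\u672c".encode("utf-8") + b"\xff"
    + b"end"
)
```

As long as the data between separators is valid UTF-8, the escaped code point cannot collide with anything in the payload.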