The same thing would work for LLMs: the attack in the blog post above would easily break if curling the Anthropic endpoint required approval.
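A rough sketch of what such an approval gate could look like, assuming the agent's network tool goes through a single wrapper. The names `PRE_APPROVED_HOSTS`, `needs_approval`, `gated_fetch`, and the `fetch`/`ask_user` callbacks are all made up for illustration, not any real agent framework's API:

```python
from urllib.parse import urlparse

# Hypothetical allowlist: hosts the agent may contact without asking.
PRE_APPROVED_HOSTS = {"pypi.org", "github.com"}


def needs_approval(url, approved_hosts=PRE_APPROVED_HOSTS):
    """Return True if a request to `url` should pause for human approval."""
    host = urlparse(url).hostname or ""
    return host.lower() not in approved_hosts


def gated_fetch(url, fetch, ask_user):
    """Call `fetch(url)` only if the host is pre-approved or the user agrees.

    `fetch` and `ask_user` are caller-supplied callables (assumptions here):
    `fetch(url)` performs the request, `ask_user(url)` returns True or False.
    """
    if needs_approval(url) and not ask_user(url):
        raise PermissionError(f"request to {url} was not approved")
    return fetch(url)
```

With this in place, something like `gated_fetch("https://api.anthropic.com/...", fetch, ask_user)` (URL only as an example) would stop and ask before any data leaves, since that host is not on the allowlist.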
Effectively, system instructions and server-side prompts are red, whereas user input is normal text.
It would have to be trained from scratch on a meticulous corpus which never crosses the line. I wonder if the resulting model would be easier to guide and less susceptible to prompt injection.
You could just include a single extra bit with each token that marks it as trusted or untrusted, then add an extra RL pass to enforce it.
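To make the data side of that concrete, here is a minimal sketch of carrying one trust bit alongside each token. `TaggedToken`, `tag_segments`, and `trust_mask` are hypothetical names, the token ids are arbitrary, and the RL pass that would teach a model to respect the bit is not shown:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedToken:
    token_id: int
    trusted: bool  # the extra bit: True for system/server-side text


def tag_segments(segments):
    """Flatten (token_ids, trusted) segments into one tagged sequence."""
    return [TaggedToken(t, trusted) for ids, trusted in segments for t in ids]


def trust_mask(tokens):
    """Return the per-token bit as a 0/1 list, e.g. as an extra model input."""
    return [1 if t.trusted else 0 for t in tokens]


# Example: three system-prompt tokens followed by two user tokens.
system_ids = [101, 7, 8]
user_ids = [55, 56]
tagged = tag_segments([(system_ids, True), (user_ids, False)])
print(trust_mask(tagged))  # [1, 1, 1, 0, 0]
```

The point of the sketch is that the bit is set by whoever assembles the sequence, based on where the text came from, and is never derived from the token content itself, so nothing typed by the user can flip it.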