In this threat model, no one tries to pre-verify that the code doesn't do anything bad - indeed, thanks to the halting problem, we know this is impossible to do in general - so the usual approach is to sandbox the JavaScript interpreter itself and ensure it can only access pre-approved resources.
I think a similar approach would be reasonable for LLMs. Trying to teach boundaries to the model itself is always going to be an error-prone cat-and-mouse game. It seems much more practical to me to restrict the model's I/O and treat any model output the way you'd treat untrusted, user-provided input in a conventional system.
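As a rough sketch of what that looks like in practice (the `ALLOWED_ACTIONS` set, the JSON proposal shape, and `handle_model_output` are all hypothetical, not any particular product's API), you validate the model's output exactly as you would a form submission from the internet:

```python
import json

# The only actions the surrounding system will ever perform,
# no matter what the model asks for (hypothetical allowlist).
ALLOWED_ACTIONS = {"search", "summarize"}

def handle_model_output(raw_output: str) -> dict:
    """Treat the model's output like untrusted user input:
    parse defensively, validate against an allowlist, reject everything else."""
    try:
        proposal = json.loads(raw_output)
    except json.JSONDecodeError:
        raise ValueError("model output is not valid JSON")

    if not isinstance(proposal, dict):
        raise ValueError("model output is not a JSON object")

    action = proposal.get("action")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"action {action!r} is not pre-approved")

    # Arguments are still untrusted strings; downstream code has to
    # escape or parameterize them like any user-supplied value.
    return {"action": action, "args": str(proposal.get("args", ""))}

if __name__ == "__main__":
    # A well-formed, allowlisted proposal passes; anything else raises.
    print(handle_model_output('{"action": "search", "args": "llm sandboxing"}'))
```

The point is that nothing about this pipeline trusts the model to police itself; the boundary lives entirely in the code around it.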
The policy above is good advice for a ton of ML models with poorly understood behaviors, like biased image-recognition nets. LLMs are simply harder to trust because their behavior varies so much with their inputs.
Prompt injection is an interesting species of attack, but it doesn't really change the threat surface. Prompt programming isn't reliable enough to provide guarantees in the first place, and outputs can be dangerous with or without injection.
I predict nearly all of the upcoming LLM products will end up as fancy autocomplete: suggestions the user then has to feed into a more constrained system, with some sort of manual confirmation or tweaking along the way.
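A minimal sketch of that confirm-before-apply loop, assuming a hypothetical `apply_change` stand-in for the constrained downstream system:

```python
def apply_change(command: str) -> None:
    # Stand-in for the constrained system that does the real work.
    print(f"executing: {command}")

def suggest_and_confirm(suggestion: str) -> None:
    """The model output is only ever a draft: the user can tweak it,
    and nothing runs without an explicit confirmation."""
    print(f"Model suggests: {suggestion}")
    edited = input("Edit the suggestion (or press Enter to keep it): ").strip()
    final = edited or suggestion
    if input(f"Run {final!r}? [y/N] ").strip().lower() == "y":
        apply_change(final)
    else:
        print("Discarded.")

if __name__ == "__main__":
    # e.g. the model has proposed a shell command as plain text.
    suggest_and_confirm("rm -rf ./build && make all")
```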
And all these hacks will go away in the next point release; the vendors just need to collate them and add them to the training set. There will still be adversarial attacks, though. Those are hard to guard against, but they won't be crafted by hand; we'll need algorithms to find them.