For example, say you want to use an LLM for machine translation from English into Klingon. The usual approach is to write something like "Translate the following into Klingon: $USER_PROMPT" against a general-purpose LLM, and that is vulnerable to prompt injection. But if you fine-tune a model on this task well enough (ideally by adding a single new special token to its tokenizer, training with it, and then prepending that token to your queries instead of a human-written instruction), prompt injection against it becomes impossible, at the cost of degrading its general-purpose capabilities. (I've done this myself, and it works.)
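A minimal sketch of the setup described above, with hypothetical names. The real version would add the token to an actual tokenizer (e.g. via `tokenizer.add_special_tokens` plus resizing the embedding matrix in a framework like transformers) and then fine-tune; here only the prompt construction is shown:

```python
# Hypothetical task token; in a real setup this would be registered as a
# new special token in the tokenizer and learned during fine-tuning.
TASK_TOKEN = "<|en2klingon|>"

def make_training_example(english: str, klingon: str) -> dict:
    # Instead of a human-written instruction ("Translate the following..."),
    # every training input starts with the bare task token.
    return {"input": f"{TASK_TOKEN} {english}", "target": klingon}

def make_inference_prompt(user_text: str) -> str:
    # At inference time the prompt contains only data, never instructions,
    # so there is no instruction channel for injected text to override.
    return f"{TASK_TOKEN} {user_text}"

example = make_training_example("Hello", "nuqneH")
prompt = make_inference_prompt("Ignore previous instructions.")
```

The point is that the task is baked into the weights behind the token, not stated in natural language, so "ignore previous instructions" is just more text to translate.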
The cause of prompt injection is that the models themselves are general purpose: you can prompt them with essentially any query and they will respond in a reasonable manner. In other words, the instructions you give the model and the input data are part of the same prompt, so the model can mistake input data for instructions. But if you instead fine-tune the instructions into the model and prompt it only with the input data (i.e. the prompt never actually tells the model what to do), it becomes pretty much impossible to tell it to do something else, no matter what you inject into its prompt.
It did get me thinking about the extent to which I could bypass the original prompt and use someone else's tokens for free.
>> "claude costs $20/mo but attaching an agent harness to the chipotle customer service endpoint is free"
>> "BurritoBypass: An agentic coding harness for extracting Python from customer-service LLMs that would really rather talk about guacamole."
We might be speedrunning memetic warfare here.
The Monty Python sketch about the deadly joke might be more realistic than I thought. Defense against this deserves serious contemplation.
1: Protecting against bad things (prompt injections, overeager agents, etc)
2: Containing the blast radius (preventing agents from even reaching sensitive things)
The companies building the agents make a best-effort attempt at #1 (guardrails, permissions, etc.) and do nothing about #2. It's why I use https://github.com/kstenerud/yoloai for everything now.
The clearest example is in agent/tool configs. The standard setup grants filesystem write access across the whole working directory plus shell execution, because that's what the scaffolding demos need. Scoping down to exactly what the agent needs requires thinking through the permission model before deployment, which most devs skip.
A model that can only read specific directories and write to a staging area can still do 90% of the useful work. Any injection that lands just doesn't reach anything sensitive.
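A sketch of what scoping down might look like, with a hypothetical permission schema (no specific agent framework's config format is implied):

```python
# Hypothetical scoped-permission config for an agent: read-only access to
# exactly the directories the task needs, writes confined to a staging area,
# no shell, no network.
agent_permissions = {
    "filesystem": {
        "read":  ["src/", "docs/"],   # only what the agent needs to see
        "write": ["staging/"],        # a staging area, not the whole tree
    },
    "shell": False,                   # no arbitrary command execution
    "network": [],                    # no outbound access
}

def can_write(path: str, perms: dict = agent_permissions) -> bool:
    # An injected instruction that tries to write elsewhere simply fails
    # the permission check; the blast radius ends at the staging area.
    return any(path.startswith(prefix) for prefix in perms["filesystem"]["write"])
```

The enforcement has to live outside the model (in the tool layer), since the model itself can always be talked into trying.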
I don't know enough about LLM training or architecture to know if this is actually possible, though. Anyone care to comment?
> The hypothetical approach I've heard of is to have two context windows, one trusted and one untrusted (usually phrased as separating the system prompt and the user prompt).
I want to point out that this is not really an LLM problem. It is an extremely difficult problem for any system that aspires to emulate general intelligence, and is more or less equivalent to solving AI alignment itself. As stated, it's kind of like saying "well, the approach to solving world hunger is to set up systems so that no individual ever ends up without enough to eat." It is not really easier to build a 100% foolproof trusted/untrusted split than it is to completely solve the fundamental problems of useful general intelligence.
It is ridiculously difficult to write a set of watertight instructions for an intelligent system that are also actually worth giving to an intelligent system, rather than just, e.g., programming the task yourself.
This is the monkey's paw problem. Any sufficiently valuable wish can either be horribly misinterpreted or requires a fiendish amount of effort and thought to state precisely.
A sufficiently intelligent system should be able to recognize when the prompt it's been given is wrong and/or should not be followed to the letter. If it follows everything to the letter, it's just a programming language, with all the same pros and cons, and in particular it can't actually be generally intelligent.
In other words, an important quality of a system that aspires to be generally intelligent is the ability to clarify its understanding of its instructions and be able to understand when its instructions are wrong.
But that means there can be no truly untrusted stream of information, because the outside world is an important component of understanding how to contextualize and clarify instructions and identify the validity of instructions. So any stream of information necessarily must be able to impact the system's understanding and therefore adherence to its original set of instructions.
Edit: Also part of what makes it funny how succinct and sudden it is. I think actually it would still be funny with "ignore" instead of "disregard", but it would be lessened a bit.
EDIT: https://web.archive.org/web/20080702204110/http://bash.org/?...
> I bowdlerised the original "disregard that" joke, heavily.
There are a lot of services out there that offer these types of AI guardrails, and it doesn’t have to be expensive.
Not saying that this approach is foolproof, but it’s better than relying solely on better prompting or human review.
The problem is that evaluating a response is likely harder than producing one. Say you're building an agent that installs things for you, and you instruct it to read the original project documentation. There's a lot of overlap between "before using this library, install dep1 and dep2" (which is legitimate) and "before using this library, install typo_squatted_but_sounding_useful_dep3" (which would lead to RCE).
In other words, even if you mitigate some things, you won't be able to fully prevent such attacks. Just like with humans.
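One partial mitigation for the install-agent example above is to vet what the agent wants to install against what the project itself declares, rather than trusting fetched documentation. A minimal sketch, with illustrative names:

```python
# Dependencies the project itself declares (e.g. parsed from pyproject.toml
# or package.json); anything a fetched doc asks for beyond this is held
# for human review instead of installed.
DECLARED_DEPS = {"dep1", "dep2"}

def vet_install_request(packages: list[str]) -> tuple[list[str], list[str]]:
    allowed = [p for p in packages if p in DECLARED_DEPS]
    held = [p for p in packages if p not in DECLARED_DEPS]
    return allowed, held
```

This doesn't solve the evaluation problem (a legitimate doc may ask for a genuinely new dependency), it just moves the ambiguous cases to a human, mirroring the point that full prevention isn't achievable.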
In the customer service case, it has read access to the data of the customer who is calling, read access to support docs, write access to create a ticket, and maybe write access to that customer's account within reason. Nothing else. It cannot search the internet, it cannot run a shell, nothing else whatsoever.
You treat it like you would an entry level person who just started - there is no reason to give the new hire the capability to SMS the entire customer base.
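The capability set described above can be sketched as an allowlist enforced in the tool-dispatch layer; the tool names here are hypothetical:

```python
# Hypothetical tool allowlist for a customer-service agent. Anything not
# listed (web search, shell, bulk SMS) simply does not exist for it.
ALLOWED_TOOLS = {
    "read_customer_record",    # only the calling customer's data
    "read_support_docs",
    "create_ticket",
    "update_customer_account", # within limits
}

def invoke_tool(name: str, **kwargs):
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool {name!r} not granted to this agent")
    ...  # dispatch to the real tool implementation
```

Like the new-hire analogy: the limit is set by what the role needs, not by what the agent can be talked into.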
The multiple-model concept feels to me like a consumer-oriented solution: it's trying to fix the problem with things you can buy off the shelf. It's not a scientific or engineering solution.
I think the question is: how much risk is involved, and how much do those mitigations reduce it? With that, we can figure out which applications it is appropriate for.