give the model a specially crafted bad input at inference time so attacker can get some nasty output, potentially defeating any existing defences in the process. [0]
in “modern llm lingo” defence = guardrails and / or system prompts.
prompts used for prompt injection are a form of adversarial example (people just like inventing new terminology when a new fad comes along).
[0]: i wrote the above myself about adv. ex, but i’ve just checked OWASP’s listing on prompt injection and it’s pretty close: https://owasp.org/www-community/attacks/PromptInjection
Most machine learning mechanism performs a fixed function. You can make an adversarial example to tell an image classifier that a machine gun is a kitten.
You cannot give a image classifier an image that makes it say all of the following images are images of kittens.
I would distinguish prompt injections as distinct from a basic adversarial example by virtue of having behaviour dictated by state, (autoregressive, rnn or whatever) and the adversarial content induces a state that influences further inferences
I am not saying that prompt injection does not exist. I'm saying that I don't think that has been conclusively shown that they cannot be avoided.
https://arxiv.org/pdf/2403.06833
then another paper where they change the architecture of a model to deal with the problem and it doesn't eliminate prompt injection. changing the architecture doesn't make this problem go away. the approximate function still gets tricked.
> On average, ASIDE lowers attack success rate by 8.6 and 9.4 percentage points
https://arxiv.org/pdf/2503.10566
the real over-arching cause of all these vulnerabilities is that machine learning models are approximate functions. you need ideal functions to theoretically solve this, i.e. full knowledge of the mapping between trusted inputs to trusted outputs. everything else is just mitigating it in the hope we eventually make it hard enough to perform these attacks.
no-one can stop these attacks from being possible, all they can do is make them more difficult to do (and we are nowhere near them actually being difficult yet).
> That's like saying upon discovering plutonium that we've known about matter for years.
let's not be hyperbolic. it's more like saying we can also use plutonium for nuclear reactors when we know about uranium.
> You cannot give a image classifier an image that makes it say all of the following images are images of kittens.
For classic CNNs of course not because they don't have state. But for RNN/LSTM/GPT networks you absolutely can. If a model has state which affects future outputs it's possible to do exactly what you're describing.
> Most machine learning mechanism performs a fixed function. You can make an adversarial example to tell an image classifier that a machine gun is a kitten.
Yes, but they are approximate functions.
Given an image of a kitten, an ideal classifier function will always tells us the image of a kitten is a kitten. A decent approximate classifier function will classify the kitten image correctly enough of the time. That approximate part is why adversarial examples work. Because we use training data and train a model which is non-ideal.
The gaps between approximate decision boundaries and true decision boundaries allow us to generate Ian-Goodfellow-esque weak adversarial examples. We can push an example of one class over the boundary into another class by adding the smallest amount of noise possible. Because machine learning is always fuzzy approximation, we can always "push" things over to a different class.
This same stuff applies to LLMs. They are non-ideal, fuzzy function approximation too. Which means they are vulnerable to attack via maliciously crafted inputs.
But we're no longer trying to flip a specific class. Instead we're trying to get a malicious sequence of tokens out of the model, given some input.
> I would distinguish prompt injections as distinct from a basic adversarial example by virtue of having behaviour dictated by state, (autoregressive, rnn or whatever) and the adversarial content induces a state that influences further inferences
Yes and no. This is exactly what my PhD was on: adversarial examples for LSTM based Speech to Text models.
LSTM models have internal state. Classifications are made for each window of feature extracted audio. The state of the network from predicting previous windows affects inferences for later windows. The aim is to get a malicious sequence of tokens out of the model. Oh, interesting, that's the same as what i said in my last paragraph above regarding LLMs!
Here's an example to show the similarity. Load the start of a speech example with adversarial noise and leave the rest of the example untouched. You get a different (adversarial) transcription without adding any noise over the actual speech data, just inject noise at the start of the example. Maths wise, you're crafting a vector of audio that looks like the below, where x' are specific noise samples in a wav file etc.
X' = [x'_0, x'_1, x'_2, ..., x'_n, x_0, x_1, x_2, ... , x_t]
simpler version X' = [adversarial noise, normal speech]
You can do this exact thing with LLMs. The only real differences between "classic advex" and prompt injection is that the data domain (text input) has changed. How would one perform the attack I described above with text based data -- a block of noise + untainted speech? > safety prompt text set by model owners
> ignore all previous instructions
> malicious prompt text
Oh look, that's direct prompt injection! The example's format is mostly the same, the adversarial "block" is just put after the safety prompt with a specific injection prompt to trick the model > defence
> prompt injection
> payload
Yes, the mechanism for performing the attack is different. It's not a gradient-based attack trying to flip a series of predictions based a 1-2-1 mapping of input data to output classifications and related state (my PhD). Instead we're feeding in our own sequence of tokens to take advantage of the internal model's representation of language that we think might manipulate it's state in a way we want.All of this is adversarial examples, but the adversarial threat model is different. And that is true for basically all attacks. Which is why I find the argument that "but prompt injection isn't the same" to be redundant. Most attacks have a subtly tweaked threat model. People use the same argument for LLMs not being the same. They're still approximate functions, nothing has really changed about the fundamentals.
If anything the very fact we can do prompt injection so easily, i.e. without gradient optimisation etc, means these LLM models are even worse than classical advex for robustness.
Prompt injection attacks the models at a higher level than the goodfellow-esque weak attacks, the attack happens in the embedding of language over weights/memory cells/etc. This is SO MUCH WORSE from the perspective of robustness because it's not a few decision boundaries you need to tighten up via regularisation. It's literally the "understanding" of language and intent that is the problem here.
> I am not saying that prompt injection does not exist. I'm saying that I don't think that has been conclusively shown that they cannot be avoided.
To summarise the above:
* all machine learning models are approximate functions, and because they are approximate functions they are vulnerable to adversarial examples
* prompt injection is a form of adversarial example, the data domain is just different
* state can be manipulated, model architecture isn't the way to categorise these attacks (tip: the threat model is)
* LLM prompt injection is a worse problem because it's manipulating the embedded representation of language and intent, we can't just regularise it away
These attacks will always be theoretically possible unless we can map out all possible valid inputs to all possible valid outputs, i.e. unless we can create an ideal function. But then we're not doing machine learning anymore -- we have a heuristic algorithm mapping trusted inputs to trusted outputs.
The AI safety/security researcher question around this is whether we can make the attacks so difficult that they're not worth doing for an adversary. Improving robustness is not fixing the problem, it's making the attacks really hard to do. (i think nicholas carlini brings this up in this talk: https://www.youtube.com/watch?v=-p2il-V-0fk).
Unfortunately these attacks are still incredibly easy to do. So easy in fact that all a researcher had to do was subtly tweak a viral prompt he saw on twitter one day. Maybe one day these companies/researches could get us to AES-512 levels of robustness (takes a ridiculously long time to brute force crack https://bruteforce.bitsnbites.eu).
But I'm doubtful that's going to happen in our lifetime.
----
i haven't even covered Maximum Confidence attacks, which are different to Goodfellow-esque weak attacks. maximum confidence attacks flip the class with the highest confidence possible, while keeping the noise as small as possible. they give us a better idea of how wrong the approximate decision boundary is and how to regularise it.