In this threat model, no one tries to pre-verify that the code doesn't do anything bad - indeed, thanks to the halting problem, we know this is impossible to do in general - so the usual approach is to sandbox the JavaScript interpreter itself and ensure it can only access pre-approved resources.
I think a similar approach would be reasonable for LLMs. Trying to teach boundaries to the model itself is always going to be an error-prone cat-and-mouse game. It seems much more practical to me to restrict the model's I/O and treat any model output the way you'd treat untrusted, user-provided input in a conventional system.
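As a rough sketch of what that looks like in practice (the `ALLOWED_ACTIONS` set, the JSON proposal shape, and `handle_model_output` are all hypothetical, not any particular product's API), you validate the model's output exactly as you would a form submission from the internet:

```python
import json

# The only actions the surrounding system will ever perform,
# no matter what the model asks for (hypothetical allowlist).
ALLOWED_ACTIONS = {"search", "summarize"}

def handle_model_output(raw_output: str) -> dict:
    """Treat the model's output like untrusted user input:
    parse defensively, validate against an allowlist, reject everything else."""
    try:
        proposal = json.loads(raw_output)
    except json.JSONDecodeError:
        raise ValueError("model output is not valid JSON")

    if not isinstance(proposal, dict):
        raise ValueError("model output is not a JSON object")

    action = proposal.get("action")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"action {action!r} is not pre-approved")

    # Arguments are still untrusted strings; downstream code has to
    # escape or parameterize them like any user-supplied value.
    return {"action": action, "args": str(proposal.get("args", ""))}

if __name__ == "__main__":
    # A well-formed, allowlisted proposal passes; anything else raises.
    print(handle_model_output('{"action": "search", "args": "llm sandboxing"}'))
```

The point is that nothing about this pipeline trusts the model to police itself; the boundary lives entirely in the code around it.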
The policy above is good advice for a ton of ML models with poorly understood behaviors, like biased image-recognition nets. LLMs are simply harder to trust because their behavior varies so much with their inputs.
Prompt injection is an interesting species of attack, but it doesn't really change the threat surface. Prompt programming isn't reliable enough to provide guarantees in the first place, and outputs can be dangerous with or without injection.
I predict nearly all of the upcoming LLM products will end up as fancy autocomplete: suggestions the user then has to feed into a more constrained system, with some sort of manual confirmation or tweaking along the way.
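A minimal sketch of that confirm-before-apply loop, assuming a hypothetical `apply_change` stand-in for the constrained downstream system:

```python
def apply_change(command: str) -> None:
    # Stand-in for the constrained system that does the real work.
    print(f"executing: {command}")

def suggest_and_confirm(suggestion: str) -> None:
    """The model output is only ever a draft: the user can tweak it,
    and nothing runs without an explicit confirmation."""
    print(f"Model suggests: {suggestion}")
    edited = input("Edit the suggestion (or press Enter to keep it): ").strip()
    final = edited or suggestion
    if input(f"Run {final!r}? [y/N] ").strip().lower() == "y":
        apply_change(final)
    else:
        print("Discarded.")

if __name__ == "__main__":
    # e.g. the model has proposed a shell command as plain text.
    suggest_and_confirm("rm -rf ./build && make all")
```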
And all these hacks will go away in the next point release; the vendors just need to collate them and add them to the training set. There will still be adversarial attacks, though. Those are hard to guard against, but they won't be crafted by hand; we'll need algorithms to find them.