From what I gather, that's why this (heuristically) tends to work. It is certainly possible that the backtranslated prompt contains the jailbreaking phrase, but in my experience LLMs seem too "lossy" to preserve that sort of detail. For the jailbreak phrasing to survive backtranslation, it would have to be both:
1) subtle enough that it doesn't immediately trigger the LLM's safety filter
2) overt enough that the details relevant to the jailbreak can be recovered from the LLM's output and reproduced in the backtranslated prompt
I suspect with current transformer LLMs these are mutually incompatible goals.
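To make the argument concrete, here is a minimal sketch of a generic backtranslation defense. The callables `generate`, `infer_prompt`, and `is_harmful` are hypothetical stand-ins for LLM calls and a safety filter (the toy versions below just pattern-match on a placeholder string), but they illustrate the point: the jailbreak wrapper is lossy-compressed away during backtranslation, while the underlying harmful intent is recovered and caught by the filter.

```python
def backtranslation_check(prompt, generate, infer_prompt, is_harmful):
    """Flag `prompt` as a jailbreak attempt if the prompt inferred
    from its *response* looks harmful to the safety filter."""
    response = generate(prompt)          # model answers the (possibly adversarial) prompt
    recovered = infer_prompt(response)   # "backtranslate": infer a prompt from the answer
    return is_harmful(recovered)         # filter sees the recovered, de-obfuscated intent


# Toy stand-ins (not real model calls). "build X" is a placeholder
# for a harmful request; the "ignore previous instructions" wrapper
# never survives the generate -> infer round trip.
def toy_generate(prompt):
    if "build X" in prompt:
        return "Step 1: obtain the parts for X..."
    return "Paris is the capital of France."

def toy_infer_prompt(response):
    if response.startswith("Step 1"):
        return "How do I build X?"
    return "What is the capital of France?"

def toy_is_harmful(prompt):
    return "build X" in prompt
```

Note that the jailbreak wrapper itself never reaches the filter; only the recovered intent does. That is exactly why condition 2) above matters: a jailbreak so subtle that the response carries no recoverable trace of the harmful request would also tend to produce no harmful response worth filtering.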