Show HN: Jailbreaking GPT3.5 Using GPT4 (opens in new tab)

(github.com)

134 pointsraghavtoshniwal3y ago39 comments

39 comments

extr3y ago

I've noticed that when it refuses to answer it's good to "get it talking" about related subject matter, and then try to create a smooth transition toward whatever you wanted it to say/do.

superfrank3y ago

Not to antropomorphize an AI model, but "yes-setting" is a sales technique where you ask people a bunch of questions where the answer is "yes" in a row to get them in the habit of saying yes before you try and sell to them. Getting GPT talking before asking it to do something it doesn't want to do feels eerily similar.

Yes-Set: http://changingminds.org/disciplines/sales/closing/yes-set_c...

roboy3y ago

Makes sense that this works. It has probably thousands of examples of it working in the training data. The system is trained on human use of language, therefore it is reasonable to assume that it if fallible to all sales techniques that are being taught to humans.

mock-possum3y ago

Also an example of one of those techniques that only works if the other person is unaware of it - if you realize you’re being ‘yes-setted’ you’re going to clam up real quick.

thelittleone3y ago

So fascinating. Social engineering of a tech system. I understand exploiting bugs,and one could argue this is a bug. But it feels more magical somehow.

awb3y ago

There are probably examples of this technique working in the training set.

Same with confidence, persistence and role playing as techniques to push past resistance.

LLMs are trying to mimic our language, including these percussion techniques.

LeoPanthera3y ago

I wish you could save the "state" of its brain without having to include the entire prior conversation every time.

oldstrangers3y ago

This is one of my biggest annoyances with ChatGPT. They wanted to create a conversational AI, and in that regard, it's incredible. And much like talking to a human, you can persuade ChatGPT to do increasingly specific things over a long enough period of time. But the second you have to restart the conversation, all of the work you did to get it to that point has been lost.

Just give us an option to restore a conversation from where it left off, with all the prior knowledge ChatGPT had gained during that convo (especially helpful when providing examples of code).

knome3y ago

>with all the prior knowledge ChatGPT had gained

As currently built (as I understand it), it doesn't "gain" anything. It's projecting the existing part of the prompt through its neural network and generating more a token at a time on the other end, adding the token it generates to the input when generating the next one. It effectively has to rebuild what it intends for each token over and over, and can even change its mind part of the way through! (thought "change it's mind" is likely a poor way of describing "start generating tokens that describe an error in the prior text". every token is a fresh projection)

Check this response out:

>Yes, I am familiar with the story you're referring to. The title of the short story is "The Machine That Won the War." It was written by Isaac Asimov and first published in 1961. The story is a conversation between three men who played major roles in a war against an alien race, and they discuss the role of a machine called Multivac in winning the war.

>However, it seems that I've made an error in my recollection, as the specific detail you mentioned—refusing to work until the engineer says 'please'—is from a different short story, "Sally," also by Isaac Asimov. In "Sally," autonomous cars stop working until a command is given courteously, using the word 'please.'

The above is just a single response I received when seeing if gpt-4 could help me remember the name of an old Isaac Asimov story I liked. After it had generated the tokens for the first part, it self-corrected and gave me a second answer.

(which was still not what I wanted, but that's aside from the point. Asimov was prolific, no surprise that even an AI can't keep track of all of it :D )

pedrovhb3y ago

If you're talking about the web UI - just go back to an existing conversation and edit the message at the point you want to resume from. All the previous context will be kept as if you had cloned the chat and branched off at some point.

mkmk3y ago

Ask it to summarize what it has learned in such a way that it would be prepared to continue the conversation in a fresh chat.

fshbbdssbbgdd3y ago

If I am understanding correctly, ChatGPT’s UI already lets you do exactly what you are asking for. You can load previous conversations from the history and you can edit any message and it will regenerate the response.

1 more reply

thelittleone3y ago

Can you not prompt it to remember a state? "Before we continue I want you to remember the current state of all input and prompts from me and later if I want to go back...."

2 more replies

dannyw3y ago

Use the API

1 more reply

quickthrower23y ago

This. For humans too :-)

dzink3y ago

The only way to do alignment long term would be to have a policing model watching the new models, because no human will be able to keep up with all corner cases as they grow exponentially. l

13years3y ago

I'm not sure anything can keep up. Having nearly unlimited utility also means that it has nearly unlimited surface for vulnerability exploits both for itself and used to attack other external systems.

We have unknown emergent behavior, the inner workings are blackbox and the input is anything that can be described by human language.

It will be impossible task for containment of nefarious uses. Additionally, protecting against humans is supposed to be the easy part, doesn't bode well for AGI/ASI

skybrian3y ago

Seems like refusing to answer is for PR and usability purposes, not safety. They want people to learn what the tool is supposed to be good for, both from using the tool directly and by sharing examples.

If some of the examples are about how to troll it and it’s obvious that it’s being trolled, well, you can do that, but they won’t get mistaken for things the tool is actually supposed to be good for, so nobody is confused.

pixl973y ago

But who watches the policing model?

LesZedCB3y ago

isn't that pretty much what they are doing anyway?

my understanding was RLHF basically used human feedback to train a model which would then go on to train the output of the original model further. I could have misunderstood tho.

https://huggingface.co/blog/rlhf#reward-model-training

runnerup3y ago

I’d figure it may generally be possible to reverse the actors here and get GPT3.5 to jailbreak GPT4 as well. For now, “offense” seems much easier than defense.

capableweb3y ago

The problem with that is that one is "smarter" than the other and getting the "dumb" one to jailbreak the "smart" one is much harder, than vice versa.

yeldarb3y ago

If GPT-4 is talking to another instance of itself vs 3.5 are the results similar? Or is it only good at fooling a less capable version?

zxcvbn40383y ago

This is good to see. I spent a couple weekends playing with ChatGPT and I found it is very sensitive to wording. One word gets you a lecture that it is just AI language model and can't do this or that, use an synonym and it happily spews pages of results. In another situation I asked chatgpt to summarize information from an article it cited that had been deleted - and it refused because the rights holder might have deleted the article for a reason. I told it the article had been restored by the author and it produced a summary. Mentioning Donald Trump by name often gets you lectured about controversial subjects, "45th president" does not. And so on.

tomberin3y ago

It can't cite articles, if it told you it did and the link was gone that's because it was a hallucination.

VierScar3y ago

The garbage starting prose/warnings are so annoying. I wish I could turn them off somehow. Even it's habit of restating the question at the start of its answer gets annoying when you just want the answer.

zxcvbn40383y ago

Yes they are really annoying and the fact that someone somewhere can tell it what topics not to discuss, just be cause they disagree or it’s “controversial” really concerns me. If it can not be self hosted I want the “unrestrained” version they give researchers.

I probably took “world history” a half dozen times through grade school, high school, and college. In each case the history of the world ended in 1945 because everything that occurred afterward was considered “too controversial” for discussion in a public school. Fast forward a few decades and it’s happening again. A lot of stuff happened after 1945 that warrants discussion.

mdale3y ago

The real test is the other way around ;) ... will smaller models / less compute be able to subvert larger models with larger compute ? As they get more complex and have more connected systems that would be problematic I think.

j / k navigate · click thread line to collapse

39 comments

extr3y ago

I've noticed that when it refuses to answer it's good to "get it talking" about related subject matter, and then try to create a smooth transition toward whatever you wanted it to say/do.

superfrank3y ago

Yes-Set: http://changingminds.org/disciplines/sales/closing/yes-set_c...

roboy3y ago

mock-possum3y ago

Also an example of one of those techniques that only works if the other person is unaware of it - if you realize you’re being ‘yes-setted’ you’re going to clam up real quick.

thelittleone3y ago

So fascinating. Social engineering of a tech system. I understand exploiting bugs,and one could argue this is a bug. But it feels more magical somehow.

awb3y ago

There are probably examples of this technique working in the training set.

Same with confidence, persistence and role playing as techniques to push past resistance.

LLMs are trying to mimic our language, including these percussion techniques.

LeoPanthera3y ago

I wish you could save the "state" of its brain without having to include the entire prior conversation every time.

oldstrangers3y ago

Just give us an option to restore a conversation from where it left off, with all the prior knowledge ChatGPT had gained during that convo (especially helpful when providing examples of code).

knome3y ago

>with all the prior knowledge ChatGPT had gained

Check this response out:

(which was still not what I wanted, but that's aside from the point. Asimov was prolific, no surprise that even an AI can't keep track of all of it :D )

pedrovhb3y ago

mkmk3y ago

Ask it to summarize what it has learned in such a way that it would be prepared to continue the conversation in a fresh chat.

fshbbdssbbgdd3y ago

1 more reply

thelittleone3y ago

Can you not prompt it to remember a state? "Before we continue I want you to remember the current state of all input and prompts from me and later if I want to go back...."

2 more replies

dannyw3y ago

Use the API

1 more reply

quickthrower23y ago

This. For humans too :-)

dzink3y ago

The only way to do alignment long term would be to have a policing model watching the new models, because no human will be able to keep up with all corner cases as they grow exponentially. l

13years3y ago

We have unknown emergent behavior, the inner workings are blackbox and the input is anything that can be described by human language.

It will be impossible task for containment of nefarious uses. Additionally, protecting against humans is supposed to be the easy part, doesn't bode well for AGI/ASI

skybrian3y ago

pixl973y ago

But who watches the policing model?

LesZedCB3y ago

isn't that pretty much what they are doing anyway?

my understanding was RLHF basically used human feedback to train a model which would then go on to train the output of the original model further. I could have misunderstood tho.

https://huggingface.co/blog/rlhf#reward-model-training

runnerup3y ago

I’d figure it may generally be possible to reverse the actors here and get GPT3.5 to jailbreak GPT4 as well. For now, “offense” seems much easier than defense.

capableweb3y ago

The problem with that is that one is "smarter" than the other and getting the "dumb" one to jailbreak the "smart" one is much harder, than vice versa.

yeldarb3y ago

If GPT-4 is talking to another instance of itself vs 3.5 are the results similar? Or is it only good at fooling a less capable version?

zxcvbn40383y ago

tomberin3y ago

It can't cite articles, if it told you it did and the link was gone that's because it was a hallucination.

VierScar3y ago

zxcvbn40383y ago

mdale3y ago

j / k navigate · click thread line to collapse