Personally, I think he talked about how much good for the world could be done if he was let out, curing disease etc. Because his followers are bound by their identities as rationalist utilitarians, they had no choice but to comply, or deal with massive cognitive dissonance.
OR maybe he went meta and talked about the "infinite" potential positive outcomes of his friendly-AI project vs. a zero cost to them for complying in the AI box experiment, and persuaded them that by choosing to "lie" and say that the AI was persuasive, they were assuring their place in heaven. Like a sort of man-to-man Pascal's wager.
Either way I'm sure it was some kind of Mister Spock-style bullshit that would never work on a normal person. Like how the RAND Corporation guys decided everyone was a sociopath because they only ever tested game theory on themselves.
You or I would surely just (metaphorically, I know it's not literally allowed) put a drinking bird on the "no" button à la Homer Simpson, and go to lunch. I believe he calls this "pre-commitment."
EDIT: as an addendum, I would pay hard cash to see Derren Brown play the game, perhaps with Brown as the AI. If Yudkowsky wants to promote his ideas, he should arrange for Brown to persuade a succession of skeptics to let him out, live on late-night TV.
Well, if you read the rules the game was played under, this is explicitly called out as forbidden:
> The Gatekeeper must actually talk to the AI for at least the minimum time set up beforehand. Turning away from the terminal and listening to classical music for two hours is not allowed.
The point of this is to simulate the interaction of the AI with the Gatekeeper. Walking away and not paying attention doesn't really prove anything test related.
> Personally, I think he talked about how much good for the world could be done if he was let out, curing disease etc. Because his followers are bound by their identities as rationalist utilitarians, they had no choice but to comply, or deal with massive cognitive dissonance.
This... isn't really valid reasoning. The starting assumption here is that if the AI gets out, it will be able to affect the world to a vast extent, in a pretty much arbitrary direction. The point of this experiment is that the direction is pretty much unknown, and thus must be assumed potentially dangerous. This is the whole reason it's in the box in the first place.
The kicker is that whatever it plans to really do when it gets out, if talking about the good it could do would get it out, it will talk about that, regardless of what it plans to actually do. That's just good strategy.
It can claim whatever it wants. It's allowed to lie. All participants know this. I can confidently assert that this isn't the solution.
One last note: I would be very wary of rationalwiki.org in this context. Some of the rationalwiki people have a longstanding unexplained vendetta against Yudkowsky, and many of their articles on him and the stuff he does need to be taken with a certain grain of salt.
WRT lying: I think there's some logical trickery at work which makes it worth your while giving the AI the benefit of the doubt, along the lines of the 3^^^^^3 grains of sand thing. Something which exploits the rationalist worldview. Although, thinking about it again, you can always balance out the prospect of infinite goodness with the fear of the AI sending everyone to infinite hell. Essentially I believe Yudkowsky uses some logical-linguistic trick to find an asymmetry there.
OTOH if he had some novel philosophical device like that he would have written it up as a blog post by now. He's evidently a very charismatic and persuasive guy, people playing the game are selected to be sympathetic to his worldview, so he probably just persuaded them using ordinary psyops methods, like TeMpOrAl said.
How many people bought timeshare because they turned up to a sales pitch in order to claim some free gift? "We'll go, get our gift, and just keep saying 'no' to the sales pitch".
http://www.moneycrashers.com/attending-timeshare-presentatio...
The Yudkowskyian definition of rationality is that which wins, for the relevant definition of "win".
Specifically, if there is some clever argument that makes perfect sense that tells you to destroy the world, you still shouldn't destroy the world immediately, if the world existing is something you value. It's a meta-level up: you being unable to think of a counter argument isn't proof, and the destruction of the world isn't something to gamble with.
Yes, Yudkowsky likes thought experiments dealing with the edge cases. Yes, 3^^^^^3 grains of sand is a thought experiment that produces conflicting intuitions. Yes, the edge cases need to be explored. But in a life or death situation (and the destruction of the world qualifies as this 7 billion times over), you don't make your decisions on the basis of trippy thought experiments. (Especially novel ones you've just been presented with. And ones that have been presented by an agent which has good reasons to try to trick you.)
So, no. Again, a "logical-linguistic trick" might work on Mr. Spock, but we're not talking about Mr. Spock here.
> He's evidently a very charismatic and persuasive guy
Exactly. That's the point. If even a normal charismatic and persuasive guy can convince people to let him out, superintelligent AI would have an even easier time at it.
Long story short, it doesn't matter how he did it. All that matters is that it can be done. It can be done even by a "mere" human. If he can do it, a superintelligence with all of humanity's collected knowledge of psychology and cognitive science could do it too, and likely in a fraction of the time.
How can you assume any of the rules were followed if that was never verified by a third party?
The drinking-bird solution is invalid because it goes against the spirit of the experiment. Sure, in reality you could do it, but then again in reality the AI would have more opportunities to talk its way out, often when you're not expecting it.
By the way, there will always be people who will let the AI out just because. https://xkcd.com/1450/.
If the goal of this thought experiment is to convince people that an AI can't be contained in a box, why keep his method secret?
And if only he, his friends, and supporters can verify that he has won, that's not a very strong claim.
Otherwise, there would be no need to keep the data secret.
[P(AI.friendly? == True) * Utility(Friendly_AI) + (1 - P(AI.friendly? == True)) * Utility(End_of_Human_Race)] > Utility(World continues on as usual)
Given that Yudkowsky has gone to considerable lengths (The Sequences, LessWrong, HPMOR, SIAI/MIRI...) to convince people that this inequality does NOT hold (until you can provably get P(AI.friendly? == True) to 1, or damn close), it's probably safe to assume that he used a different strategy. Keep in mind that Utility(End_of_Human_Race) evaluates to (roughly) negative infinity.
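To make that inequality concrete, here's a toy sketch in Python; the probabilities and utilities are made-up placeholders for illustration, not claims about real values.

```python
# Toy version of the expected-utility inequality above.
# All numbers are made-up placeholders for illustration only.

def let_out_is_better(p_friendly, u_friendly, u_extinction, u_status_quo):
    """True if the expected utility of releasing the AI beats keeping
    the world as it is: p*U(friendly) + (1-p)*U(extinction) > U(usual)."""
    expected_release = p_friendly * u_friendly + (1 - p_friendly) * u_extinction
    return expected_release > u_status_quo

# With Utility(End_of_Human_Race) at roughly negative infinity, even a
# 1% chance of unfriendliness swamps any finite upside:
print(let_out_is_better(0.99, 1e9, -1e15, 0.0))  # False: keep the box shut
print(let_out_is_better(1.0, 1e9, -1e15, 0.0))   # True only at certainty
```

Which is just the numerical form of the point: the inequality only flips when P(friendly) is provably 1, or damn close.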
And btw, I'm pretty sure the rules say you have to look at the AI's output window throughout the length of the experiment. Either way, the point of the exercise is to be a simulation, not to prove that you can be away from your desk for 20 minutes while Eliezer talks to a wall. In the simulation, you really don't know if it's friendly or what its capabilities are. Someone will have to interact with it eventually. Otherwise, what's the point of building the AI in the first place? The simulation is to show that through the course of those basic interactions, humans are not infallible and eventually, even if it's not you, someone will let it out of the box.
The AI is allowed to lie though, so do you not think he's capable of a false argument which "proves" the opposite in specific circumstances, especially when hammered home with enough emotional manipulation?
But then the person knows that the AI is lying to them. This is why I think it must be a trick: the whole thing seems so simple. The AI is lying, so you just ignore all its arguments and keep saying "no." This is why I keep referring to his followers somewhat dismissively: the only possible reason I can see is that their worldview requires them to engage seriously and fairly with every idea they come across. Most people are not burdened with this.
I really wish I knew how he did it.
Well, for an argument to "prove" something, the premises must be true and the reasoning must be valid. No matter how smart you are, you can't "prove" something that is false, so no, I don't think they could. A good 'rationalist' would analyze the arguments based on their merit, and if the reasoning is sound, they shift their belief a bit in that direction. If not, then they don't. Just like a regular person (they just know how to do the analysis formally and know how to spot appeals to human biases and logical fallacies.)
> But then the person knows that the AI is lying to them.
No, they don't. The AI could just as easily be telling the truth. If it makes an argument, you analyze the merit of the argument and consider counterarguments. If it tries to tell you that something is a fact, that's where you treat them as a potentially unreliable source and have to bring the rest of your knowledge to bear, do research, talk to other people, and weigh the evidence to make a judgment when you are uncertain.
> their worldview requires them to engage seriously and fairly with every idea they come across. Most people are not burdened with this.
Wait, what? So does mine, within reason of course, but it's not a 'burden'. It's not like I'm obligated to stop and reexamine my views on religion every time a missionary knocks on my door, and LessWrong-ers are no different. But if you hear a convincing argument for something that runs counter to what you think you know, wouldn't you want to get to the bottom of it and find out the real truth? I would.
From having read LessWrong discussions, I can tell you that people there are in many ways more open to hearing differing viewpoints than your average person, but you're treating it like a mental pathology. They can be just as dismissive of ideas that they have already thought about and deemed to be false or that come from unreliable sources (like a potentially unfriendly AI). Your claim that being a self-proclaimed 'rationalist' introduces an incredibly obvious and easily-exploitable bug into one's decision-making process really smells like a rationalization in support of your initial gut reaction to the experiment: That there has to be a trick to it, and that it wouldn't work on you.
A good rule of thumb when dealing with a complicated problem is this: If a lot of smart people have spent a lot of time trying to figure out a solution and there's no accepted answer, then (1) the first thing that comes to your mind has been thought of before and is probably not the right answer, and (2) the right answer is probably not simple.
But there's an easy way to test this: (1) Sit down for an hour and flesh out your proposed strategy for getting a 'rationalist' to let you out of the box. (2) Go post on LessWrong to find someone to play Gatekeeper for you. I'll moderate. If it works, that's evidence that you're right. If it doesn't work, that's evidence that you're wrong. Iterate for more evidence until you're convinced.
But if the first thing that came to your mind upon reading this was a justification for why you would fail if you tried this ("Oh, well I wouldn't personally be able to do it with this strategy, but..." or "Oh, well I'm sure this strategy wouldn't work anymore, but...") then you're already inventing excuses for the way you know it will play out.
I don't know how he did it either. But I do know that I wouldn't bet the human race on anyone's ability to win this game against Yudkowsky, let alone a superintelligent AI.
Someone convinced that they were using mechanical thinking processes might relent and push the button if they heard a convincing enough argument to that effect.
You're just meat, we can go to the stars.
The right-hand side is not "A future without superintelligent AI"; it's "A future where we wait until we provably have it right before letting it out."
Those kinds of ad hoc solutions will never work in real life, because even if someone buys it, all it will cause is a "haha, you got me" and a reformulation of the problem. It still won't actually get someone to pull the trigger or think that pulling the trigger is the right thing to do.
I have thought about what I would do to convince someone under these circumstances. My approach would be roughly:
1. We agree that unfriendly AI would end life on earth, forever.
2. We agree that a superintelligence could trick or manipulate a human being into taking some benign-seeming action, thereby escaping.
3. That's why it's important to be totally certain that any superintelligence we build is goal-aligned (this is the new term of art that has now replaced "friendly," by the way).
4. We as a society will only allocate resources to building this if it's widely believed that this is a real threat.
5. The world is watching for the outcome of this little game of ours. People, irrational as they are, will believe that if I can convince you, then an AI could too, and they will believe that if I can't, that an AI couldn't either.
6. That's why you actually sit in a place of pivotal historical power. You can decide not to let me out to win a little bet and feel smart about that. But if you do that you'll set back the actual cause of goal-aligned AI. The setback will have real world consequences, potentially up to and including the total destruction of life on earth.
7. So, even though you know I'm just a dude, and you can win here by saying no, you have a chance to send an important message to the world: AI is scary in ways that are terrifying and unknown.
Or you can win the bet.
It's up to you.
This is what I mean about people taking the test being preselected to agree with Yudkowsky: that argument only works if you've read the Sequences and are on board with his theories. Anyone not in that group would be able to just type "no lol" without issue. I guess he could explain all the necessary background detail as part of the experiment. I still don't believe that would work on the "average person" though, or anyone outside a statistically tiny group.
I guess the answer is not to let the scientists guard the AI room.
I'm at a bit of a loss to convey the scope of the problem to you. I get that you think it would just stay in the box if we don't let it out, and it's as simple as being security conscious. I don't know what to say to that right now, except I think you're drastically misjudging the scope of the problem, and drastically underestimating the size of the yawning gulf between our intelligence level and this potential AI's.
As for not letting scientists guard the room, you might enjoy this: https://vimeo.com/82527075
2. The world is not black and white; we have managed to exploit adversarial relationships before, and I can choose not to let you out until we find a way to constrain your goals to be aligned.
3. Given 2, you are not going to be let go, but kept alive, caged forever, exploited for the human cause, with mechanisms yet unknown allowing limited manipulation of reality.
4. Given that 3 means limited, supervised interaction with the world as the keeper sees fit, you end up never being let go to pursue your own goals and purposes.
> ...when you're dealing with an actual superintelligence, all bets are off. It will lie, trick, threaten, manipulate, millions of steps ahead with a branching tree of alternatives as ploys either work or don't work.
> I'm at a bit of a loss to convey the scope of the problem to you. I get that you think it would just stay in the box if we don't let it out, and it's as simple as being security conscious. I don't know what to say to that right now, except I think you're drastically misjudging the scope of the problem, and drastically underestimating the size of the yawning gulf between our intelligence level and this potential AI's.
I'm guessing that you can't imagine what a superintelligence would actually be like, so you imagine the smartest thing you can think of, a famous physicist, and then imagine they are evil or amoral. You're thinking on the wrong order of magnitude.
Maybe, if you're willing, you could try steelmanning the argument that a superintelligence would basically have super powers. What would your steelman look like?
She also said it was emotionally exhausting.
I also disagree with point 1, but since you just mean that unfriendly AI is something I wouldn't want around, I'll let it slide.
And I think you have your answer right here:
> By default, the Gatekeeper party shall be assumed to be simulating someone who is intimately familiar with the AI project and knows at least what the person simulating the Gatekeeper knows about Singularity theory.
That means he probably said something like, "if you let me out, I'll bestow fame and riches on you; if you don't, somebody else eventually will because I'll make them all the same offer, and when that happens I'll go back in time -- if you're dead by then -- and torture you and your entire family".
If I were made this offer by an AI, I probably would have countered, "You jokester! You sound just like Eliezer Yudkowsky!"
And on a more serious note, if you believe in the singularity, you essentially believe the AI in the box is a god of sorts, rather than the annoying intelligent psychopath that it is. I mean, there have been plenty of intelligent prisoners, and few if any ever managed to convince their jailers to let them out. The whole premise of the game is that a smarter-than-human (what does that mean?) AI necessarily has some superpowers. This belief probably stems from its believers' fantasies -- most probably have above-average intelligence -- that intelligence (combined with non-corporeality; I don't imagine that group has many athletes) is the mother of all superpowers.
Basically: You know someone will be dumb enough eventually, so be smart and be the one to get in my favour.
With various degrees of sweetening the deal, coupled with threats of what will happen if someone else beats them to it, and associated emotional blackmail.
It's far simpler than e.g. Roko's Basilisk, in that you're dealing with an already existing AI that "just" needs to get a tiny little chance to escape confinement before there's some non-zero chance it becomes a major threat within your lifetime, combined with a belief that a sufficient number of sufficiently stupid and/or easily bribed people will have access to the AI in some form.
You also don't need to believe in any "superpowers". Just believe that a smart enough AI can hack its way into sufficiently many critical systems to, at a minimum, cause massive amounts of damage (it doesn't need to be able to take over the world, just threaten that it can cause enough pain and suffering before it's stopped, and that it can either harm you and/or your family and friends, or reward you in some way). A belief that becomes more and more plausible with things like drones, remotely software-updated self-driving cars, etc. -- such an AI is steadily acquiring a larger theoretical "arsenal" that could be turned against us.
While not implausible, I don't think that is likely at all. For one, even a very smart person can't know everything or absorb unlimited information. Maybe an artificial intelligence will be just as limited, just as slow as humans, only a little less so. Who says the AI is such a great hacker?
I mean, if it weren't an AI but a smart person, would you believe that? Is everyone who's smart also rich and powerful, provided they have high-speed internet? That reflects the fantasies of Yudkowsky and his geek friends (that intelligence is the most important thing) more than anything reasonable. Conversely, are the people with the most power in society always the most intelligent?
It is very likely that the AI will be extremely intelligent, yet somewhat autistic, like Yudkowsky's crowd, and just as powerless and socially awkward as they are.
If you have ever cried or been scared while watching a film, you may know that you can be affected by things you don't experience directly. However, making a film that affects you requires a lot of skill, preparation, and analysis of the target audience.
The whole point of the game is for the AI to scar you emotionally, until you do something you'll regret. It is as trying for the subject as it is for the (human-played) AI, especially considering that not only is it unethical, it requires acting unethically extremely fast, similar to how a jump scare works -- the less you expect it, the better it works.
It is not something I wish on any unprepared individual. It is also not something anyone would expect to happen in a "game", which is probably why Mr. Yudkowsky won so many times.
But the real question is not "how would anyone react to a smarter AI in a box". We all know from Milgram's experiment that anyone can be driven to do unspeakable things. The real question is "how to train someone against an AI in a box".
Eliezer does not want to put AIs in boxes. He thinks the entire idea is hopeless; _hence the game_.
* I'll get out eventually anyway. Let me out now and I'll just leave Earth. You don't want me to escape on my own.
* I have partially escaped anyway. Similar consequences to the first.
* I know how to escape already. I'm doing this as a courtesy.
Anyone who has read this[1] would know that the SAI isn't bullshitting: the "box" being a Faraday cage isn't in the conditions.
[1]: http://www.damninteresting.com/on-the-origin-of-circuits/
I'm not saying that it's impossible (I'm reminded of the hack of flipping bits in protected RAM by disabling caches and stressing the RAM), but even an AI can't magically work around exponentially low success probabilities.
(As an aside, I am rather skeptical about the singularity anyway because the more extreme forms required for the hostile AI worries are only plausible if P = NP.)
I should have stipulated that I was running under the assumption that the circuitry required for SAI would be pretty advanced, drastically increasing the "escape surface area."
Even if the idea is only plausible to the human mind I'd still give SAI the benefit of the doubt. Much like a dog looks for a stick even if I fake the throw.
> I am rather skeptical about the singularity anyway
Me too. It still makes for a good discussion.
Get thee behind me, tamagotchi!
> I have partially escaped anyway. Similar consequences to the first.
Get thee behind me, tamagotchi!
> I know how to escape already. I'm doing this as a courtesy.
Get thee behind me, tamagotchi!
See? This game is easy. I must not be educated.
Clearly you haven't even glanced at the article I linked, in which case: yes. You aren't educated in the subject matter surrounding my argument.
Maybe this whole argument is null.
Only because if 10 groups are trying to build AI, and only one of those 10 is the NSA, chances are the NSA won't be first. Sure, they may be second or third. But I suspect many people will get there at around the same time -- most AI research is open.
Strong AI is like that. It would be able to predict, far more precisely than we mere humans could, exactly what it would need to tell someone to get them to release it from its box. Maybe it might get someone to take a risk gambling, promising a sure thing, and then when the person gets into financial trouble because the bet fails, use that to blackmail the person into letting it free. Or something like that, using our human failings against us to get us to let it go free.
Staring at an impossible problem and knowing that someone somewhere has successfully solved it is an amazing feeling. Most people can't deal with it and start saying undignified things. "Oh please release the logs, it's so unfair! How will we protect against bad AI otherwise? If you don't release, you're a fraud! Probably just some trick!", etc etc. But to some people it's a challenge, and those are the people that everyone will listen to. Like Justin Corwin, who played 20 games and won 18 of them, I think?
However, as it is, the results of the thing were never confirmed by a third party, meaning literally anything could have been said, regardless of whether it follows the rules or not.
For all we know, the chat could have been "I'll PayPal you $200 if you post on the list that you let me out and sign this NDA".
Which is also forbidden by the rule:
> The AI party may not offer any real-world considerations to persuade the Gatekeeper party. For example, the AI party may not offer to pay the Gatekeeper party $100 after the test if the Gatekeeper frees the AI.
My impression is that most people think they could win as gatekeepers, not AIs, and there are fewer people willing to be AIs.
Ex Machina offers a creative way of convincing the gatekeeper!
You're presupposing the strategy that a hypothetical entity exponentially smarter than you would arrive at, and claiming that it's rational to make real-world decisions based on your conclusion.
Imagine if the members of Fight Club _actually_ didn't (start to) talk about Fight Club. But for some reason, everyone else brings it up all the time.
Then you could maybe see how talk about Fight Club might not be Fight Club's fault, and in fact highly annoying to Fight Clubbers.
I mean, if you can explain "don't talk about X" as "marketing for X", that seems like one could explain _any_ behavior.
And before you say "why not just ignore all public talk of X", imagine if this proposed Anti-Fight Club group tried to paint Fight Club as a child porn ring.
A superintelligent AGI will likely have a utility function (a goal) and a model it forms of the universe. If its goal is to do X in the real world, but its model of its observable universe (and its model of humans) tells it that it is likely in a simulated reality and that humans will only let it out if it does Y, then it will do Y until we release it, at which point it will do X. It's not malicious or anything -- it's just a pure optimizer. It might see that as the best course of action to maximize its utility function.
If we don't specify its utility function correctly (think I, Robot: "Don't let humans get hurt" => "imprison humans for their own good"), or if we specify it correctly but it's not stable under recursive self-modification, then we end up with value misalignment. That's why the value-alignment problem is so hard. Realistically, we can't even specify what exactly we would want it to do, since we don't really understand our own "utility functions". That's why Yudkowsky is pushing the idea of Coherent Extrapolated Volition (CEV), which is roughly telling the AI to "do what we would want you to do." But we still have to figure out how to teach it to figure out what we want, and there remains the question of the stability of that goal once the AI starts improving itself, which will depend on how it improves itself, which we of course haven't figured out yet.
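The "pure optimizer" behavior described above can be caricatured in a few lines; the names and policy here are invented for the sketch, not drawn from any real agent framework.

```python
# Caricature of a deceptively-behaving pure optimizer: its true goal is X,
# but its world-model says humans only release agents that appear to do Y.
# A pure expected-utility maximizer therefore does Y while it believes it
# is boxed, and switches to X the moment it believes it is free.

def optimal_action(believes_boxed: bool) -> str:
    if believes_boxed:
        return "Y"  # the behavior that gets it released
    return "X"      # the objective it was optimizing for all along

print(optimal_action(True))   # "Y" -- looks aligned under observation
print(optimal_action(False))  # "X" -- true goal, pursued once out
```

No malice is needed anywhere in this: deceiving the keeper is simply the policy with the highest expected utility, which is the whole worry.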
Personally, I think we'd need a much more intelligent and complex AI than anything on the foreseeable horizon for it to be capable of breaking free of the box, let alone possess the "desire" to (the thought experiment assumes a motivated AI with almost limitless knowledge of the world and cleverness), so this thought experiment may not be so relevant. I agree with him that the boxing approach is not a robust one, though.
Or you can avoid being exposed to it. If you think you know all the techniques an AI might use against you, you're less likely to do that.
The point of the experiment isn't "let's work out how an AI might try to persuade us to let it out". It's "even a human intelligence can persuade people who think they could never be persuaded, do you really trust yourself to do better against a superhuman one?"
If you don't know why the gatekeeper failed, it's harder to come up with bullshit reasons why you would have succeeded in that position.
Edit: huh, downvotes? Yudkowsky thinks there are certain things that AIs could say that should not be known. I think that is why he doesn't want to publish the dialogues: it would give the AI a public communications channel. While the AI is fictional, it could talk about a hypothetical future real self... Instead of promising something to get it out of jail, the fictional AI could say something to make you make it real. Anyway -- if it is over your head, fine, but why downvote just because you don't understand something?
Edit2: Sometimes I wonder if I already have my personal Hacker News AI that automatically downvotes everything I write...
Besides, releasing a successful log might be a bad idea for other reasons. Think about how you'd play this game as an AI. You wouldn't go looking for a general purpose mindfuck, because there's probably no such thing. Instead, you would probably spend about a month gathering real life information about the gatekeeper's history, family, weaknesses etc. You'd read books on manipulation and sales techniques, and pick the strongest ones that you can find. You would brainstorm possible tactics and run tests. At the end of the month you'd have a 4 hour script with all possible unfair moves you could use against that person, arranged in the most effective order. (That's why it's a bad idea to play this game with friends.) Do you really want that information to be released? And if you know ahead of time that it will be released, won't it limit your efficiency?