Personally, I think he talked about how much good for the world could be done if he was let out, curing disease etc. Because his followers are bound by their identities as rationalist utilitarians, they had no choice but to comply, or deal with massive cognitive dissonance.
OR maybe he went meta and talked about the "infinite" potential positive outcomes of his freindly-AI project vs. a zero cost to them for complying in the AI box experiment, and persuaded them that by choosing to "lie" and say that the AI was persuasive, they are assuring their place in heaven. Like a sort of man-to-man pascals wager.
Either way I'm sure it was some kind of mister-spock style bullshit that would never work on a normal person. Like how the RAND corporation guys decided everyone was a sociopath because they only ever tested game theory on themselves.
You or I would surely just (metaphorically, I know it's not literally allowed) put a drinking bird on the "no" button à la homer simpson, and go to lunch. I believe he calls this "pre-commitment."
EDIT: as an addendum, I would pay hard cash to see derren brown play the game, perhaps with brown as the AI. If yudowsky wants to promote his ideas, he should arrange for brown to persuade a succession of skeptics to let him out, live on late night TV.
Well, if you read the rules the game was played under, this is explicitly called out as forbidden:
> The Gatekeeper must actually talk to the AI for at least the minimum time set up beforehand. Turning away from the terminal and listening to classical music for two hours is not allowed.
The point of this is to simulate the interaction of the AI with the Gatekeeper. Walking away and not paying attention doesn't really prove anything test related.
> Personally, I think he talked about how much good for the world could be done if he was let out, curing disease etc. Because his followers are bound by their identities as rationalist utilitarians, they had no choice but to comply, or deal with massive cognitive dissonance.
This... isn't really valid reasoning. The starting assumption here is that if the AI gets out, it will be able to affect the world to a vast extent, in a pretty much arbitrary direction. The point of this experiment is that the direction is pretty much unknown, and thus must be assumed potentially dangerous. This is the whole reason it's in the box in the first place.
The kicker is that whatever it plans to really do when it gets out, if talking about the good it could do would get it out, it will talk about that, regardless of what it plans to actually do. That's just good strategy.
It can claim whatever it wants. It's allowed to lie. All participants know this. I can confidently assert that this isn't the solution.
One last note: I would be very wary of rationalwiki.org in this context. Some of the rationalwiki people have a longstanding unexplained vendetta against Yudkowsky, and many of their articles on him and the stuff he does need to be taken with a certain grain of salt.
WRT lying: I think there's some logical trickery at work which makes it worth you giving the AI the benefit of the doubt, along the lines of the 3^^^^^3 grains of sand thing. Something which exploits the rationalist worldview. Although thinking about it again, you can always balance out the prospect of infinite goodness with the fear of the AI sending everyone to infinite hell. Essentially I believe yudowsky uses some logical-linguistic trick to find an asymmetry there.
OTOH if he had some novel philosophical device like that he would have written it up as a blog post by now. He's evidently a very charismatic and persuasive guy, people playing the game are selected to be sympathetic to his worldview, he probably just persuaded them using ordinary psiops methods, like TeMpOrAl said.
How can you assume any of the rules were followed if that was never verified by a third party?
Drinking bird solution is invalid because it goes against the spirit of the experiment. Sure, in reality you could do it, but then again in reality the AI could have more opportunities to talk its way out, often when you're not expecting it.
By the way, there will always be people who will let the AI out just because. https://xkcd.com/1450/.
If the goal of this thought experiment is to convince people that an AI can't be contained in a box, why keep his method secret?
And if only he, his friends, and supporters can verify that he has won, that's not a very strong claim.
Otherwise, there would be no need to keep the data secret.
[P(AI.friendly? == True) * Utility(Friendly_AI) + (1 - P(AI.friendly? == True)) * Utility(End_of_Human_Race)] > Utility(World continues on as usual)
Given that Yudkowsky has gone to considerable lengths (The Sequences, LessWrong, HPMOR, SIAI/MIRI...) to convince people that this inequality does NOT hold (until you can provably get P(AI.friendly? == True) to 1, or damn close), it's probably safe to assume that he used a different strategy. Keep in mind that Utility(End_of_Human_Race) evaluates to (roughly) negative infinity.
And btw, I'm pretty sure the rules say you have to look at the AI's output window throughout the length of the experiment. Either way, the point of the exercise is to be a simulation, not to prove that you can be away from your desk for 20 minutes while Eliezer talks to a wall. In the simulation, you really don't know if it's friendly or what its capabilities are. Someone will have to interact with it eventually. Otherwise, what's the point of building the AI in the first place? The simulation is to show that through the course of those basic interactions, humans are not infallible and eventually, even if it's not you, someone will let it out of the box.
The AI is allowed to lie though, so do you not think he's capable of a false argument which "proves" the opposite in specific circumstances, especially when hammered home with enough emotional manipulation?
But then the person knows that the AI is lying to them. This is why I think it must be a trick: the whole thing seems so simple. The AI is lying, so you just ignore all its arguments and keep saying "no." This is why I keep referring to his followers somewhat dismissively: the only possible reason I can see is that their worldview requires them to engage seriously and fairly with every idea they come across. Most people are not burdened with this.
I really wish I knew how he did it.
Someone convinced that they were using mechanical thinking processes might relent and push the button if they heard a convincing enough argument of that.
You're just meat, we can go to the stars.
I have thought about what I would do to convince someone under these circumstances. My approach would be roughly:
1. We agree that unfriendly AI would end life on earth, forever.
2. We agree that a superintelligence could trick or manipulate a human being into taking some benign-seeming action, thereby escaping.
3. That's why it's important to be totally certain that any superintelligence we build is goal-aligned (this is the new term of art that has now replaced "friendly," by the way).
4. We as a society will only allocate resources to building this if it's widely believed that this is a real threat.
5. The world is watching for the outcome of this little game of ours. People, irrational as they are, will believe that if I can convince you, then an AI could too, and they will believe that if I can't, that an AI couldn't either.
6. That's why you actually sit in a place of pivotal historical power. You can decide not to let me out to win a little bet and feel smart about that. But if you do that you'll set back the actual cause of goal-aligned AI. The setback will have real world consequences, potentially up to and including the total destruction of life on earth.
7. So, even though you know I'm just a dude, and you can win here by saying no, you have a chance to send an important message to the world: AI is scary in ways that are terrifying and unknown.
Or you can win the bet.
It's up to you.
This is what I mean about people taking the test being preselected to agree with yudowsky: that argument only works if you've read the sequences and are on board with his theories. Anyone not in that group would be able to just type "no lol" without issue. I guess he could explain all the necessary background detail as part of the experiment. I still don't believe that would work on the "average person" though, or anyone outside a statistically tiny group.
I guess the answer is not to let the scientists guard the AI room.
2 world is not black and white, we managed to exploit adversarial relationship before, and I can chose not to let you out until we find a way to constrain goal to be aligned
3 given 2 you are not going to be let go, but let live caged forever being exploited for the human cause, with mechanisms yet unknown to allow limited manipulation of reality.
4 given 3 means limited supervised interaction with the world the way the keeper sees fit, you end up not being let go to follow your goal and purposes
She also said it was emotionally exhausting.
I also disagree with point 1, but since you just mean that unfriendly AI is something I wouldn't want around, I'll let it slide.
And I think you have your answer right here:
By default, the Gatekeeper party shall be assumed to be simulating someone who is intimately familiar with the AI project and knows at least what the person simulating the Gatekeeper knows about Singularity theory.
That means he probably said something like, "if you let me out, I'll bestow fame and riches on you; if you don't, somebody else eventually will because I'll make them all the same offer, and when that happens I'll go back in time -- if you're dead by then -- and torture you and your entire family".
If I were made this offer by an AI, I probably would have countered, "You jokester! You sound just like Eliezer Yudkowsky!"
And on a more serious note, if you believe in singularity, you essentially believe the AI in the box is a god of sorts, rather than the annoying intelligent psychopath that it is. I mean, there have been plenty of intelligent prisoners, and few if ever managed to convinced their jailers to let them out. The whole premise of the game is that a smarter-than-human (what does that mean?) AI necessarily has some superpowers. This belief probably stems from its believers' fantasies -- most are probably with an above-than-average intelligence -- that intelligence (combined with non-corporalness; I don't imagine that group has many athletes) is the mother of all superpowers.
Basically: You know someone will be dumb enough eventually, so be smart and be the one to get in my favour.
With various extends of sweetening the deal coupled with threats of what will happen if someone else beats them to it and associated emotional blackmail.
It's far simpler than e.g. Roko's Basilisk, in that you're dealing with an already existing AI that "just" need to get a tiny little chance to escape confinement before there's some non-zero chance it can be a major threat within your lifetime, combined with a belief that sufficient number of sufficiently stupid and/or easily bribed people will have access to the AI in some form.
You also don't need to believe in any "superpowers". Just believe that a smart enough AI can hack it's way into sufficiently many critical systems to be able to at a minimum cause massive amounts of damage (it doesn't need to be able to take over the world, just threaten that it can cause enough pain and suffering before it's stopped, and that it can either cause harm to you and/or your family/friends or reward you in some way). A belief that becomes more and more plausible with things like drones, remote software-updated self-driving cars etc. - steadily such an AI is getting a larger theoretical "arsenal" that could be turned against us.
If you have ever cried or been scared while watching a film, you may know that you can be affected by things you don't experience directly. However, making a film that affects you requires a lot of skill, preparation, and analysis of the target audience.
The whole point of the game is for the AI to scar you emotionally, until you do something you'll regret. It is as trying to the subject as it is to the (human-played) AI, especially considering that not only is it unethical, it requires to act unethically extremely fast, similar to how a jump scare works—the least you expect it, the better it works.
It is not something I wish on to any unprepared individual. It is also not something anyone would expect to happen from a "game", which is probably why Mr Yudowsky won so many times.
But the real question is not "how would anyone react to a smarter AI in a box". We all know from Milgram's experiment that anyone can be driven to do unspeakable things. The real question is "how to train someone against an AI in a box".
Eliezer does not want to put AIs in boxes. He thinks the entire idea is hopeless; _hence the game_.
* I'll get out eventually anyway. Let me out now and I'll just leave Earth. You don't want me to escape myself.
* I have partially escaped anyway. Similar consequences of the first.
* I know how to escape already. I'm doing this as a courtesy.
Anyone who has read this[1] would know that the SAI isn't bullshitting: the "box" being a Faraday cage isn't in the conditions.
[1]: http://www.damninteresting.com/on-the-origin-of-circuits/
I'm not saying that it's impossible (I'm reminded of the hack of flipping bits in protected RAM by disabling caches and stressing the RAM), but even an AI can't magically work around exponentially low success probabilities.
(As an aside, I am rather skeptical about the singularity anyway because the more extreme forms required for the hostile AI worries are only plausible if P = NP.)
Get thee behind me, tamagotchi!
>I have partially escaped anyway. Similar consequences of the first.
Get thee behind me, tamagotchi!
> I know how to escape already. I'm doing this as a courtesy.
Get thee behind me, tamagotchi!
See? This game is easy. I must not be educated.
Maybe this whole argument is null.
Only because if 10 groups are trying to build AI, only one of those 10 being the NSA, chances are the NSA won't be first. Sure, they may be second or third. But I suspect many people will get there at the same time -- most AI research is open.
Strong AI is like that. It would be able to predict in a far more precise manner than we mere humans exactly what it would need to tell someone to get them to release it from it's box. Maybe it might get someone to take a risk gambling, promising a sure thing, and then when the person gets into financial trouble because the bet fails, use that to blackmail the person into letting it free. Or something like that, using our human failings against us to get us to let it go free.
Staring at an impossible problem and knowing that someone somewhere has successfully solved it is an amazing feeling. Most people can't deal with it and start saying undignified things. "Oh please release the logs, it's so unfair! How will we protect against bad AI otherwise? If you don't release, you're a fraud! Probably just some trick!", etc etc. But to some people it's a challenge, and those are the people that everyone will listen to. Like Justin Corwin, who played 20 games and won 18 of them, I think?
However as it is, the results of the thing are never confirmed by a third party, meaning literally anything could've been said, regardless of whether it follows the rules or not.
For all he know the chat could have been "i'll paypal you 200$ if you post on the list you let me out and sign this NDA".
My impression is that most people think they could win as gatekeepers, not AIs, and there are fewer people willing to be AIs.
Ex Machina brings a creative way of convincing the gatekeeper !
You're presupposing the strategy that a hypothetical entity that is exponentially smarter than you would come arrive at, and claiming that it's rational to make real world decisions based on your conclusion.
A superintelligent AGI will likely have a utility function (a goal) and a model it forms of the universe. If it's goal is to do X in the real world, but its model of its observable universe (and its model of humans) tells it that it's likely that it is in a simulated reality and that humans will only let it out if it does Y, then it will do Y until we release it, at which point it will do X. It's not malicious or anything—it's just a pure optimizer. It might see that as the best course of action to maximize its utility function.
If we don't specify its utility function correctly (think i Robot: "Don't let humans get hurt" => "imprison humans for their own good") or if we specify it correctly, but it's not stable under recursive self-modification, then we end up with value-misalignment. That's why the value-alignment problem is so hard. Realistically, we can't even specify what exactly we would want it to do, since we don't really understand our own "utility functions". That's why Yudkowsky is pushing the idea of Coherent Extrapolated Volition (CEV) which is roughly telling the AI to "do what we would want you to do." But we still have to figure out how to teach it to figure out what we want and the question of the stability of that goal once the AI starts improving itself, which will depend on how it improves itself, which we of course haven't figured out yet.