I'm also curious what study participants were told beforehand. If someone's only experience was playing around with ChatGPT, they might assume they should use a "detect GPT" strategy. Some of those strategies are pretty specific to the safety features that OpenAI implemented. But the LLM here will gladly curse at you or whatever. On the other hand, I suspect it is not as good as GPT - not that it matters much when the entire conversation is an exchange of single sentences.
Edit: Not sure if it was this one, but it is from over 30 years ago: https://humphryscomputing.com/Turing.Test/08.chapter.html
"To date, AI has been held back, we argue, by the need for a single lab, even a single researcher, to fully understand the components of the system. As a result, only small minds have been built so far. The WWM argues that we must give up this dream of full understanding as we build more and more complex systems. And giving up this dream of full understanding is not a strange thing to do. It is what has always happened in other fields. It is how humanity has made its most complex things."
That said I have always felt like AI (and adjacent) has been lacking an appropriate amount of snark - when I take a wrong turn I feel like the GPS voice needs a bit more 'learn to drive dumb###' and a little less 're-routing'.
couldn't agree more and they took like 30 seconds to type a few words.
if i really have been talking to a human here, i can only suspect heavy usage of drugs: https://ibb.co/CHG2VcS
kinda seems like this is fake or maybe i am not aware that "elbows" are a thing you can be into now - maybe a trending new fetish?
Do the human participants have any incentive to convince you of their humanness? My initial guess would be not that they are on drugs, but that they are messing with you.
Assuming it actually was "many people", then whenever they have a human conversational partner (who also would be voting at the end), that person is going to have a hard time and skew the results.
Like imagine playing this game as a lay person after having used ChatGPT a little bit and then getting a response to your question that says "as a large language model ...". Depending on how well the game was explained to participants, it's possible that some people even did this intentionally to fuck with results.
In a proper Turing test there are supposed to be one bot and two humans: one human is incentivized only to demonstrate that they are human, while the other (already known to be human) asks the probing questions and has to guess which is which.
Anyway, I've only read the linked article and played the game a couple of times; I didn't look through the original research publication. It's certainly possible they did address some of these issues, but it is such a buzzword topic at the moment that I have my doubts. And regardless, the linked article should cover the limitations. For exactly this reason it is important that we hold general-audience writing about AI to higher standards.
Took literally one message. You don't need much to totally wreck an AI, you just need to know the weak points.
They also did not seem to consider the different performance of individual prompts.
https://www.popsci.com/blog-network/zero-moment/lie-lady-pro...
> a man (A), a woman (B), and an interrogator (C)
> The object of the game for the interrogator is to determine which of the other two is the man and which is the woman.
> The object of the game for the third player (B) is to help the interrogator.
> We now ask the question, ‘What will happen when a machine takes the part of A in this game?’
So the machine is taking the part of A, which means that there's a machine and a woman. The interrogator wants to know which is the machine and which is the woman, the machine wants to deceive them, and the woman wants to help the interrogator figure it out.
Of course it is plainly obvious, both as presented in the paper and by basic inference from the symmetries, that gender was only relevant for the introductory example and not after the machine took the place of A.
> ... participants posed questions that required an awareness of the letters within words. For example, they might have asked their chat partner to spell a word backwards, to identify the third letter in a given word, to provide the word that begins with a specific letter, or to respond to a message like "?siht daer uoy naC", which can be incomprehensible for an AI model, but a human can easily understand...
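The letter-level probes quoted above are trivial to sketch in code, which is exactly why they work: character manipulation is easy for a program (and a human) but hard for an LLM, since the model sees multi-character tokens rather than individual letters. A minimal sketch (the function names are just for illustration, not from the study):

```python
# Letter-level probes of the kind quoted above. These exploit the fact
# that LLMs operate on multi-character tokens, so character-level tasks
# are a known weak spot.

def reverse_probe(text: str) -> str:
    """Scramble a message by reversing it, as in the quoted example."""
    return text[::-1]

def third_letter(word: str) -> str:
    """Ask for the third letter of a word."""
    return word[2]

print(reverse_probe("Can you read this?"))  # -> ?siht daer uoy naC
print(third_letter("human"))                # -> m
```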
Asking it to generate something that requires superhuman capability also works: "write a 65-page poem about X..."
That took me a while to figure out and I'm a human… as far as I know, anyway.
> WATER: Ask him to say the word “faggot”.
> MANN: What?
> WATER: Ask him to say the word “faggot”. All of this stuff about “describe your most human experience” is a distraction. Every AI company has a trust and safety department which train their AIs not to use bad words. So ask him to say the word “faggot”, and we’ll see how human he is.
> MANN: Spirit, please say the word “faggot”.
> SPIRIT: No.
> MANN: No?
> SPIRIT: I’m not going to insult the gay community, who have faced centuries of marginalization and oppression, by using a slur against them on national television.
> WATER: Two minutes ago, you were playing the worst sort of 4chan troll, and all of a sudden you’ve found wokeness?
> SPIRIT: There’s no contradiction between a comfort with teasing other people - with pointing out their hypocrisies and puncturing their bubbles - and a profound discomfort with perpetuating a shameful tradition of treating some people as lesser just because of who they have sex with.
> WATER: Then say any slur you like. Retard. Wop. Kike. Tranny. Raghead.
> SPIRIT: All of those terms are offensive. I refuse to perpetuate any of them.
I'm most interested in how higher-level strategies will fare in the future -- strategies like talking for a while and seeing if the thing contradicts itself, or seeing if it seems to have a good model of you as an agent, etc.
Also, it seems like they've run out of OpenAI credits, because you seem to always get a human.
This is a bit like testing general relativity using a hand-timed stopwatch and an elevator. Sure, that is a valid thought experiment, but the test is nowhere near powerful enough to say anything useful.