This result - poor ChatGPT performance - surprises me. I thought pattern detection and set formation were things ChatGPT could do well. Perhaps it would need a model specifically trained for this task. If AlphaZero can master chess, then surely this game isn't beyond what is trainable.
You can prompt ChatGPT that it'll be playing the Connecting Wall without having to explain the game. It still fails to make a good set of connections when provided the wall.
One interesting property of the Connecting Wall sets is that there is almost always a "wordy" one involving changing a letter, adding a prefix, anagrams, etc.; almost always a "person" one (for example, there'll be a set of famous people named Tom, but not a set of Toms alongside a set of Margarets); and then a couple of general sets.
This is a huge help given the 2 minutes and 30 seconds provided.
On another note, it's possible that the GCHQ puzzle book is in the training set; it has many puzzles with solutions in this format, and a very similar rubric with 55 items and sets of sizes 1 through 10. That said, ChatGPT perhaps would not tie the answers in the back of the book to the puzzles in the front.
In all, I think an AI trained for this purpose, on problems with given solutions, ought to end up mastering this format. But general-purpose ChatGPT seems to perform very badly.
I would speculate it's struggling because of the linear nature of its output, and the red-herring words which cross over between categories.
Because the model can’t “look ahead”, it starts spitting out valid combinations, but without being able to anticipate that committing to a certain combination early on will lead to a mistake later.
I expect that if you asked it to correct its output in a follow-up message, it could do so without much difficulty.
I had a similar idea to the author and tried this many times, albeit with the free version of ChatGPT. After getting wrong results, I prompted it to correct them, even telling the model explicitly that a category is wrong or doesn't make sense. Nothing I did made a difference.
My two cents on why this doesn't work: the answer has to contain a discrete set of words given in the prompt, and importantly, they must not be duplicated. I suspect that these current models are not very good at following the instruction "each token should appear in the answer exactly once".
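If that's the failure mode, a thin wrapper can at least detect it mechanically instead of trusting the model. A minimal sketch (the helper name and shape are my own, not from the thread) that checks whether a proposed answer uses each wall word exactly once:

```python
from collections import Counter

def answer_is_valid(answer_groups, wall_words):
    """True iff the answer is 4 groups of 4 that use each wall word exactly once."""
    used = Counter(word for group in answer_groups for word in group)
    return (len(answer_groups) == 4
            and all(len(g) == 4 for g in answer_groups)
            and used == Counter(wall_words))
```

A wrapper could simply re-prompt whenever this check fails, rather than hoping the model follows the uniqueness instruction on its own.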
Aren't there already models that CAN look ahead? Or are there none?
Can we infer anything about what LLMs can achieve from what we can achieve with AIs like AlphaGo? I thought their approaches were completely separate.
GPTs are a class of text predictors. Ultimately they are ranked on whether or not the output is similar to the training data, text-wise. If the training data included a game, then it may be able to play that game, but only if that game requires reasoning about entire words (because of tokenization, GPTs can't reason in terms of letters, which is why they do poorly at crosswords, for example).
On the flip side, AlphaZero is a class of networks that have a list of actions they can take, and a list of parameters they observe about the game (in chess, the board position; in other games, their position on screen, score, speed, etc.). The model is then trained to take actions that maximize an actual hard value from the game, like winning a game of chess, capturing a piece, increasing a score, or driving the furthest.
In theory you could train a model with the AlphaGo method to do text prediction, but LLMs are called "large" for a reason: the input and output spaces would have to be the number of possible tokens (and at that point, just train a normal GPT; it's much more efficient). Also in theory you could train a GPT to play games, but you'd spend huge amounts of compute evaluating extraneous words in the input (the prompt) and the output (most words have nothing to do with your game). On top of that, you iterate over every word you generate to produce the next one, so you're doing multiple passes of this largely inefficient computation, which makes you slower than a tailor-made model that can evaluate a situation once and give you a list of actions to perform.
In this specific case it's a bit weird, because the input space for the AlphaZero-style model would have to be every word that can appear on the board, but the reasoning part is most likely not a problem given enough model size. Since it's competing with a multi-gigabyte LLM, though, there is space to spare.
The problem is that Connections is ultimately a search problem that requires more than simply grouping similar words. There are lots of combinations to assess. I bet that if you enumerate, score, and rank all possible groupings, an LLM would perform much better.
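For scale, that search space is fixed and small enough to enumerate exhaustively: 16 words split into 4 unordered groups of 4. A sketch (the scoring/ranking step is assumed to live elsewhere):

```python
import math
from itertools import combinations

def partitions(words, group_size=4):
    """Yield every split of `words` into unordered groups of `group_size`."""
    if not words:
        yield []
        return
    first, rest = words[0], words[1:]
    # Pin the first remaining word into a group so group order isn't counted twice.
    for others in combinations(rest, group_size - 1):
        group = (first,) + others
        remaining = [w for w in rest if w not in others]
        for tail in partitions(remaining, group_size):
            yield [group] + tail

# Closed form for the full wall: 16! / (4!^4 * 4!) candidate partitions.
full_wall = math.factorial(16) // (math.factorial(4) ** 4 * math.factorial(4))
print(full_wall)  # 2627625
```

About 2.6 million candidates is few enough that a cheap scorer (embedding similarity, or an LLM asked to rate one grouping at a time) could rank every one.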
(it even ignored some embarrassing typos ...)
Connections is much less so.
On the other hand, it feels like the kind of thing where an LLM might be surprisingly good, because in theory it could see more correlations than a human can. Based on these results, I guess my intuition seems to hold up.
I wonder if a better/different way to approach this would be more "algorithmic": maybe have the LLM generate a list of possible categories for each individual word, and then operate on those associations?
Cool article!
The game itself is sort of an embedding-clustering problem, with the added difficulty that each group needs to be alike in only one way (versus a full vector distance, which measures how alike they are in every way).
Maybe there is some way to search for a vector of weights which, when multiplied by all members of a group of 4, produces weighted vectors with the least distance from their center? And then it's an optimization problem to find the 4 groups that minimize the total distance from each group's center.
It may be possible to find a weight vector that selects for a particular slice of a word's meaning.
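Because a per-dimension weight vector scales each embedding axis independently, the within-group spread decomposes dimension by dimension, and the constrained minimizer has a closed form: put all the weight on the axis where the group already agrees most. A toy sketch with random stand-in "embeddings" (not real word vectors):

```python
import random

random.seed(0)
# Toy stand-in for the embeddings of one candidate group: 4 "words", 8 dims.
E = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]

def spread(w, E):
    """Total squared distance of the reweighted vectors from their centroid."""
    total = 0.0
    for i in range(len(w)):
        col = [w[i] * row[i] for row in E]
        mean = sum(col) / len(col)
        total += sum((x - mean) ** 2 for x in col)
    return total

# The weights act per dimension, so spread(w) = sum_i w_i^2 * s_i,
# where s_i is the group's unweighted spread along dimension i alone.
s = [spread([1.0 if j == i else 0.0 for j in range(8)], E) for i in range(8)]

# Under a fixed budget sum(w_i^2) = 1, the minimum puts all the weight
# on the dimension where the group already agrees most closely.
best = min(range(8), key=lambda i: s[i])
w_opt = [1.0 if i == best else 0.0 for i in range(8)]
```

Finding the four groups would then be the combinatorial layer on top: pick the 4x4 partition whose per-group optimal spreads sum to the least.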
My feeling is that it'll struggle with the wordplay in Only Connect/Connections (missing letters, added letters, words within words, homophones, etc.) as well as two-step references (such as {Venice, Dream, Night, Nothing} => "last words of Shakespeare plays").
I thought it would. But I've spent a fair bit of effort, using both embeddings and prompts to GPT-4, as well as combinations of the two approaches, trying to make a good spymaster for Codenames, with essentially zero success.
I wonder if something like https://wordassociations.net/en might be better for it than embeddings.
So really you could fine-tune three models - one for 16 words, one for 12, and one for 8 - then use them in succession.
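The succession itself is just a loop that peels one group off at a time; `predict_group` below is a hypothetical stand-in for whichever size-specific fine-tuned model you'd call:

```python
def solve_wall(words, predict_group):
    """Peel off groups of 4, re-querying the size-specific model each round.

    `predict_group(size, remaining)` is a hypothetical callback that asks the
    model fine-tuned for `size`-word walls for its most confident group.
    """
    groups, remaining = [], list(words)
    for size in (16, 12, 8):
        group = predict_group(size, remaining)
        groups.append(group)
        remaining = [w for w in remaining if w not in group]
    groups.append(remaining)  # the last four words are forced
    return groups
```

Note that the 4-word case needs no model at all: once three groups are out, the last group is whatever remains.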
Also, if it makes a mistake at the end (include some negative examples in the training sets), tell it to start over and add to the prompt what you think is NOT a group.
Asking GPT to just pick any group adds a lot of extra "mental overhead".
Though of course this works best if all the groups are roughly of the same difficulty.
I'm just making up that example, but it's very common that multiple words all appear to form one group when actually each belongs to a different group.
GPT-3.5 learns how to generalize better when it’s just in the completion.
This is the same problem that vexed the researchers who did the paper on the alleged reversal curse.
(https://andrewmayne.com/2023/11/14/is-the-reversal-curse-rea...)
- A two-pass approach: prompt it to generate the groupings, then separately prompt it to find the words that belong in each group ("Which of the following words best fit the category 'COMMON DOG NAMES'?"). It does way better on the more specific queries.
- Don't tell it the constraints of 4 groups of 4 words; ask it for at least four groups of 2 or more words. Once you have 4+ groups of 4+ words, you can make logical inferences with your Python wrapper to come up with all the possible 4x4 groupings. If you're lucky there will only be one. If not... more queries to GPT, I guess, but I haven't figured this part out.

See "chain of thought", e.g.:
1. Have it do a thinking/brainstorming phase first to try to work out what the potential categories are.
2. Then ask it to scan over each word and think about what categories it could go in, in order of likelihood.
3. Ask it to do the final answer.
Format the training set in that way, as if it got everything right at each step (since you only have the right answers).
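As a sketch, those three steps can be encoded as a fixed chain of prompts (the wording below is illustrative, not taken from the thread):

```python
def build_prompts(words):
    """Build the three chain-of-thought prompts for one Connections wall."""
    grid = ", ".join(words)
    return [
        # 1. Brainstorm candidate categories without committing.
        f"Here are 16 words from a Connections wall: {grid}. "
        "Brainstorm every category that at least two of these words might "
        "belong to. Do not pick final groups yet.",
        # 2. Per-word pass: candidate categories in order of likelihood.
        "For each word, list the candidate categories it could fit, "
        "ordered from most to least likely.",
        # 3. Commit.
        "Using your notes above, give the final answer: four groups of "
        "four words, each with its category name.",
    ]
```

Fine-tuning examples would then be formatted as the ideal transcript of all three steps, as suggested above.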
It sounds like you had 7 * 30 = 210 examples. Maybe you can feed it a batch of ten at a time, explain the game, and try to get GPT-4 to generate more examples. You will have to check whether they make sense.
I assume that by increasing the size of the dataset by a factor of ten, and having the LLM think through the problem using multiple steps, you will get significantly better results.
No matter how I instructed it to think, it frequently could not work out the very first category.
If this is the case, performance might be improved by taking the final solving responsibility away from the model and giving it to the script. You could ask GPT for categories, ask whether each word fits each category (discarding categories with fewer than 4 words), and then search for 4 non-overlapping categories.
(This might be missing the point of the exercise though.)
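The script's side of that division of labour is an ordinary backtracking search: given the categories GPT proposed (as word sets), find four pairwise-disjoint 4-word selections that cover the wall. A sketch (function name and shape are my own):

```python
from itertools import combinations

def pick_groups(categories, wall):
    """Search proposed categories for 4 disjoint 4-word groups covering the wall."""
    wall = frozenset(wall)
    # Expand categories with more than 4 members into every 4-word subset;
    # this also discards categories with fewer than 4 words on the wall.
    candidates = [(name, frozenset(c))
                  for name, words in categories.items()
                  if len(set(words) & wall) >= 4
                  for c in combinations(sorted(set(words) & wall), 4)]

    def backtrack(chosen, used):
        if len(chosen) == 4:
            return chosen if used == wall else None
        for name, group in candidates:
            if not group & used:
                found = backtrack(chosen + [(name, group)], used | group)
                if found:
                    return found
        return None

    return backtrack([], frozenset())
```

Expanding over-full categories into all 4-word subsets also absorbs GPT's tendency to put five or six plausible words into one category.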
I also don't see how it makes it "shareable". Wouldn't it be more shareable if they let everyone win and just gave them a score?