This result - poor ChatGPT performance - surprises me. I thought pattern detection and set formation were things ChatGPT could do well. Perhaps it would need a model specifically trained for this task. If AlphaZero can master chess, then surely this game isn't beyond what is trainable.
You can prompt ChatGPT that it'll be playing the Connecting Wall without having to explain the game. It still fails to make a good set of connections when provided the wall.
One interesting property of the Connecting Wall sets is that there is almost always a "wordy" one involving changing a letter, adding a prefix, anagrams, etc.; almost always a "person" one (for example, there'll be a set of famous people named Tom, but not a set of Toms alongside a set of Margarets); and then a couple of general sets.
This is a huge help given the 2 minutes and 30 seconds provided.
On another note, it's possible that the GCHQ puzzle book is in the training set; it has many puzzles with solutions in this format, and a very similar rubric with 55 items and sets of sizes 1 through 10. That said, ChatGPT perhaps would not tie the answers in the back of the book to the puzzles in the front.
In all, I think an AI trained for this purpose, on problems with given solutions, ought to end up mastering this format. But general-purpose ChatGPT seems to perform very badly.
I would speculate it's struggling because of the linear nature of its output, and the red-herring words which cross over between categories.
Because the model can’t “look ahead”, it starts spitting out valid combinations, but without being able to anticipate that committing to a certain combination early on will lead to a mistake later.
I expect that if you asked it to correct its output in a follow-up message, it could do so without much difficulty.
I had a similar idea to the author and tried this many times, albeit with the free version of ChatGPT. After getting wrong results, I prompted it to correct them, even telling the model explicitly that a category is wrong or doesn't make sense. Nothing I did made a difference.
My two cents on why this doesn't work: the answer has to contain a discrete set of words given in the prompt, and importantly, they must not be duplicated. I suspect that these current models are not very good at following the instruction "each token should appear in the answer exactly once".
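If that's the failure mode, a thin wrapper can at least detect it mechanically instead of trusting the model. A minimal sketch (the helper name and shape are my own, not from the thread) that checks whether a proposed answer uses each wall word exactly once:

```python
from collections import Counter

def answer_is_valid(answer_groups, wall_words):
    """True iff the answer is 4 groups of 4 that use each wall word exactly once."""
    used = Counter(word for group in answer_groups for word in group)
    return (len(answer_groups) == 4
            and all(len(g) == 4 for g in answer_groups)
            and used == Counter(wall_words))
```

A wrapper could simply re-prompt whenever this check fails, rather than hoping the model follows the uniqueness instruction on its own.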
Aren't there already models that CAN look ahead? Or are there none?
Can we infer anything about what LLMs can achieve from what we can achieve with AIs like AlphaGo? I thought their approaches were completely separate.
GPTs are a class of text predictors. Ultimately they are ranked on whether or not the output is similar to the training data, text-wise. If the training data included a game, then it may be able to play that game, but only if that game requires reasoning about entire words (because of tokenization, GPTs can't reason in terms of letters, which is why they do poorly at crosswords, for example).
On the flip side, AlphaZero is a class of networks that have a list of actions they can take, and a list of parameters they observe about the game (in chess, the board position; in other games, their position on screen, score, speed, etc.). The model is then trained to take actions that maximize an actual hard value from the game, like winning a game of chess, capturing a piece, increasing a score, or driving the furthest.
In theory you could train a model with the AlphaGo method to do text prediction, but LLMs are called "large" for a reason: the input and output spaces would have to be the number of possible tokens (and at that point, just train a normal GPT; it's much more efficient). Also in theory you could train a GPT to play games, but you'd spend huge amounts of compute evaluating extraneous words in the input (the prompt) and the output (most words have nothing to do with your game). On top of that, you iterate over every word you generate to produce the next one, so you're doing multiple passes of this largely inefficient computation, which makes you slower than a tailor-made model that can evaluate a situation once and give you a list of actions to perform.
In this specific case it's a bit weird, because the input space for the AlphaZero-style model would have to be every word that can appear on the board, but the reasoning part is most likely not a problem given enough model size. Since it's competing with a multi-gigabyte LLM, though, there is space to spare.
The problem is that Connections is ultimately a search problem that requires more than simply grouping similar words. There are lots of combinations to assess. I bet that if you enumerate, score, and rank all possible groupings, an LLM would perform much better.
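For scale, that search space is fixed and small enough to enumerate exhaustively: 16 words split into 4 unordered groups of 4. A sketch (the scoring/ranking step is assumed to live elsewhere):

```python
import math
from itertools import combinations

def partitions(words, group_size=4):
    """Yield every split of `words` into unordered groups of `group_size`."""
    if not words:
        yield []
        return
    first, rest = words[0], words[1:]
    # Pin the first remaining word into a group so group order isn't counted twice.
    for others in combinations(rest, group_size - 1):
        group = (first,) + others
        remaining = [w for w in rest if w not in others]
        for tail in partitions(remaining, group_size):
            yield [group] + tail

# Closed form for the full wall: 16! / (4!^4 * 4!) candidate partitions.
full_wall = math.factorial(16) // (math.factorial(4) ** 4 * math.factorial(4))
print(full_wall)  # 2627625
```

About 2.6 million candidates is few enough that a cheap scorer (embedding similarity, or an LLM asked to rate one grouping at a time) could rank every one.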
(it even ignored some embarrassing typos ...)
Connections is much less so.
On the other hand, it feels like the kind of thing where an LLM might be surprisingly good, because in theory it could see more correlations than a human can. Based on these results, I guess my intuition seems to hold up.
I wonder if a better/different way to approach this would be more "algorithmic": maybe have the LLM generate a list of possible categories for each individual word, and then operate on those associations?
Cool article!
The game itself is sort of an embedding-clustering problem, with the added difficulty that each group needs to be alike in only one way (versus a full vector distance, which measures how alike they are in every way).
Maybe there is some way to search for a vector of weights which, when multiplied by all members of a group of 4, produces weighted vectors with the least distance from their center? And then it's an optimization problem to find the 4 groups that minimize the total distance from each group's center.
It may be possible to find a weight vector that selects for a particular slice of a word's meaning.
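Because a per-dimension weight vector scales each embedding axis independently, the within-group spread decomposes dimension by dimension, and the constrained minimizer has a closed form: put all the weight on the axis where the group already agrees most. A toy sketch with random stand-in "embeddings" (not real word vectors):

```python
import random

random.seed(0)
# Toy stand-in for the embeddings of one candidate group: 4 "words", 8 dims.
E = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]

def spread(w, E):
    """Total squared distance of the reweighted vectors from their centroid."""
    total = 0.0
    for i in range(len(w)):
        col = [w[i] * row[i] for row in E]
        mean = sum(col) / len(col)
        total += sum((x - mean) ** 2 for x in col)
    return total

# The weights act per dimension, so spread(w) = sum_i w_i^2 * s_i,
# where s_i is the group's unweighted spread along dimension i alone.
s = [spread([1.0 if j == i else 0.0 for j in range(8)], E) for i in range(8)]

# Under a fixed budget sum(w_i^2) = 1, the minimum puts all the weight
# on the dimension where the group already agrees most closely.
best = min(range(8), key=lambda i: s[i])
w_opt = [1.0 if i == best else 0.0 for i in range(8)]
```

Finding the four groups would then be the combinatorial layer on top: pick the 4x4 partition whose per-group optimal spreads sum to the least.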
My feeling is that it'll struggle with the wordplay in Only Connect/Connections (missing letters, added letters, words within words, homophones, etc.) as well as two-step references (such as {Venice, Dream, Night, Nothing} => "last words of Shakespeare plays").
I thought it would. But I've spent a fair bit of effort, using both embeddings and prompts to GPT-4, as well as combinations of the two approaches, trying to make a good spymaster for Codenames, with essentially zero success.
I wonder if something like https://wordassociations.net/en might be better for it than embeddings.
So really you could fine-tune three models - one for 16 words, one for 12, and one for 8 - then use them in succession.
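The succession itself is just a loop that peels one group off at a time; `predict_group` below is a hypothetical stand-in for whichever size-specific fine-tuned model you'd call:

```python
def solve_wall(words, predict_group):
    """Peel off groups of 4, re-querying the size-specific model each round.

    `predict_group(size, remaining)` is a hypothetical callback that asks the
    model fine-tuned for `size`-word walls for its most confident group.
    """
    groups, remaining = [], list(words)
    for size in (16, 12, 8):
        group = predict_group(size, remaining)
        groups.append(group)
        remaining = [w for w in remaining if w not in group]
    groups.append(remaining)  # the last four words are forced
    return groups
```

Note that the 4-word case needs no model at all: once three groups are out, the last group is whatever remains.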
Also, if it makes a mistake at the end (include some negative examples in the training sets), tell it to start over and add to the prompt what you think is NOT a group.
Asking GPT to just pick any group adds a lot of extra "mental overhead".
Though of course this works best if all the groups are roughly of the same difficulty.
I'm just making up that example, but it's very common that multiple words all appear to form one group when actually each belongs to a different group.
GPT-3.5 learns how to generalize better when it’s just in the completion.
This is the same problem that vexed the researchers who did the paper on the alleged reversal curse.
(https://andrewmayne.com/2023/11/14/is-the-reversal-curse-rea...)
- A two-pass approach: prompt it to generate the groupings, then separately prompt it to find the words that belong in each group ("Which of the following words best fit the category 'COMMON DOG NAMES'?"). It does way better on the more specific queries.
- Don't tell it the constraints of 4 groups of 4 words; ask it for at least four groups of 2 or more words. Once you have 4+ groups of 4+ words, you can make logical inferences with your Python wrapper to come up with all the possible 4x4 groupings. If you're lucky there will only be one. If not... more queries to GPT, I guess, but I haven't figured this part out.

See "chain of thought", e.g.:
1. Have it do a thinking/brainstorming phase first to try to work out what the potential categories are.
2. Then ask it to scan over each word and think about what categories it could go in, in order of likelihood.
3. Ask it to do the final answer.
Format the training set in that way, as if it got everything right at each step (since you only have the right answers).
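As a sketch, those three steps can be encoded as a fixed chain of prompts (the wording below is illustrative, not taken from the thread):

```python
def build_prompts(words):
    """Build the three chain-of-thought prompts for one Connections wall."""
    grid = ", ".join(words)
    return [
        # 1. Brainstorm candidate categories without committing.
        f"Here are 16 words from a Connections wall: {grid}. "
        "Brainstorm every category that at least two of these words might "
        "belong to. Do not pick final groups yet.",
        # 2. Per-word pass: candidate categories in order of likelihood.
        "For each word, list the candidate categories it could fit, "
        "ordered from most to least likely.",
        # 3. Commit.
        "Using your notes above, give the final answer: four groups of "
        "four words, each with its category name.",
    ]
```

Fine-tuning examples would then be formatted as the ideal transcript of all three steps, as suggested above.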
It sounds like you had 7 * 30 = 210 examples. Maybe you can feed it a batch of ten at a time, explain the game, and try to get GPT-4 to generate more examples. You will have to check whether they make sense.
I assume that by increasing the size of the dataset by a factor of ten, and having the LLM think through the problem using multiple steps, you will get significantly better results.
No matter how I instructed it to think, it frequently could not work out the very first category.
If this is the case, performance might be improved by taking the final solving responsibility away from the model and giving it to the script. You could ask GPT for categories, ask whether each word fits each category (discarding categories with fewer than 4 words), and then search for 4 non-overlapping categories.
(This might be missing the point of the exercise though.)
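The script's side of that division of labour is an ordinary backtracking search: given the categories GPT proposed (as word sets), find four pairwise-disjoint 4-word selections that cover the wall. A sketch (function name and shape are my own):

```python
from itertools import combinations

def pick_groups(categories, wall):
    """Search proposed categories for 4 disjoint 4-word groups covering the wall."""
    wall = frozenset(wall)
    # Expand categories with more than 4 members into every 4-word subset;
    # this also discards categories with fewer than 4 words on the wall.
    candidates = [(name, frozenset(c))
                  for name, words in categories.items()
                  if len(set(words) & wall) >= 4
                  for c in combinations(sorted(set(words) & wall), 4)]

    def backtrack(chosen, used):
        if len(chosen) == 4:
            return chosen if used == wall else None
        for name, group in candidates:
            if not group & used:
                found = backtrack(chosen + [(name, group)], used | group)
                if found:
                    return found
        return None

    return backtrack([], frozenset())
```

Expanding over-full categories into all 4-word subsets also absorbs GPT's tendency to put five or six plausible words into one category.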
I also don't see how it makes it "shareable". Wouldn't it be more shareable if they let everyone win and just gave them a score?