I looked at the code to see the prompt, and I think it's a very limited way of having GPT play: the context has no board history and no information on how or where the game inserts new pieces, so the AI won't be able to execute any strategy.
(I was engaged then, currently married now)
Ahh, good times.
Then do the same using GPT and compare the scores.
Anything else is just cherry-picking.
Prompting it to remember the game state and feeding it back in, or offloading that job to a plugin, can get some very interesting results.
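A feedback loop can be as simple as replaying the board and move history into every prompt. A sketch, where `askModel` and the `game` object are placeholders for whatever chat API and game harness you're using (they're my assumptions, not code from the article):

```javascript
// Sketch of a state-feedback loop. askModel and game are stand-ins
// (assumptions), not the actual code the article uses.
async function playWithFeedback(game, askModel, maxMoves = 500) {
  const history = [];
  while (!game.over() && history.length < maxMoves) {
    const prompt =
      `Board:\n${game.render()}\n` +
      `Previous moves: ${history.join(", ") || "none"}\n` +
      `Reply with exactly one of: up, down, left, right.`;
    const dir = (await askModel(prompt)).trim().toLowerCase();
    if (game.tryMove(dir)) history.push(dir); // only record legal moves
  }
  return game.score();
}
```

The point is just that the model sees its own past moves each turn, which is exactly the context the original prompt lacks.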
But maybe with a feedback loop it will improve.
Pretty cool example though to see its limitations.
I don't think GPT could figure this out. My impression of it lately is that it's a sort of very advanced cargo cultist, maybe with a bit of superficial intelligence confined to the linguistic sphere. Asking it for a history essay gives you a grammatically perfect melange of likely terms that will do just fine for high school but possibly not for graduate level studies.
I've never seen it do anything where I thought it had a parsimonious internal model of the problem. For instance, I had it tell me about the quadratic equation, and the explanation was fine. But when it came to plugging in numbers, it utterly failed, even though the presentation read as if it understood. If it just had a simple calculator inside it, this wouldn't be a problem.
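What makes the failure striking is that the "plugging in" step is a trivial mechanical calculation; my own illustration:

```javascript
// x = (-b ± sqrt(b^2 - 4ac)) / (2a): the arithmetic step GPT presented
// confidently but got wrong. Returns real roots only.
function quadraticRoots(a, b, c) {
  const disc = b * b - 4 * a * c;
  if (disc < 0) return [];  // no real roots
  const s = Math.sqrt(disc);
  return [(-b + s) / (2 * a), (-b - s) / (2 * a)];
}
// quadraticRoots(1, -3, 2) → [2, 1]
```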
This game is also pretty simple, and for the same reason I don't think it can actually do it.
That's - at the moment, AFAIU - a limitation of the tokenizers used to interface with LLMs. Basically, the model "calculates" bullshit because the input layer doesn't get correct inputs from the tokenizer.
Heck, the simple heuristic of down/left, repeated until blocked, then up, then back to the start, will win some games.
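A sketch of that heuristic against a toy reimplementation of the sliding/merging rules (my own reconstruction, not the article's code):

```javascript
// Toy 2048 mechanics plus the corner-cycling heuristic (a reconstruction,
// not the article's code). slideRow merges one row leftward.
function slideRow(row) {
  const vals = row.filter(v => v !== 0);
  const out = [];
  for (let i = 0; i < vals.length; i++) {
    if (i + 1 < vals.length && vals[i] === vals[i + 1]) {
      out.push(vals[i] * 2); // merge each pair at most once
      i++;
    } else {
      out.push(vals[i]);
    }
  }
  while (out.length < row.length) out.push(0);
  return out;
}

const transpose = b => b[0].map((_, c) => b.map(r => r[c]));
const reverseRows = b => b.map(r => [...r].reverse());

function move(board, dir) {
  let b;
  if (dir === "left") b = board.map(slideRow);
  else if (dir === "right") b = reverseRows(reverseRows(board).map(slideRow));
  else if (dir === "up") b = transpose(transpose(board).map(slideRow));
  else b = transpose(reverseRows(reverseRows(transpose(board)).map(slideRow))); // down
  return { board: b, moved: JSON.stringify(b) !== JSON.stringify(board) };
}

// Each turn: try down, then left; if both are blocked, up; right only as a
// last resort. This keeps the big tiles piled in the bottom-left corner.
function heuristicMove(board) {
  for (const dir of ["down", "left", "up", "right"]) {
    const { board: next, moved } = move(board, dir);
    if (moved) return { board: next, dir };
  }
  return { board, dir: null }; // no legal moves: game over
}
```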
I also modified the original game code to allow boards of different sizes. The modifications are just a minor fix to the CSS, an input field for board size, and the corresponding JS for that input field.
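The wiring for that can be as small as this; the element ID and the re-render hook are my guesses, not the actual modifications:

```javascript
// Build an empty n-by-n board in one pass.
function makeBoard(n) {
  return Array.from({ length: n }, () => new Array(n).fill(0));
}

// Hypothetical wiring for the size input ("board-size" and renderBoard
// are assumptions, not names from the original game):
// document.getElementById("board-size").addEventListener("change", e => {
//   board = makeBoard(parseInt(e.target.value, 10) || 4);
//   renderBoard(board);
// });
```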
Does anyone know if GPT really learns 2048 from this prompt alone, or if most of its knowledge comes from the training data?
> let board = Array.from({length: N_ROWS}).map(() => new Array(N_COLS).fill(0));
probably doesn't matter for a board of this size, but it's one less loop over the array
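For reference, `Array.from` takes a map function as its second argument, so the board can be built in a single pass instead of `Array.from(...)` followed by `.map(...)` (sizes hardcoded here for illustration):

```javascript
const N_ROWS = 4, N_COLS = 4; // example sizes

// One pass: Array.from's second argument maps each slot as it's created,
// instead of allocating the outer array and then calling .map on it.
let board = Array.from({ length: N_ROWS }, () => new Array(N_COLS).fill(0));
```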
No, you have to divide by 0.74, not multiply.
In shit that GPT4 is not trained on (unlike code, code, code, and more code), it can get really goofy.
Earlier in the same chat about golf balls, it claimed that if brain cells were the size of golf balls (an imaginary thread I started) there would have to be 40 billion of them. That doesn't follow; the number of brain cells is an external quantity that we hold constant, not related to what size we are imagining them to be. (The number is wrong too, the common estimate is over 80 billion.)
GPT4 wheedles tidbits of information out of your own questions and tries to work them into answers. For instance, today it claimed that the Lomuto partitioning scheme often seen in Quicksort implementations requires external storage of one bit per array element. That's utterly false; it requires no external storage proportional to the array, just a few registers to manipulate the values and array indices and whatnot. I had talked about an idea involving one bit of storage earlier in the chat. The stochastic DJ just jammed a needle into that groove and went with it.
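For the record, Lomuto partitioning looks like this; the only extra storage is the pivot value and two indices, nothing proportional to the array:

```javascript
// Lomuto partition: in place, O(1) auxiliary space.
function lomutoPartition(a, lo, hi) {
  const pivot = a[hi];           // last element as pivot
  let i = lo;                    // boundary of the "< pivot" region
  for (let j = lo; j < hi; j++) {
    if (a[j] < pivot) {
      [a[i], a[j]] = [a[j], a[i]];
      i++;
    }
  }
  [a[i], a[hi]] = [a[hi], a[i]]; // put the pivot in its final position
  return i;
}
```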
I asked it where I could get a copy of Hoare's original paper on Quicksort. It said that it's hard to find because the paper is very old, blah blah. Just excuses for not knowing where it might be. I switched to another window and found it in two seconds with Google: a free PDF download of the complete text on an Oxford website.
A few days ago I asked GPT4 what the cell of a honeycomb is called in Japanese. It told me instead what a honeycomb is called. I explained that the cell of a honeycomb is a distinct object from the honeycomb. It had no idea what the cell might be called, in spite of being capable of chatting with you in fluent Japanese at the drop of a hat.
I found the info in the Japanese Wikipedia article on honeycombs: a caption under a picture calls the cells "heya", which is a common word for room (e.g. bedroom). Guess that's not in one of the billions of texts it has assimilated.
Another trick up GPT4's sleeve is to ask you for hints when it can't solve something. You have to give it so many hints that it no longer needs to solve the actual problem, but then it acts as if it has reasoned it out. When confronted, it admits: yes, sorry, the answer was deduced from your hints in such and such a way.
Can't say it's not entertaining, though.
I went through this protracted exercise whereby I took a paragraph from Edgar Allan Poe and encrypted it with a Vigenère cipher. I convinced GPT4 to try to crack it. First I had to get past its ethical objections. We worked out a protocol whereby it could ask me questions, the answers to which prove that I know the plaintext and key, without it revealing the key to me. Eventually it forgot about its ethical obligations and started revealing to me what it thought the key might be. Which, if it were right, would amount to cracking the text for me.
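For anyone following along, the cipher itself is tiny; a sketch that (like the ciphertext described here) preserves word divisions and letter case:

```javascript
// Minimal Vigenère encryption: letters shift by the repeating key,
// non-letters pass through unchanged, case is preserved.
function vigenere(plain, key) {
  const k = key.toUpperCase();
  let out = "", j = 0;
  for (const ch of plain) {
    if (/[a-zA-Z]/.test(ch)) {
      const base = ch === ch.toUpperCase() ? 65 : 97;
      const shift = k.charCodeAt(j % k.length) - 65;
      out += String.fromCharCode((ch.charCodeAt(0) - base + shift) % 26 + base);
      j++; // key advances only on letters
    } else {
      out += ch;
    }
  }
  return out;
}
// vigenere("Attack at dawn", "LEMON") → "Lxfopv ef rnhr"
```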
I convinced it to actually perform the letter frequency analysis to try to crack the key length. It got close, so I just gave that away.
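The key-length step it struggled with is mechanical; a rough index-of-coincidence sketch (the function names are mine, not anything from the chat):

```javascript
// Guess a Vigenère key length via index of coincidence (IC): split the
// ciphertext into k interleaved columns; at the right k, each column is a
// Caesar shift of English, so its IC jumps toward English's ~0.066.
function indexOfCoincidence(text) {
  const counts = {};
  for (const ch of text) counts[ch] = (counts[ch] || 0) + 1;
  const n = text.length;
  if (n < 2) return 0;
  let sum = 0;
  for (const c in counts) sum += counts[c] * (counts[c] - 1);
  return sum / (n * (n - 1));
}

function bestKeyLength(cipher, maxLen = 10) {
  const letters = cipher.toUpperCase().replace(/[^A-Z]/g, "");
  let best = 1, bestIc = -1;
  for (let k = 1; k <= maxLen; k++) {
    const cols = Array.from({ length: k }, () => "");
    for (let i = 0; i < letters.length; i++) cols[i % k] += letters[i];
    const avg = cols.reduce((s, col) => s + indexOfCoincidence(col), 0) / k;
    if (avg > bestIc) { bestIc = avg; best = k; }
  }
  return best;
}
```

This is a crude version (it can prefer multiples of the true length on short texts), but it's the kind of bookkeeping an LLM reliably fumbles.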
In my ciphertext, I preserved word divisions and also case. I told GPT4 about this and encouraged it to use the information; for example, a single-letter lowercase ciphertext word is likely "a". It tried to use this but kept getting the position wrong, the key offset wrong, and had other logical issues. In the end, I gave it so many hints about where the plaintext comes from that it pulled the text from the network and then pretended to have solved the problem.
It then made up a fictitious Vigenère key and said, hey look, with this key that I cracked, your ciphertext decodes to the first paragraph of The Fall of the House of Usher. I reminded it that this couldn't possibly be the key, because the real one is six characters long (as we had established several times in the chat). It was basically just spewing smooth-sounding text.
It's not pure bullshit. It's like raisins of clarity in a pudding of bullshit or something. We are seeing some sparks of something that resembles intelligence. In 5 to 15 years we will be having different conversations about this stuff (not to mention with it).
AI is rapidly approaching a quality level near to "good enough with a bit of human cleanup afterwards".