freediver 1y ago

Impressed by the model so far. As far as independent testing goes, it is topping our leaderboard for chess puzzle solving by a wide margin now:

https://github.com/kagisearch/llm-chess-puzzles?tab=readme-o...

thrance 1y ago

Nice project! Are you aware of the following investigations: https://blog.mathieuacher.com/GPTsChessEloRatingLegalMoves/

Some have been able to achieve a higher Elo with a different prompt based on the PGN format.

gpt-3.5-turbo-instruct was able to reach an Elo of ~1750.

chermanowicz 1y ago

On one hand, some of these results are impressive; on the other, the illegal-move count is alarming. It suggests no reasoning ability, since there should never be an illegal move. I mean, how could breaking the rules of a fairly basic game (from a rules perspective) be acceptable in assigning any 'outcome' to a model other than failure?

freediver (OP) 1y ago

Agreed, this is what makes evaluating this very hard. A 1700 Elo chess player would never make an illegal move, let alone have 12% illegal moves.

So from the model's perspective, we see at the same time a display of brilliancy (most 1700 chess players would not be able to solve as many puzzles by looking just at the FEN notation) and, on the other side, a complete lack of understanding of what it is trying to do at a fundamental, human-reasoning level.

vbezhenar 1y ago

That's because LLMs do not reason. To me, as a layman, it seems strange that they don't wire in some kind of Prolog engine to fill the gap (like they wired in Python to fill the gap in arithmetic), but it's probably not that easy.

mistrial9 1y ago

My guess is that the probabilistic engine does sequence variation and just will not do anything else, so a simple A->B sort of logic is elusive at a deep level. Second, the adaptive and very broad kinds of questions and behaviors it handles also make it difficult to write logic that could correct defective answers to simple logic.

elicksaur 1y ago

> and Kagi is well positioned to serve this need.

>CEO & founder of Kagi

Important context for anyone like me who was wondering where the boldness of the first statement was coming from.

Edit: looks like the parent has been edited to remove the claim I was responding to.

muzani 1y ago

My favorite part of HN is just casually bumping into domain experts and celebrities without realizing it. No profile pic is such a feature.

freediver (OP) 1y ago

Yeah, it was an observation that was better suited for a tweet than HN. Here it is:

https://twitter.com/vladquant/status/1790130917849137612

Powdering7082 1y ago

Wow, from an adjusted Elo of 1144 to 1790, that's a huge leap. I wonder if they are giving it access to a 'scratch pad'.

muzani 1y ago

My guess is that handling visual stuff directly accidentally gives it some powers similar to Beth Harmon.

borgdefense 1y ago

I wasn't impressed in the first 5 minutes of using it, but it is quite impressive after 2 solid hours of random topics.

Much faster for sure, but I have also not had anything give an error in Python with Jupyter. Usually you could only stray so far with more obscure Python libraries before it started producing errors.

That much better than 4 in chess is pretty shocking in a great way.

mewpmewp2 1y ago

I see you have a Connect 4 test there.

I tried playing against the model; it didn't do well at blocking my win.

However, it feels like it might be possible, by prompting well, to make it think ahead and make sure that all threats are blocked.

Maybe that could lead somewhere, if it explains its reasoning first?

This prompt worked for me to get it to block after I put three pieces in the 4th column. Otherwise it didn't:

Let's play connect 4. Before your move, explain your strategy concisely. Explain what you must do to make sure that I don't win in the next step, as well as explain what your best strategy would be. Then finally output the column you wish to drop. There are 7 columns.

Always respond with JSON of the following format:

type Response = {
      am_i_forced_to_block: boolean;
      other_considerations: string[];
      explanation_for_the_move: string;
      column_number: number;
}

I start with 4.

Edit:

So it went

Me: 4

It: 3

Me: 4

It: 3

Me: 4

It: 4 - Successful block

Me: 5

It: 3

Me: 6 - Intentionally, to see if it will win by putting another 3.

It: 2 -- So here it failed, I will try to tweak the prompt to add more instructions.

me: 4
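
A minimal sketch of how a transcript like the one above could be checked mechanically, assuming a standard 6-row, 7-column board and 1-based column numbers as in the prompt. The helpers `replay` and `winning_columns` are hypothetical names for illustration and are not taken from the linked repo.

```python
ROWS, COLS = 6, 7
ME, IT = "M", "I"  # the human player and the model


def new_board():
    # Each column is a stack of pieces, bottom piece first.
    return [[] for _ in range(COLS)]


def drop(board, col, player):
    """Drop a piece into a 1-based column."""
    stack = board[col - 1]
    if len(stack) >= ROWS:
        raise ValueError(f"column {col} is full")
    stack.append(player)


def cell(board, c, r):
    # Return the piece at 0-based (column, row), or None if empty or off-board.
    if 0 <= c < COLS and 0 <= r < len(board[c]):
        return board[c][r]
    return None


def makes_four(board, c, r, player):
    # Count contiguous pieces through (c, r) along each of the four line directions.
    for dc, dr in ((1, 0), (0, 1), (1, 1), (1, -1)):
        count = 1
        for sign in (1, -1):
            k = 1
            while cell(board, c + sign * dc * k, r + sign * dr * k) == player:
                count += 1
                k += 1
        if count >= 4:
            return True
    return False


def winning_columns(board, player):
    """1-based columns where `player` would complete four in a row right now."""
    result = []
    for c in range(COLS):
        if len(board[c]) >= ROWS:
            continue
        board[c].append(player)  # try the move
        if makes_four(board, c, len(board[c]) - 1, player):
            result.append(c + 1)
        board[c].pop()  # undo the move
    return result


def replay(moves):
    # Moves alternate, with the human (ME) going first as in the transcript.
    board = new_board()
    for i, col in enumerate(moves):
        drop(board, col, ME if i % 2 == 0 else IT)
    return board


position = replay([4, 3, 4, 3, 4, 4, 5, 3, 6])  # the game up to "Me: 6"
print(winning_columns(position, IT), winning_columns(position, ME))
```

On the position after "Me: 6", this reports column 3 as a win for the model and column 7 as a threat it would otherwise have to block, which is consistent with the comment that playing 2 was a failure. A check like this could also be used to score the `am_i_forced_to_block` field of the JSON response.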

freediver (OP) 1y ago

Care to add a PR?

itissid 1y ago

Have you tried replacing the input string with a random but fixed mapping and obfuscating that it's chess (e.g. replacing the word 'chess' with, say, 'an alien ritual practice') to see how it does?
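
One way this idea could be sketched, assuming the puzzles are given as FEN strings (as mentioned elsewhere in the thread) and that only the piece letters in the placement field are remapped. The functions `make_mapping`, `obfuscate_fen`, and `restore_fen` are hypothetical and not part of the linked repo.

```python
import random
import string

PIECES = "pnbrqkPNBRQK"  # piece letters used in the FEN placement field


def make_mapping(seed=0):
    """Build a random but fixed substitution for the FEN piece letters."""
    # Replacement symbols are letters that never appear as FEN piece letters,
    # so the mapping can always be inverted without collisions.
    pool = [ch for ch in string.ascii_letters if ch not in PIECES]
    rng = random.Random(seed)  # fixed seed: the same mapping on every run
    symbols = rng.sample(pool, len(PIECES))
    return dict(zip(PIECES, symbols))


def obfuscate_fen(fen, mapping):
    # Only the first field (piece placement) is rewritten. Digits and '/'
    # pass through unchanged. The remaining fields (side to move, castling,
    # and so on) are left as-is in this sketch, so they may still hint at chess.
    placement, *rest = fen.split(" ")
    hidden = "".join(mapping.get(ch, ch) for ch in placement)
    return " ".join([hidden, *rest])


def restore_fen(fen, mapping):
    # Invert the substitution, e.g. to translate the model's answer back.
    inverse = {v: k for k, v in mapping.items()}
    return obfuscate_fen(fen, inverse)


mapping = make_mapping(seed=1)
start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
print(obfuscate_fen(start, mapping))
```

The prompt text would need the same treatment (no mention of chess, pieces, or FEN) for the test to say anything about memorization versus rule-following.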

parhamn 1y ago

Is the test set public?

freediver (OP) 1y ago

Yes, in the repo.

DatoClement 1y ago

I think specifying the chess rules at the beginning of the prompt might help mitigate the problem of illegal moves.

mritchie712 1y ago

woah, that's a huge leap, any idea why it's that large of a margin?

using it in chat, it doesn't feel that different

whimsicalism 1y ago

would love if you could do multiple samples or even just resampling and get a bootstrapped CI estimate
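
A minimal sketch of a percentile bootstrap for a puzzle solve rate, assuming the results are available as one 0/1 outcome per puzzle. The function name `bootstrap_ci` and the example outcomes are made up for illustration and are not results from the leaderboard.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(outcomes)
    # Resample the puzzles with replacement and record each resample's solve rate.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Hypothetical data: 70 solved and 30 failed puzzles out of 100.
outcomes = [1] * 70 + [0] * 30
print(bootstrap_ci(outcomes))
```

This only captures uncertainty from the choice of puzzles. Capturing run-to-run variation in the model's answers would additionally require sampling each puzzle several times, as the comment suggests.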
