You might want to replace the single-page format with showing just one question at a time, giving instant feedback after each answer.
First, it'd be more engaging. Even the small version of the quiz is a bit long for something where you don't know what the payoff will be. Second, you'd get to see the correct answer while still having the context on why you replied the way you did.
I think the title is mainly a reference to the TV show “Are you smarter than a fifth grader?”
Fittingly, then, a lot of the questions they asked on that TV show were mostly trivia, which I also don’t think of as a particularly important characteristic of being “smart”.
When I think of “smart” people, I think of people who can take a limited amount of information and connect dots in ways that others can’t. Of course it also builds on knowledge: you need to have specific knowledge in the first place to make connections. But knowing facts like “the battle of so-and-so happened on August 18th 1924, one hundred years ago today” alone is not “smart”. A smart person is someone who uses knowledge in a surprising way, or in a way that others would not have been able to. After the smart person has made the connection, others might go “oh, that’s so obvious, why didn’t I think of that” or even “yeah, that’s really obvious, I could’ve thought of that too”. And yet the first person to actually make, and properly communicate, that connection was the smart one. Smart exactly because they did.
When I tested it this way it resulted in less of an emotional reaction.
you: 0/1
gpt-4o: 0/1
gpt-4: 0/1
gpt-4o-mini: 0/1
llama-2-7b: 0/1
llama-3-8b: 0/1
mistral-7b: 0/1
unigram: 0/1
I bet this could be a unique testing resource for aspiring Jeopardy contestants.
I wouldn't call the quiz fun exactly. After playing with it a lot I think I've been able to consistently get above 50% of questions right. I have slowed down a lot answering each question, which I think LLMs have trouble doing.
I'd like to hear more on this.
you: 4/15
gpt-4o: 0/15
gpt-4: 1/15
gpt-4o-mini: 2/15
llama-2-7b: 2/15
llama-3-8b: 3/15
mistral-7b: 4/15
unigram: 1/15
Seems like none of us is really better than flipping a coin, so I'd wager that you cannot accurately predict the next word with the given information.
If one could instead sort the answers by likelihood and get scored based on how high one ranked the correct answer, things would probably look better than random.
Also I wonder how these LLMs were prompted. Were they just used to complete the text, or were they put in a "mood" where they would try to complete the text in the original author's voice?
Obviously, as a human, I'd try to put myself in the author's head and emulate their way of speaking, whereas an LLM might just complete things in its default voice.
The language models were prompted with the text + each candidate answer, and the one with the lowest perplexity was picked. I tried to avoid instruction tuned models wherever possible to avoid the "voice" problem.
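Roughly, it looked something like this; just a sketch with HuggingFace transformers, and the checkpoint name here is only an example, not necessarily one of the exact models used:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Example base (non-instruct) checkpoint; purely illustrative.
    name = "mistralai/Mistral-7B-v0.1"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()

    def avg_nll(text):
        # Average negative log-likelihood of the text; lower = less "surprised".
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            return model(ids, labels=ids).loss.item()

    def pick(prompt, candidates):
        # Score prompt + each candidate word and keep the lowest-perplexity one.
        return min(candidates, key=lambda w: avg_nll(prompt + " " + w))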
the task of "predicting the next word" can be understood as either "correctly choosing the next word in the hidden context", or "predicting the likelihood of each possible word".
the quiz is evaluating against the former, but humans are still far from being able to express a percentile likelihood for each possibility.
i only consciously arrive at a vague feeling of confidence, rather than being able to weigh the prediction of each word with fractional precision.
one might say that LLMs have above human introspective ability in that regard.
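a rough sketch of the difference (the checkpoint is just an example, and it assumes each candidate is a single token, which real words often aren't):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "mistralai/Mistral-7B-v0.1"  # example checkpoint only
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()

    def candidate_probs(prompt, candidates):
        # "predicting the likelihood of each possible word":
        # read off the whole next-token distribution instead of just the top pick.
        ids = tok(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits[0, -1]
        probs = torch.softmax(logits, dim=-1)
        # assumes each candidate maps to one token; multi-token words need more care
        return {w: probs[tok(" " + w, add_special_tokens=False).input_ids[0]].item()
                for w in candidates}

    # "correctly choosing the next word" is then just the max of those numbers,
    # which throws away the rest of the distribution.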
This is presumably also a simple strategy for detecting AI content in general - see how many “high temperature” choices it makes.
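As a sketch (illustrative checkpoint, arbitrary threshold), you'd count how often the text picks words the model considers unlikely:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "mistralai/Mistral-7B-v0.1"  # illustrative only
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()

    def surprise_rate(text, threshold=0.01):
        # Fraction of tokens the model considered unlikely ("high temperature" picks).
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits
        probs = torch.softmax(logits[0, :-1], dim=-1)   # predictions for tokens 1..n
        actual = ids[0, 1:]                             # the tokens actually written
        p = probs[torch.arange(actual.numel()), actual]
        return (p < threshold).float().mean().item()

    # The hypothesis: human text should score higher here than low-temperature
    # model output. A heuristic, not a reliable detector.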
I see that you get a random quiz every time, so results aren't comparable between people. I think I got an easy one. Neat game! If you could find a corpus that makes it easy for average humans to beat the LLMs, and add some nice design, maybe Wordle-style daily challenge plus social sharing etc, I could see it going viral just as a way for people to "prove" that they are "smarter" than AI.
> You scored 28/100. The best language model, gpt-4, scored 32/100. The unigram model, which just picks the most common word without reading the prompt, scored 28/100.
Assuming question difficulty averages out at N=100, a small test where the LLMs score above ~5 is an "easy" one.
Finally a use for all the wasted hours I’ve spent on HN — my next word prediction is marginally better than that of the AI.
Quizzes can be magical.
Haven't seen any cooler new language-related interactive fun-project on the web since:
It would be great if the quiz included an intro or note about the training data, but as-is it also succeeds because it's obvious from the quiz prompts/questions that they're related to HN comments.
Sharing this with a general audience could spark funny discussions about bubbles and biases :)
For the longer comments I understand, but for the ones where it's 1 or 2 words and many of the options are correct English phrases, I don't understand why there's bias towards one? Wouldn't we need a prompt here?
Also, I got bored halfway through and selected "D" for all of them
edit: judging from the comments I saw, they were all quite recent, so I guess this isn't happening. Though I do know that ChatGPT can sometimes use a Bing search tool during chats, which can actually link to recently indexed text, but I highly doubt that the gpt4o-mini API model is doing that.
Who's Smarter: AI or a 5-Year-Old?
I don't see what this has to do with being "smarter" than anything. Example:
1. I see a business decision here. Arm cores have licensing fees attached to them. Arm is becoming ____
a) ether
b) a
c) the
d) more
But who's to say which is "correct"? Arm is becoming a household name. Arm is becoming the premier choice for new CPU architectures. Arm is becoming more valuable by the day. Any of b), c), or d) are equally good choices. What is there to be gained in divining which one the LLM would pick?
I propose you do the same thing, but only include HN content from before the existence of LLMs. That should ensure there is no bias towards any of the models.
> I made a little game/quiz where you try to guess the next word in a bunch of Hacker News comments
So I guess the correct answer comes from the HN user who wrote the comment?
Keep in mind that you took 204 seconds to answer the questions, whereas the slowest language model was llama-3-8b taking only 10 seconds!
you: 8/15
gpt-4o: 2/15
gpt-4: 4/15
gpt-4o-mini: 4/15
llama-2-7b: 5/15
llama-3-8b: 5/15
mistral-7b: 6/15
unigram: 5/15
> You scored 8/15. The best language model, mistral-7b, scored 6/15. The unigram model, which just picks the most common word without reading the prompt, scored 5/15.
(In I think 120 seconds - didn't copy that part.)
Interesting that results differ this much between runs (for the LLMs).
Surely someone did better than me on their first run?
Edit: I wonder if the human scores correlate with the age of the HN account?
Based on what? The whole test is flawed because of this. Even different LLMs would choose different answers and there's no objective argument to make for which one is the best.
This is a magical experience. I've done something similar in my university's CS department when I pointed out how the learning experience in the first programming course varies too much depending upon who the professor is.
I've never experienced this anywhere else. American politicians at all levels don't appear to be the least bit responsive to the needs and issues of anyone but the wealthy and powerful.
On a more serious note it was a cool thing to go through! It seemed like something that should have been so easy at first glance.
1. perhaps even out of variants generated by other LLMs
I think I did worse when the prompt is shorter. It just becomes a guessing game then and I find myself thinking more like a language model.
It would be interesting to try varying it, as well as the seed.
> You scored 7/15. The best language model, mistral-7b, scored 7/15.
I guess it's a success
I'm pretty certain that LLMs are unable to work at all without context.
IMHO that doesn't make it nonsense, but maybe you are reading something different into the purpose of this test to what I am.
LLMs are effectively DAGs; they literally have to unroll infinite possibilities, in the absence of larger context, into finite options.
You can unroll a cyclic graph into a DAG, but you constrict the solution space.
Take the spoken sentence:
"I never said she stole my money"
And say it multiple times with emphasis on each word and notice how the meaning changes.
That is text being a forgetful functor.
As you can describe it as PAC learning, or as compression, which is exactly equivalent to the finite set shattering above, you can assign probabilities to next tokens.
But that is existential quantification, limited to your corpus and based on pattern matching and finding.
I guess if "Smart" is defined as pattern matching and finding it would apply.
But this is exactly why there was a split between symbolic AI, which targeted universal quantification, and statistical learning, which targets existential quantification.
Even if ML had never been invented, I would assume that there were mechanical methods to stack rank next tokens from a corpus.
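Even plain bigram counts over a corpus would do that; a toy sketch with made-up data:

    from collections import Counter, defaultdict

    # Toy corpus; in practice this would be the comment dump itself.
    corpus = "the cat sat on the mat and the cat slept".split()

    # Count how often each word follows each other word.
    follows = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        follows[prev][nxt] += 1

    def rank_next(word):
        # Stack-rank candidate next tokens purely by corpus counts; no learning involved.
        return [w for w, _ in follows[word].most_common()]

    print(rank_next("the"))  # ['cat', 'mat']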
This isn't a case of 'smarter', but just different. If that difference is meaningful depends on context.
gpt-4o: 5/15
gpt-4: 5/15
gpt-4o-mini: 5/15
llama-2-7b: 6/15
llama-3-8b: 6/15 (Slowest Bot: 14sec)
mistral-7b: 5/15
unigram: 2/15
gpt-4o: 5/15
gpt-4: 5/15
gpt-4o-mini: 4/15
llama-2-7b: 7/15
llama-3-8b: 7/15
mistral-7b: 7/15
unigram: 4/15