You might want to replace the single-page format with showing just one question at a time, giving instant feedback after each answer.
First, it'd be more engaging. Even the small version of the quiz is a bit long for something where you don't know what the payoff will be. Second, you'd get to see the correct answer while still having the context on why you replied the way you did.
I think the title is mainly a reference to the TV show “Are you smarter than a fifth grader?”
Fittingly, then, a lot of the questions they asked on that TV show were mostly trivia, which I also don’t think of as a particularly important characteristic of being “smart”.
When I think of “smart” people, I think of people who can take a limited amount of information and connect dots in ways that others can’t. Of course it also builds on knowledge: you need to have specific knowledge in the first place to make connections. But knowing facts like “the battle of so-and-so happened on August 18th 1924, one hundred years ago today” alone is not “smart”. A smart person is someone who uses knowledge in a surprising way, or in a way that others would not have been able to. After the smart person has made the connection, others might go “oh, that’s so obvious, why didn’t I think of that” or even “yeah, that’s really obvious, I could’ve thought of that too”. And yet the first person to actually make, and properly communicate, that connection was the smart one. Smart exactly because they did.
When I tested it this way it resulted in less of an emotional reaction.
you: 0/1
gpt-4o: 0/1
gpt-4: 0/1
gpt-4o-mini: 0/1
llama-2-7b: 0/1
llama-3-8b: 0/1
mistral-7b: 0/1
unigram: 0/1
I bet this could be a unique testing resource for aspiring Jeopardy contestants.
I wouldn't call the quiz fun exactly. After playing with it a lot I think I've been able to consistently get above 50% of questions right. I have slowed down a lot answering each question, which I think LLMs have trouble doing.
I'd like to hear more on this.
you: 4/15
gpt-4o: 0/15
gpt-4: 1/15
gpt-4o-mini: 2/15
llama-2-7b: 2/15
llama-3-8b: 3/15
mistral-7b: 4/15
unigram: 1/15
Seems like none of us is really better than flipping a coin, so I'd wager that you cannot accurately predict the next word with the given information.
If one could instead sort the answers by likelihood and get scored based on how high one ranked the correct answer, things would probably look better than random.
Also I wonder how these LLMs were prompted. Were they just used to complete the text, or were they put in a "mood" where they would try to complete the text in the original author's voice?
Obviously, as a human, I'd try to put myself in the author's head and emulate their way of speaking, whereas an LLM might just complete things in its default voice.
The language models were prompted with the text + each candidate answer, and the one with the lowest perplexity was picked. I tried to avoid instruction tuned models wherever possible to avoid the "voice" problem.
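Roughly, it looked something like this; just a sketch with HuggingFace transformers, and the checkpoint name here is only an example, not necessarily one of the exact models used:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Example base (non-instruct) checkpoint; purely illustrative.
    name = "mistralai/Mistral-7B-v0.1"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()

    def avg_nll(text):
        # Average negative log-likelihood of the text; lower = less "surprised".
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            return model(ids, labels=ids).loss.item()

    def pick(prompt, candidates):
        # Score prompt + each candidate word and keep the lowest-perplexity one.
        return min(candidates, key=lambda w: avg_nll(prompt + " " + w))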
the task of "predicting the next word" can be understood as either "correctly choosing the next word in the hidden context", or "predicting the likelihood of each possible word".
the quiz is evaluating against the former, but humans are still far from being able to express a percentile likelihood for each possibility.
i only consciously arrive at a vague feeling of confidence, rather than being able to weigh the prediction of each word with fractional precision.
one might say that LLMs have above human introspective ability in that regard.
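a rough sketch of the difference (the checkpoint is just an example, and it assumes each candidate is a single token, which real words often aren't):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "mistralai/Mistral-7B-v0.1"  # example checkpoint only
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()

    def candidate_probs(prompt, candidates):
        # "predicting the likelihood of each possible word":
        # read off the whole next-token distribution instead of just the top pick.
        ids = tok(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits[0, -1]
        probs = torch.softmax(logits, dim=-1)
        # assumes each candidate maps to one token; multi-token words need more care
        return {w: probs[tok(" " + w, add_special_tokens=False).input_ids[0]].item()
                for w in candidates}

    # "correctly choosing the next word" is then just the max of those numbers,
    # which throws away the rest of the distribution.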
This is presumably also a simple strategy for detecting AI content in general - see how many “high temperature” choices it makes.
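As a sketch (illustrative checkpoint, arbitrary threshold), you'd count how often the text picks words the model considers unlikely:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "mistralai/Mistral-7B-v0.1"  # illustrative only
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()

    def surprise_rate(text, threshold=0.01):
        # Fraction of tokens the model considered unlikely ("high temperature" picks).
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits
        probs = torch.softmax(logits[0, :-1], dim=-1)   # predictions for tokens 1..n
        actual = ids[0, 1:]                             # the tokens actually written
        p = probs[torch.arange(actual.numel()), actual]
        return (p < threshold).float().mean().item()

    # The hypothesis: human text should score higher here than low-temperature
    # model output. A heuristic, not a reliable detector.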
I see that you get a random quiz every time, so results aren't comparable between people. I think I got an easy one. Neat game! If you could find a corpus that makes it easy for average humans to beat the LLMs, and add some nice design, maybe Wordle-style daily challenge plus social sharing etc, I could see it going viral just as a way for people to "prove" that they are "smarter" than AI.
> You scored 28/100. The best language model, gpt-4, scored 32/100. The unigram model, which just picks the most common word without reading the prompt, scored 28/100.
Assuming question difficulty averages out at N=100, a small test where the LLMs score above ~5 is an "easy" one.
Finally a use for all the wasted hours I’ve spent on HN — my next word prediction is marginally better than that of the AI.
Quizzes can be magical.
Haven't seen any cooler new language-related interactive fun-project on the web since:
It would be great if the quiz included an intro or note about the training data, but as-is it also succeeds because it's obvious from the quiz prompts/questions that they're related to HN comments.
Sharing this with a general audience could spark funny discussions about bubbles and biases :)
For the longer comments I understand, but for the ones where it's 1 or 2 words and many of the options are correct English phrases, I don't understand why there's bias towards one? Wouldn't we need a prompt here?
Also, I got bored halfway through and selected "D" for all of them
edit: judging from the comments I saw, they were all quite recent, so I guess this isn't happening. Though I do know that ChatGPT can sometimes use a Bing search tool during chats, which can actually link to recently indexed text, but I highly doubt that the gpt4o-mini API model is doing that.
Who's Smarter: AI or a 5-Year-Old?
I don't see what this has to do with being "smarter" than anything. Example:
1. I see a business decision here. Arm cores have licensing fees attached to them. Arm is becoming ____
a) ether
b) a
c) the
d) more
But who's to say which is "correct"? Arm is becoming a household name. Arm is becoming the premier choice for new CPU architectures. Arm is becoming more valuable by the day. Any of b), c), or d) are equally good choices. What is there to be gained in divining which one the LLM would pick?
I propose you do the same thing, but only include HN content from before the existence of LLMs. That should ensure there is no bias towards any of the models.
> I made a little game/quiz where you try to guess the next word in a bunch of Hacker News comments
So I guess the correct answer comes from the HN user who wrote the comment?
Keep in mind that you took 204 seconds to answer the questions, whereas the slowest language model was llama-3-8b taking only 10 seconds!
you: 8/15
gpt-4o: 2/15
gpt-4: 4/15
gpt-4o-mini: 4/15
llama-2-7b: 5/15
llama-3-8b: 5/15
mistral-7b: 6/15
unigram: 5/15
> You scored 8/15. The best language model, mistral-7b, scored 6/15. The unigram model, which just picks the most common word without reading the prompt, scored 5/15.
(In I think 120 seconds - didn't copy that part.)
Interesting that results differ this much between runs (for the LLMs).
Surely someone did better than me on their first run?
Edit: I wonder if the human scores correlate with the age of the HN account?
Based on what? The whole test is flawed because of this. Even different LLMs would choose different answers and there's no objective argument to make for which one is the best.
This is a magical experience. I've done something similar in my university's CS department when I pointed out how the learning experience in the first programming course varies too much depending upon who the professor is.
I've never experienced this anywhere else. American politicians at all levels don't appear to be the least bit responsive to the needs and issues of anyone but the wealthy and powerful.
On a more serious note it was a cool thing to go through! It seemed like something that should have been so easy at first glance.
1. perhaps even out of variants generated by other LLMs
I think I did worse when the prompt is shorter. It just becomes a guessing game then and I find myself thinking more like a language model.
It would be interesting to try varying it, as well as the seed.
> You scored 7/15. The best language model, mistral-7b, scored 7/15.
I guess it's a success
I'm pretty certain that LLMs are unable to work at all without context.
IMHO that doesn't make it nonsense, but maybe you are reading something different into the purpose of this test to what I am.
LLMs are effectively DAGs; they literally have to unroll infinite possibilities, in the absence of larger context, into finite options.
You can unroll a cyclic graph into a DAG, but you constrict the solution space.
Take the spoken sentence:
"I never said she stole my money"
And say it multiple times with emphasis on each word and notice how the meaning changes.
That is text being a forgetful functor.
As you can describe it as PAC learning, or as compression, which is exactly equivalent to the finite set shattering above, you can assign probabilities to next tokens.
But that is existential quantification, limited to your corpus and based on pattern matching and finding.
I guess if "Smart" is defined as pattern matching and finding it would apply.
But this is exactly why there was a split between symbolic AI, which targeted universal quantification, and statistical learning, which targets existential quantification.
Even if ML had never been invented, I would assume that there were mechanical methods to stack rank next tokens from a corpus.
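Even plain bigram counts over a corpus would do that; a toy sketch with made-up data:

    from collections import Counter, defaultdict

    # Toy corpus; in practice this would be the comment dump itself.
    corpus = "the cat sat on the mat and the cat slept".split()

    # Count how often each word follows each other word.
    follows = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        follows[prev][nxt] += 1

    def rank_next(word):
        # Stack-rank candidate next tokens purely by corpus counts; no learning involved.
        return [w for w, _ in follows[word].most_common()]

    print(rank_next("the"))  # ['cat', 'mat']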
This isn't a case of 'smarter', but just different. If that difference is meaningful depends on context.
gpt-4o: 5/15
gpt-4: 5/15
gpt-4o-mini: 5/15
llama-2-7b: 6/15
llama-3-8b: 6/15 (Slowest Bot: 14sec)
mistral-7b: 5/15
unigram: 2/15
gpt-4o: 5/15
gpt-4: 5/15
gpt-4o-mini: 4/15
llama-2-7b: 7/15
llama-3-8b: 7/15
mistral-7b: 7/15
unigram: 4/15