PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models

(arxiv.org)

174 points · enum · 1y ago · 80 comments

60 comments · 14 top-level

rahidz · 1y ago · 13 in thread

What is so interesting to me is that the reasoning traces for these often have the correct answer, but the model fails to realize it.

Problem 3 ("Dry Eye"), R1: "Wait, maybe "cubitus valgus" – no, too long. Wait, three letters each. Let me think again. Maybe "hay fever" is two words but not three letters each. Maybe "dry eye"? "Dry" and "eye" – both three letters. "Dry eye" is a condition. Do they rhyme? "Dry" (d-rye) and "eye" (i) – no, they don't rhyme. "Eye" is pronounced like "i", while "dry" is "d-rye". Not the same ending."

Problem 8 ("Foot nose"), R1: "Wait, if the seventh letter is changed to next letter, maybe the original word is "footnot" (but that's not a word). Alternatively, maybe "foot" + "note", but "note" isn't a body part."

Workaccount2 · 1y ago

The other day I watched Gemini reasoning go in circles for ~30 seconds trying to determine if 9.11 was greater than 9.9.

----

"Realize a direct comparison of tenths is not enough: Wait, this is not right. In 9.9, the tenths digit is 9. In 9.11, the tenths digit is 1. This initial comparison is incorrectly stated. Let's re-examine.

Re-examine tenths place comparison (correctly): In 9.9, the tenths digit is 9. In 9.11, the tenths digit is 1. Wait again! This is still incorrect. I'm misreading the digits. Let's slow down.

Re-examine the numbers digit by digit from left to right, after the decimal:

9.9: 9 in the tenths place, nothing explicitly written in the hundredths place (we can think of it as 0 in the hundredths place, so 9.90).

9.11: 1 in the tenths place, 1 in the hundredths place.

Compare tenths places: In 9.9 (or 9.90), the tenths digit is 9. In 9.11, the tenths digit is 1. Wait... something is still wrong with my comparison logic. I'm getting confused. Let's re-read the numbers again.

Correctly read and compare tenths digits:

9.9 - Tenths digit is 9.

9.11 - Tenths digit is 1. This is still wrong! I am consistently misreading the numbers. Let's go back to the start."

----

This shows steps 5 to 9. It went to step 35 before determining that 9.9 > 9.11. This was after it decided to convert to money, $9.11 vs $9.90.

hto2i4j234234 · 1y ago

I wonder how much of this arithmetic confusion comes from software versioning numbers where the decimal character is actually a tuple-delimiter.
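
To make the two readings concrete, here is a minimal sketch (not from the thread) that compares the same two strings once as decimal numbers and once as dot-separated version tuples. The helper names `compare_as_decimals` and `compare_as_versions` are hypothetical, chosen only for illustration.

```python
from decimal import Decimal


def compare_as_decimals(a: str, b: str) -> int:
    """Return -1, 0, or 1 comparing a and b as decimal numbers."""
    x, y = Decimal(a), Decimal(b)
    return (x > y) - (x < y)


def compare_as_versions(a: str, b: str) -> int:
    """Return -1, 0, or 1 comparing a and b as dot-separated version tuples."""
    x = tuple(int(part) for part in a.split("."))
    y = tuple(int(part) for part in b.split("."))
    return (x > y) - (x < y)


print(compare_as_decimals("9.11", "9.9"))  # -1: as a number, 9.11 < 9.90
print(compare_as_versions("9.11", "9.9"))  # 1: as a version, (9, 11) > (9, 9)
```

The same characters give opposite orderings depending on which convention is assumed, which is the ambiguity the comment above is pointing at.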

fzzzy · 1y ago

Training a model to reason that it should write and execute code is the solution here. Synthetic datasets showing lots of math questions followed by reasoning that writes and executes the program for the answer.

Validark · 1y ago

Obviously very stupid reasoning going on, but reasoning nonetheless? It makes me think we're on the right track that it basically seems to know what steps should be taken and how to step through the steps. I don't know why it is getting so incredibly tripped up, maybe it's extremely uncertain about whether it can map "9.9"["tenths place"] => "9". But this is still impressive to me that a machine is doing this.

empath75 · 1y ago

This and its struggles with spelling questions are both artifacts of tokenization and not really a failure of reasoning. I think there's probably a simple solution that solves both this and the "how many r's are there in strawberry" problem, though I don't know what it would be.

photonthug · 1y ago

This is hilarious, and makes me wonder whether there’s some main place where people are archiving examples of AI fails now. It would be amusing, but also seems like a public service and might help to avoid billions of dollars getting burnt at the altar of hype.

sd9 · 1y ago

I wonder if RLHF interfered with 9.11 (which could be interpreted as a date), preventing the model from considering it naturally.

Wonder if the same thing would have happened with 9.12.

What was your original prompt?

armcat · 1y ago

It feels like a lot of the reasoning tokens go to waste on a pure brute-force approach: plugging in numbers, evaluating, and comparing against the answer. "Nope, that didn't work, let's try 4 instead of 6 this time", etc. What if the reward function instead focused on diversity of procedures within a token budget (10k - 20k tokens)? I.e., RL rewards the model for trying different methods or generating different hypotheses, rather than brute-forcing its way through and potentially getting stuck in loops.

ANighRaisin · 1y ago

I would say that diversity isn't something that's easy to reinforce, but I do think it will occur as a natural consequence of optimizing for shorter chains of thought across a wide variety of problems. Of course, the nature of the data may lead it to do brute force, but that can be fixed with clever fine-tuning.

enum (OP) · 1y ago

The nature of the problems makes it relatively easy to follow along with the models' reasoning and reasoning errors. For example, on this problem (answer "New England"):

> Think of a place in America. Two words, 10 letters altogether. The first five letters read the same forward and backward. The last five letters spell something found in the body. What place is this?

R1 fixates on answers of the form "CITY, STATE" and eventually returns some confidently wrong nonsense. It doesn't try to explore answers that don't fit the "CITY, STATE" template.
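
For reference, checking a candidate against this clue is mechanical once the candidate is on the table. A minimal sketch, assuming a tiny hand-written word list (`BODY_WORDS`) and candidate list (`CANDIDATES`), both hypothetical and not taken from the paper:

```python
# Hypothetical, deliberately tiny word list of things found in the body.
BODY_WORDS = {"gland", "heart", "liver", "brain", "spine"}

# Hypothetical candidates; a real search would need a much larger list.
CANDIDATES = ["Miami Beach", "New England", "Rhode Island"]


def fits_clue(place: str) -> bool:
    """Two words, 10 letters, palindromic first five, body word in last five."""
    words = place.split()
    letters = "".join(words).lower()
    if len(words) != 2 or len(letters) != 10:
        return False
    first, last = letters[:5], letters[5:]
    return first == first[::-1] and last in BODY_WORDS


# Generate-and-test: keep only the candidates that satisfy every constraint.
print([place for place in CANDIDATES if fits_clue(place)])  # ['New England']
```

The check treats the ten letters as one string, so it does not care where the word boundary falls. That is exactly the step R1 skips when it only considers "CITY, STATE" shapes.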

empath75 · 1y ago

o1 high got this after I gave it the hint that the first five letters are not a single word.

viraptor · 1y ago

I hope the new models will be trained with better words to continue the thought process. Right now it seems like "wait", "but", "let me think again", etc. are the main ones, which seem to encourage self-doubt too much. They need some good balance instead.

dr_kiszonka · 1y ago

It would be fun to experiment with, e.g., positive self-talk like "you've got this", "you've trained for this," etc.

mkoubaa · 1y ago · 8 in thread

The term "PhD Knowledge" makes me cringe. PhD is about contribution to a field, not a standardized test score or a weird status-bearing substitute for IQ.

glial · 1y ago

Also - at least in my experience - doing a PhD makes a person more aware of the boundaries of knowledge, and therefore generally pretty skeptical, and capable of being usefully critical, about developments in their field.

enum (OP) · 1y ago

There is a trend of building hard benchmarks for LLMs, such as GPQA or Humanity's Last Exam, with questions crafted by PhDs. Ergo, most people cannot understand what these benchmarks are testing. If you're an expert in something, and if you're on HN you may be an expert in some sub-field of CS, I recommend trying to solve the questions from your field on Humanity's Last Exam. I took a crack at the Programming Languages questions myself.

charlieyu1 · 1y ago

- wants PhD knowledge
- pays like $100-500 per question

krisoft · 1y ago

In this case it is shorthand for specialist-level knowledge: the kind of information you would expect a person pursuing a PhD in a specific field to know, but someone not doing that to not know.

For example, the “Humanity's Last Exam” AI evaluation dataset has this question: “Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.”

That’s the kind of “PhD knowledge” which this evaluation does not require.

mkoubaa · 1y ago

That's a jargon and search problem at best. I'm not convinced at all.

levocardia · 1y ago

You cannot get a PhD (in the US) without also passing an examination that, nominally at least, demonstrates you have broad and deep knowledge about your field. The qualifying exam is not the most important part, but it is a necessary part.

enum (OP) · 1y ago

Some still exist, but this is less and less true over time. Even when a requirement is called an "exam", it isn't what a high-schooler would recognize as one.

Many of the exams described on this page are personalized and only require mastery of a handful of papers that really matter to the individual student, and knowledge of related work:

https://www.cs.stanford.edu/phd-program-requirements-qualify...

I've been through something that sounds similar to these, and I would not have called it an exam.

buescher · 1y ago

It means you have taken graduate level courses and have completed an apprenticeship in doing publishable academic research. Now that a master's is the new bachelor's there's something of a trend to think of a PhD as the new master's but it's really qualitatively different.

XCabbage · 1y ago · 5 in thread

This doesn't feel like a "reasoning" challenge. The mental skill required to solve most of these seems to be the ability to loop over all known members of a category like "popular brand names" or "well-known actors" and see if they fit the clue.

As a human, you'd expect to fail either because you didn't know a category member (e.g. as a non-American I have no idea WTF "Citgo" is; I could never get the answer to the first question because I have never seen that name before in my life) or because you weren't able to bring it to mind; the mental act of looping over all members of a category is quite challenging for a human.

Admittedly this is something an AI system could in principle be REALLY good at, and it's interesting to test and see that current ones are not! But it seems weird to me to call what's being tested "reasoning" when it's so heavily focused on memory recall (and evaluating whether a candidate answer works or not is trivial once you've brought it to mind and doesn't really require any intelligent thought).

(If the questions were multiple-choice, eliminating the challenge of bringing candidate answers to mind that is the main challenge for a human, then I'd agree it was a "reasoning" test.)

jkhdigital · 1y ago

I had the same thought. It reminds me of solving Project Euler problems, where there is often an obvious naive approach which is guaranteed to produce the correct answer but would consume prohibitive memory/compute resources to execute to completion. I suspect the models would perform much better if prompted to formulate a strategy for efficiently solving these challenges rather than solving them directly… which indicates a direction for potential improvement I suppose.

tkgally · 1y ago

I agree that recall seems to play an important role in solving these problems. Similar to how the ARC-AGI problems seem to depend on visual perception of shapes and colors. When I come up with the correct answers to such puzzles, I feel subjectively that the answers flashed into my mind, not that I reasoned my way to them.

szundi · 1y ago

Your base model is good then

enum (OP) · 1y ago

It's definitely U.S.-centric.

But I do think this is reasoning. It requires recall, but anything other than a pure logic puzzle does. For example, on a competition math problem or a programming problem, no person or LLM is inventing well-known lemmas and algorithms from first principles.

XCabbage · 1y ago

It's not just that it requires recall. It's that it requires no non-trivial thought beyond recall.

windsignaling · 1y ago · 4 in thread

I'm not a fan of these "gotchas" because they don't test for what we really care about.

Like counting the number of R's in strawberry, many of these are character-counting or character manipulation problems which tokenization is not well-suited for.

I'm sure an engineer could come up with a clever way to train for this, but that seems like optimizing for the wrong thing.

IMO these questions go in the wrong direction. Character permutation is a problem for "Software 1.0", not LLMs. Just as you wouldn't use an LLM to multiply 2 large numbers, you'd use a calculator.
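
As a rough illustration of the "Software 1.0" point, both tasks are one-liners in ordinary code. This is only a sketch; `count_letter` is a hypothetical helper name and the two large integers are arbitrary.

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


print(count_letter("strawberry", "r"))  # 3

# Python integers have arbitrary precision, so the product is exact.
a = 123456789123456789
b = 987654321987654321
print(a * b)
```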

michaelt · 1y ago

The problem is some of the "gotchas" seem rather important in nontrivial applications.

Imagine a model that isn't sure if 9.11 is greater than 9.9 - which is difficult to reason about, because tokens.

Could such a model coach kids in math? Could it proofread a paper, or sense-check a business plan? Could it summarise a long document about carbon emissions? Could it generate a GUI? Could it spot mistakes in an OCRed document? Spot an off-by-one error or divide-by-zero in computer code?

aprilthird2021 · 1y ago

The gotchas are good to help outline where the risk is when using these models. What you and I care about might change and one day counting letters in strings or solving trivia puzzles may be something we care about. It's nice to know the fuzzy edges of the system we are relying on day to day.

In fact, your final statement that these are tasks software should do rather than LLMs, is only proven to more people and made more clear by the prominence of these "gotchas"

enum (OP) · 1y ago

The problems are not important, but they illustrate failures that are. For example:

- The paper has an example where the model reasons "I'm frustrated" and then produces an answer that it "knows is wrong". You wouldn't know it if you didn't examine the reasoning tokens.

- There are two examples where R1 often gets stuck "thinking forever".

If these failures happen on these questions, where else can they happen? We'll start to find out soon enough.

Workaccount2 · 1y ago

Someone needs to make a data transformation benchmark.

"Here are a variety of personal documents about John Doe. Fill out the McDonalds job application with information retrieved from the document set."

lokimedes · 1y ago · 4 in thread

As if the whole anti-intellectual hunt wasn’t enough, now PhD is a category implying holder of rote-knowledge at the highest level. I guess it is hopeless to fight this, but a PhD is 100x more about the apprenticeship and real-world training as a scientist than any accumulated knowledge beyond one's prior training.

I know this is a rant, sorry, just so tired of the stupidity.

DiscourseFan · 1y ago

PhD students aren't necessarily the brightest (which is probably why many of them don't end up with tenure), but studying something at the graduate level does open your eyes to a particular kind of knowledge which is generally unavailable otherwise. So, yes, of course the work is more important than the credential, but the credential often stands in for the work that gives that knowledge.

klipt · 1y ago

> which is probably why many of them don't end up with tenure

There are far fewer tenure-track positions than PhDs, so most of them not ending up with tenure is tautologically guaranteed.

mlinhares · 1y ago

There’s no multibillion dollar business if they can’t say everyone will be replaced by their models.

atrettel · 1y ago

This is a bit of a pet peeve of mine too. I surely learned a lot of "knowledge" during my PhD, but really what makes me an expert in my subject is less my ability to rattle off facts and more

1. my ability to reason and understand the subject after a lot of experience working in it, and

2. the skills that I picked up during my PhD and afterwards.

I can really only "know" so much, but the skills that I developed help me get the right information without having to "know" it directly --- either by searching the literature, running a calculation or simulation, or performing an experiment. There are a lot of things that I have difficulty remembering but I only "know" them because I remember how to derive them and where they come from, etc. That's one of the reasons that I don't like the assertion that this model has ingested all of this knowledge and it is now an expert, because expertise goes beyond having knowledge. It's about developing a deeper understanding of a subject that only comes from actively engaging with it for a long time.

enum (OP) · 1y ago · 3 in thread

Results and dataset explorer here: https://huggingface.co/spaces/nuprl/verbal-reasoning-challen...

mh- · 1y ago

For ID=3, it shows o1 getting it wrong, but it seems to have succeeded? It did add a space between Tinker and bell, but that is the canonical way of spelling the character apparently.

(That one just caught my attention because I was curious what challenge o1-mini got correct that o1 did not.)

enum (OP) · 1y ago

Thanks, fixed. (Spaces rebuilding.) We have manually combed labelled-wrong answers and tweaked the predicates that check correctness. Sorry we missed this one.

ks2048 · 1y ago

Is there some standard markup assumed for LLMs? I just looked at a couple and saw markdown-like text (which I expected), but also "$\boxed{foo}$". Looks like LaTeX, but is there some subset of LaTeX LLMs are assumed to know/use?

zinccat · 1y ago · 3 in thread

My feeling is that a lot of the challenge could come from the tokenizer used by the model, similar to the "r in strawberry" problem.

enum (OP) · 1y ago

There is likely a little bit of that, but I don't think that's the primary factor that makes these hard. If you look at the reasoning performed by R1 ("Show Thoughts" checkbox), it has no problem splitting and joining candidate words into letters, or even doing the letter-level manipulations. Just to give an example:

Take a familiar brand name, seen along roads and highways in the United States. It has five letters, two syllables. The first syllable, phonetically, is a word that is the opposite of the word spelled by the second syllable. What brand name is it?

<think> Okay, let me try to figure out this riddle. The question is about a brand name that's familiar on roads and highways in the US. It has five letters and two syllables. The key part is that the first syllable phonetically is the opposite of the second syllable spelled out. Hmm.

First, I need to think of common brands seen on highways. Gas stations, fast food, hotels, maybe car brands. Five letters, two syllables. Let's list some possibilities: Exxon, Shell, Chevron, Mobil, BP (but BP is two letters, so probably not). Maybe food places like Subway, but that's six letters. McDonald's is too long. Burger King? No. Maybe hotels: Motel 6? No, that's a number. Holiday Inn? Too long. Maybe Best Western? No, too long.

Wait, five letters. Let's think of two-syllable brands. Exxon (E-xxon), but Exxon is five letters. First syllable is "Ex", second is "xon". But what's the opposite of "xon"? That doesn't make sense. Maybe the first syllable is a word, and the second syllable when spelled is the opposite. So, for example, if the first syllable is "up", the second syllable's letters spell "down". But "updown" isn't a brand. Wait, the first syllable phonetically is a word, and the second syllable's spelling is the opposite. So when you say the first syllable, it's a word, and the letters of the second syllable spell its opposite.

So, for example, if the brand is "Updn" (made up), where "up" is the first syllable, and "dn" spells "down". But that's not a real brand. Let's think of real brands. Maybe "Comcast"? No, too long. "Verizon"? Also longer. "Sprint"? Five letters. Sprint. S-print. But "S" doesn't mean anything opposite of "print". Hmm. Maybe "Exxon" again. "Ex" as in former, and the opposite would be "xon" spelled as... no.

Wait, maybe the first syllable is a word that's the opposite of the word formed by the second syllable's letters. For instance, "Sunoco" is a gas station, but that's six letters. "Shell" is five letters but one syllable. "Hertz" is car rental. Five letters, one syllable. "Avis"? Four letters. "FedEx"? Five letters? F-E-D-E-X? No, that's six. Wait, FedEx is two syllables: Fed-Ex. Five letters? F-E-D-E-X? That's five letters? F, E, D, E, X. Yes, five letters. FedEx. The first syllable is "Fed", the second is "Ex". Now, "Fed" as in past tense of feed, but maybe "Fed" as in federal. The opposite of "Ex" could be "current" or "present". But that doesn't seem right. Alternatively, "Ex" as in former, so the opposite would be "current" or "new". Hmm, not sure. Maybe "FedEx" isn't the answer.

Another thought: "Amtrak" is a brand on highways, but that's six letters. "Greyhound"? Too long. "Dollar" car rental? Five letters? No, six. "Thrifty" is seven. Wait, maybe "Lyft"? No, four letters. "U-Haul" is hyphenated.

Wait, let's think of opposites. The first syllable's sound is a word, and the second syllable's spelling is its opposite. So, for example, if the brand is "Inote", where "I" (pronounced "eye") and "note", but "eye" and "note" don't have opposites. Alternatively, "Hi-Lo", but that's hyphenated. Maybe "Dayton" (a tire company?), but "Day" vs "ton" – not opposites.

Alternatively, maybe "Duracell" – no, too long. "Energizer"? No. "Eveready"? No. Wait, maybe "Goodyear". That's a tire company. Good-year. "Good" and "year". Not opposites.

Wait, the key is that the first syllable phonetically is a word, and the second syllable's spelling is the opposite. So, for instance, if the first syllable is "pro", the second syllable's letters spell "con". But "procon" isn't a brand. Or "up" and "down". Maybe "Updown", but not a real brand.

Wait, the brand "Exxon" – "Ex" and "xon". If "Ex" is the opposite of "xon" spelled as "xon" – no. Maybe "Ex" (as in former) and "xon" – but "xon" isn't a word. Alternatively, "Exxon" could be broken into "Ex" and "on". Wait, "Ex" (as in former) and "on" (the opposite being "off"). But "Exxon" is E-X-X-O-N, so the second syllable is "xon", not "on". Hmm.

Another approach: think of opposites like "yes/no", "hot/cold", "fast/slow", "high/low", "in/out", "on/off", "start/stop", "day/night". Maybe the brand incorporates one of these.

"Hi-Lo" is hyphenated. "Stop & Go" – no. "Speedpass" – not five letters. "7-Eleven" – no.

Wait, "Loews" hotels? Five letters. "Lo-ews". "Lo" as in low, and "ews" – not an opposite. Maybe not.

Wait, "Lucent" – a telecom company, but that's six letters. "Verizon" – seven letters.

Wait, "Motel 6" – no, includes a number. "Super 8" – same.

Wait, "Shell" – five letters but one syllable. "Chevron" – two syllables but seven letters.

Wait, "Pepsi" – five letters, two syllables. "Pep" and "si". "Pep" as in energy, and "si" – not an opposite.

"Cola" – not a brand. "Coke" – one syllable.

Wait, "Adidas" – three syllables. "Nike" – two syllables but four letters.

Wait, "Reebok" – six letters. "Fila" – four.

Wait, maybe a car brand. "Ford" – one syllable. "Chevy" – two syllables but five letters? C-H-E-V-Y. Five letters, "Chev" and "y"? Not opposites.

Wait, "Honda" – five letters, two syllables: Hon-da. Not opposites. "Toyota" – too long. "Lexus" – two syllables but five letters. "Le-xus". Not opposites.

Wait, "Nissan" – two syllables, six letters. "Mazda" – two syllables, five letters. M-A-Z-D-A. "Maz" and "da". Doesn't fit.

Wait, "Dodge" – one syllable. "Jeep" – one.

Wait, "Subaru" – three syllables.

Hmm. Let's think of gas stations again. "Shell", "Exxon", "Mobil", "Texaco", "Chevron", "BP", "Valero", "Citgo".

Citgo: C-I-T-G-O. Two syllables: Cit-go. "Cit" (like "sit") and "go". Wait, "sit" and "go" – those are opposites in a way. If you sit, you're not moving; if you go, you're moving. So "Citgo" – "Cit" phonetically is "sit", and "go" is spelled as "go". So the first syllable sounds like "sit", which is the opposite of "go". That might be it!

Citgo is a gas station brand, seen on highways, five letters (C-I-T-G-O), two syllables: Cit-go. Phonetically, "Cit" sounds like "sit", and "go" is the opposite. So the first syllable (phonetically "sit") is the opposite of the second syllable spelled "go". That seems to fit.

Let me double-check. The brand is Citgo. Five letters, two syllables. First syllable "Cit" pronounced like "sit", which is the opposite of "go". Yes, that makes sense. I think that's the answer. </think>

The brand name is *Citgo*.

Citgo is a well-known gas station brand commonly seen along U.S. roads and highways. It has five letters and two syllables: *Cit* (pronounced like "sit") and *go*. The first syllable phonetically represents "sit," which is the opposite of the second syllable spelled as "go." This clever play on opposites fits the riddle's criteria perfectly.

rthnbgrredf · 1y ago

I'm still not convinced that this isn't a tokenizer issue.

Were you able to find a substantial number of questions that do not fall into the letter-counting or word-shuffling domain, i.e. problems that are clearly unrelated to the fundamental tokenizer issue of modern LLMs? Otherwise, I would argue that your paper simply proves that the issue still exists.

zinccat · 1y ago

I see, but there's still a lot of reasoning spent just on counting the letters. And ridiculous reasoning like:

"FedEx"? Five letters? F-E-D-E-X? No, that's six. Wait, FedEx is two syllables: Fed-Ex. Five letters? F-E-D-E-X? That's five letters? F, E, D, E, X. Yes, five letters. FedEx.

brokensegue · 1y ago · 2 in thread

Are these really reasoning challenges? Seems like they are really solved via brute force or guess and check

enum (OP) · 1y ago

Heuristic search, not exhaustive search, is an essential ingredient of reasoning. Has been true since chess. Remains true with MCTS, LLMs, etc.

brokensegue · 1y ago

What heuristics are there for some of these?

aghilmort · 1y ago · 2 in thread

really great work! are you a co-author?

enum (OP) · 1y ago

Thanks, and yes :)

aghilmort · 1y ago

excellent - will be citing in next arxiv paper - may have questions, @dotdotjames on twitter if want to connect

sega_sai · 1y ago · 1 in thread

Is it really certain that those problems and the answers were not in the training data for the tested LLMs? Presumably somebody on the internet wrote about them...

enum (OP) · 1y ago

They are scraped from the web, and discussed on Reddit. So, they are definitely in the training data. Despite that, the non-reasoning LLMs struggle to solve them.

There are, however, new problems released every week. So, we can safely assume the latest problems are decontaminated. It remains to be seen if and how performance drops on the problems released in 2025. (Not enough problems yet to tell.)

scotty79 · 1y ago · 1 in thread

If you want a problem that is fairly easy for humans but hard for LLMs, it should have a solution that requires iteratively applying the same steps a few times, perhaps conditionally. I predict that LLMs, even with chain-of-thought, will drop the ball after just a few iterations.

enum (OP) · 1y ago

Sure. That's not the goal here. These problems are hard for humans and (mostly) hard for LLMs.

zone411 · 1y ago

I have a set of independent benchmarks and most also show a difference between reasoning and non-reasoning models:

LLM Confabulation (Hallucination): https://github.com/lechmazur/confabulations/

LLM Step Game: https://github.com/lechmazur/step_game

LLM Thematic Generalization Benchmark: https://github.com/lechmazur/generalization

LLM Creative Story-Writing Benchmark: https://github.com/lechmazur/writing

Extended NYT Connections LLM Benchmark: https://github.com/lechmazur/nyt-connections/

and a couple more that I haven't updated very recently.

akomtu · 1y ago

The reasoning challenge is made of two parts:

1. Can you apply an existing model to a problem? For example: you're told how to multiply numbers and asked to multiply AHFG by VRBD in base-26 system.

2. Can you come up with a model that explains the given examples? For example: you're given 10 triples like AxB=C and asked to explain what they have in common.

Simply imitating answers won't get you very far.
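
The first kind of task (applying a known procedure) can be sketched in a few lines. This assumes the digit mapping A=0, B=1, ..., Z=25, which the comment does not specify; the function names are hypothetical.

```python
import string

# Assumption: digits are A=0, B=1, ..., Z=25 (the comment leaves this open).
ALPHABET = string.ascii_uppercase


def from_base26(s: str) -> int:
    """Convert a base-26 string of uppercase letters to an integer."""
    value = 0
    for ch in s:
        value = value * 26 + ALPHABET.index(ch)
    return value


def to_base26(n: int) -> str:
    """Convert a non-negative integer to base-26 letters (no leading 'A's)."""
    if n == 0:
        return "A"
    digits = []
    while n > 0:
        n, rem = divmod(n, 26)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def multiply_base26(a: str, b: str) -> str:
    """Multiply two base-26 numbers by round-tripping through integers."""
    return to_base26(from_base26(a) * from_base26(b))


print(multiply_base26("AHFG", "VRBD"))
```

Under this mapping a leading "A" is a leading zero, so "AHFG" is a three-digit number. A different convention (for example A=1) would change the result.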

bryan0 · 1y ago

Are LLMs not trained on NPR transcripts?

j / k navigate · click thread line to collapse

80 comments

60 comments · 14 top-level

rahidz1y ago· 13 in thread

What is so interesting to me is that the reasoning traces for these often have the correct answer, but the model fails to realize it.

Workaccount21y ago

The other day I watched Gemini reasoning go in circles for ~30 seconds trying to determine if 9.11 was greater than 9.9.

----

Re-examine tenths place comparison (correctly): In 9.9, the tenths digit is 9. In 9.11, the tenths digit is 1. Wait again! This is still incorrect. I'm misreading the digits. Let's slow down.

Re-examine the numbers digit by digit from left to right, after the decimal:

9.9: 9 in the tenths place, nothing explicitly written in the hundredths place (we can think of it as 0 in the hundredths place, so 9.90).

9.11: 1 in the tenths place, 1 in the hundredths place.

Correctly read and compare tenths digits:

9.9 - Tenths digit is 9.

9.11 - Tenths digit is 1. This is still wrong! I am consistently misreading the numbers. Let's go back to the start."

----

This shows steps 5 to 9. It went to step 35 before determining that 9.9 > 9.11. this was after it decided to convert to money, $9.11 vs $9.90.

hto2i4j2342341y ago

I wonder how much of this arithmetic confusion comes from software versioning numbers where the decimal character is actually a tuple-delimiter.

1 more reply

fzzzy1y ago

1 more reply

Validark1y ago

empath751y ago

photonthug1y ago

1 more reply

sd91y ago

I wonder if RLHF interfered with 9.11 (which could be interpreted as a date), preventing the model from considering it naturally.

Wonder if the same thing would have happened with 9.12.

What was your original prompt?

1 more reply

armcat1y ago

ANighRaisin1y ago

2 more replies

enumOP1y ago

The nature of the problems makes it relatively easy to follow along with the models' reasoning and reasoning errors. For example, on this problem (answer "New England"):

R1 fixates on answers of the form "CITY, STATE" and eventually returns some confidently wrong nonsense. It doesn't try to explore answers that don't fit the "CITY, STATE" template.

empath751y ago

o1 high got this after i gave it the hint that the first five letters are not a single word.

viraptor1y ago

dr_kiszonka1y ago

It would be fun to experiment with, e.g., positive self-talk like "you've got this", "you've trained for this," etc.

mkoubaa1y ago· 8 in thread

The term "PhD Knowledge" makes me cringe. PhD is about contribution to a field, not a standardized test score or a weird status-bearing substitute for IQ.

glial1y ago

enumOP1y ago

charlieyu11y ago

- wants PhD knowledge - pays like $100-500 per question

krisoft1y ago

In this case it is a shorthand for specialist level knowledge. Thre kind of information you would expect a person pursuing a PhD in a specific field to know but someone not doing that to not know.

That’s the kind of “PhD knowledge” which this evaluation does not require.

mkoubaa1y ago

That's a jargon and search problem at best. I'm not convinced at all.

1 more reply

levocardia1y ago

enumOP1y ago

Some still exist, but this is less and less true over time. Even when a requirement is called an "exam", it isn't what a high-schooler would recognize as one.

Many of the exams described on this page are personalized and only require mastery of a handful of papers that really matter to the individual student, and knowledge of related work:

https://www.cs.stanford.edu/phd-program-requirements-qualify...

I've been through something that sounds similar to these, and I would not have called it an exam.

buescher1y ago

XCabbage1y ago· 5 in thread

(If the questions were multiple-choice, eliminating the challenge of bringing candidate answers to mind that is the main challenge for a human, then I'd agree it was a "reasoning" test.)

jkhdigital1y ago

tkgally1y ago

szundi1y ago

Your base model is good then

enumOP1y ago

It's definitely U.S. centric.

XCabbage1y ago

It's not just that it requires recall. It's that it requires no non-trivial thought beyond recall.

1 more reply

windsignaling1y ago· 4 in thread

I'm not a fan of these "gotchas" because they don't test for what we really care about.

Like counting the number of R's in strawberry, many of these are character-counting or character manipulation problems which tokenization is not well-suited for.

I'm sure an engineer could come up with a clever way to train for this, but that seems like optimizing for the wrong thing.

IMO these questions go in the wrong direction. Character permutation is a problem for "Software 1.0", not LLMs. Just as you wouldn't use an LLM to multiply 2 large numbers, you'd use a calculator.
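To make the "Software 1.0" point concrete, here is a minimal Python sketch. The helper name `count_letter` is purely illustrative and not taken from the paper; it only shows that character counting and big-number multiplication are trivial for conventional code.

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


# Character counting operates on characters directly, so tokenization never enters the picture.
print(count_letter("strawberry", "r"))  # 3

# Python integers have arbitrary precision, so large products are exact.
print(123456789123456789 * 987654321987654321)
```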

michaelt1y ago

The problem is some of the "gotchas" seem rather important in nontrivial applications.

Imagine a model that isn't sure if 9.11 is greater than 9.9 - which is difficult to reason about, because tokens.
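For contrast, the comparison is mechanical once the digits are aligned by place value, which is the same padding step (9.9 read as 9.90) that the Gemini trace quoted earlier in the thread kept circling around. A minimal sketch, assuming well-formed, non-negative decimal strings; `compare_decimal_strings` is a hypothetical helper name:

```python
def compare_decimal_strings(a: str, b: str) -> int:
    """Return -1, 0, or 1 depending on whether a < b, a == b, or a > b.

    Assumes non-negative decimal strings such as "9.11" or "9.9".
    """
    a_int, _, a_frac = a.partition(".")
    b_int, _, b_frac = b.partition(".")

    # Pad the shorter fractional part with zeros so both have the same width.
    width = max(len(a_frac), len(b_frac))
    a_frac = a_frac.ljust(width, "0")
    b_frac = b_frac.ljust(width, "0")

    # Equal-width digit strings sort the same way as the numbers they spell.
    left = (int(a_int), a_frac)
    right = (int(b_int), b_frac)
    return (left > right) - (left < right)


print(compare_decimal_strings("9.11", "9.9"))  # -1, i.e. 9.11 < 9.9
```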

aprilthird20211y ago

In fact, your final point, that these are tasks software should do rather than LLMs, is only proven to more people and made clearer by the prominence of these "gotchas".

enumOP1y ago

The problems are not important, but they illustrate failures that are. For example:

- The paper has an example where the model reasons "I'm frustrated" and then produces an answer that it "knows is wrong". You wouldn't know it if you didn't examine the reasoning tokens.

- There are two examples where R1 often gets stuck "thinking forever".

If these failures happen on these questions, where else can they happen? We'll start to find out soon enough.

Workaccount21y ago

Someone needs to make a data transformation benchmark.

"Here are a variety of personal documents about John Doe. Fill out the McDonalds job application with information retrieved from the document set."

lokimedes1y ago· 4 in thread

I know this is a rant, sorry, just so tired of the stupidity.

DiscourseFan1y ago

klipt1y ago

> which is probably why many of them don't end up with tenure

There are far fewer tenure track positions than PhD's, so most of them not ending up with tenure is tautologically guaranteed.

mlinhares1y ago

There’s no multibillion dollar business if they can’t say everyone will be replaced by their models.

atrettel1y ago

This is a bit of a pet peeve of mine too. I surely learned a lot of "knowledge" during my PhD, but really what makes me an expert in my subject is less my ability to rattle off facts than

1. my ability to reason and understand the subject after a lot of experience working in it, and

2. the skills that I picked up during my PhD and afterwards.

enumOP1y ago· 3 in thread

Results and dataset explorer here: https://huggingface.co/spaces/nuprl/verbal-reasoning-challen...

mh-1y ago

For ID=3, it shows o1 getting it wrong, but it seems to have succeeded? It did add a space between Tinker and bell, but that is the canonical way of spelling the character apparently.

(That one just caught my attention because I was curious which challenge o1-mini got correct that o1 did not.)

enumOP1y ago

Thanks, fixed. (Spaces rebuilding.) We have manually combed labelled-wrong answers and tweaked the predicates that check correctness. Sorry we missed this one.
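The project's actual predicates are not shown in this thread. As a purely hypothetical sketch of the idea, a checker that tolerates the "Tinker Bell" / "Tinkerbell" spacing difference could normalize both strings before comparing them. The names `normalize` and `matches` are assumptions made for illustration.

```python
import re


def normalize(answer: str) -> str:
    """Lowercase the answer and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", answer.lower())


def matches(candidate: str, expected: str) -> bool:
    """Hypothetical correctness predicate: equal after normalization."""
    return normalize(candidate) == normalize(expected)


print(matches("Tinker Bell", "Tinkerbell"))  # True
```

A predicate this loose also accepts punctuation and case differences, which may or may not be what a given puzzle wants.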

ks20481y ago

zinccat1y ago· 3 in thread

My feeling is that a lot of the challenge could come from the tokenizer used by the model, similar to the "r in strawberry" problem.

enumOP1y ago

Alternatively, maybe "Duracell" – no, too long. "Energizer"? No. "Eveready"? No. Wait, maybe "Goodyear". That's a tire company. Good-year. "Good" and "year". Not opposites.

Another approach: think of opposites like "yes/no", "hot/cold", "fast/slow", "high/low", "in/out", "on/off", "start/stop", "day/night". Maybe the brand incorporates one of these.

"Hi-Lo" is hyphenated. "Stop & Go" – no. "Speedpass" – not five letters. "7-Eleven" – no.

Wait, "Loews" hotels? Five letters. "Lo-ews". "Lo" as in low, and "ews" – not an opposite. Maybe not.

Wait, "Lucent" – a telecom company, but that's six letters. "Verizon" – seven letters.

Wait, "Motel 6" – no, includes a number. "Super 8" – same.

Wait, "Shell" – five letters but one syllable. "Chevron" – two syllables but seven letters.

Wait, "Pepsi" – five letters, two syllables. "Pep" and "si". "Pep" as in energy, and "si" – not an opposite.

"Cola" – not a brand. "Coke" – one syllable.

Wait, "Adidas" – three syllables. "Nike" – two syllables but four letters.

Wait, "Reebok" – six letters. "Fila" – four.

Wait, maybe a car brand. "Ford" – one syllable. "Chevy" – two syllables but five letters? C-H-E-V-Y. Five letters, "Chev" and "y"? Not opposites.

Wait, "Honda" – five letters, two syllables: Hon-da. Not opposites. "Toyota" – too long. "Lexus" – two syllables but five letters. "Le-xus". Not opposites.

Wait, "Nissan" – two syllables, six letters. "Mazda" – two syllables, five letters. M-A-Z-D-A. "Maz" and "da". Doesn't fit.

Wait, "Dodge" – one syllable. "Jeep" – one.

Wait, "Subaru" – three syllables.

Hmm. Let's think of gas stations again. "Shell", "Exxon", "Mobil", "Texaco", "Chevron", "BP", "Valero", "Citgo".

The brand name is *Citgo*.

rthnbgrredf1y ago

I'm still not convinced that this isn't a tokenizer issue.

zinccat1y ago

I see, but there's still a lot of reasoning spent just on counting the letters, and some of it is ridiculous, like:

FedEx"? Five letters? F-E-D-E-X? No, that's six. Wait, FedEx is two syllables: Fed-Ex. Five letters? F-E-D-E-X? That's five letters? F, E, D, E, X. Yes, five letters. FedEx.

brokensegue1y ago· 2 in thread

Are these really reasoning challenges? Seems like they are really solved via brute force or guess and check

enumOP1y ago

Heuristic search, not exhaustive search, is an essential ingredient of reasoning. Has been true since chess. Remains true with MCTS, LLMs, etc.

brokensegue1y ago

What heuristics are there for some of these?

aghilmort1y ago· 2 in thread

really great work! are you a co-author?

enumOP1y ago

Thanks, and yes :)

aghilmort1y ago

excellent - will be citing in next arxiv paper - may have questions, @dotdotjames on twitter if want to connect

sega_sai1y ago· 1 in thread

Is it really certain that those problems and their answers were not in the training data for the tested LLMs? Presumably somebody on the internet wrote about them...

enumOP1y ago

They are scraped from the web, and discussed on Reddit. So, they are definitely in the training data. Despite that, the non-reasoning LLMs struggle to solve them.

scotty791y ago· 1 in thread

enumOP1y ago

Sure. That's not the goal here. These problems are hard for humans and (mostly) hard for LLMs.

zone4111y ago

I have a set of independent benchmarks and most also show a difference between reasoning and non-reasoning models:

LLM Confabulation (Hallucination): https://github.com/lechmazur/confabulations/

LLM Step Game: https://github.com/lechmazur/step_game

LLM Thematic Generalization Benchmark: https://github.com/lechmazur/generalization

LLM Creative Story-Writing Benchmark: https://github.com/lechmazur/writing

Extended NYT Connections LLM Benchmark: https://github.com/lechmazur/nyt-connections/

and a couple more that I haven't updated very recently.

akomtu1y ago

The reasoning challenge is made of two parts:

1. Can you apply an existing model to a problem? For example: you're told how to multiply numbers and asked to multiply AHFG by VRBD in base-26 system.

2. Can you come up with a model that explains the given examples? For example: you're given 10 triples like AxB=C and asked to explain what they have in common.

Simply imitating answers won't get you very far.
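The first kind of task (applying a known procedure) can be sketched in a few lines of Python. The comment does not say how letters map to digits, so this sketch assumes A=0, B=1, ..., Z=25; the function names are illustrative only.

```python
import string

# Assumption: digit values are A=0, B=1, ..., Z=25 (the comment does not specify a mapping).
ALPHABET = string.ascii_uppercase


def from_base26(text: str) -> int:
    """Convert a base-26 string of uppercase letters to an integer."""
    value = 0
    for ch in text:
        value = value * 26 + ALPHABET.index(ch)
    return value


def to_base26(number: int) -> str:
    """Convert a non-negative integer to a base-26 string of uppercase letters."""
    if number == 0:
        return "A"
    digits = []
    while number > 0:
        number, remainder = divmod(number, 26)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def multiply_base26(a: str, b: str) -> str:
    """Multiply two base-26 numbers given as letter strings."""
    return to_base26(from_base26(a) * from_base26(b))


print(multiply_base26("AHFG", "VRBD"))
```

Under this mapping the leading A in AHFG is a leading zero, so a different convention (for example A=1) would give a different product.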

bryan01y ago

Are LLMs not trained on NPR transcripts?
