Problem 3 ("Dry Eye"), R1: "Wait, maybe "cubitus valgus" – no, too long. Wait, three letters each. Let me think again. Maybe "hay fever" is two words but not three letters each. Maybe "dry eye"? "Dry" and "eye" – both three letters. "Dry eye" is a condition. Do they rhyme? "Dry" (d-rye) and "eye" (i) – no, they don't rhyme. "Eye" is pronounced like "i", while "dry" is "d-rye". Not the same ending."
Problem 8 ("Foot nose"), R1: "Wait, if the seventh letter is changed to next letter, maybe the original word is "footnot" (but that's not a word). Alternatively, maybe "foot" + "note", but "note" isn't a body part."
----
"Realize a direct comparison of tenths is not enough: Wait, this is not right. In 9.9, the tenths digit is 9. In 9.11, the tenths digit is 1. This initial comparison is incorrectly stated. Let's re-examine.
Re-examine tenths place comparison (correctly): In 9.9, the tenths digit is 9. In 9.11, the tenths digit is 1. Wait again! This is still incorrect. I'm misreading the digits. Let's slow down.
Re-examine the numbers digit by digit from left to right, after the decimal:
9.9: 9 in the tenths place, nothing explicitly written in the hundredths place (we can think of it as 0 in the hundredths place, so 9.90).
9.11: 1 in the tenths place, 1 in the hundredths place.
Compare tenths places: In 9.9 (or 9.90), the tenths digit is 9. In 9.11, the tenths digit is 1. Wait... something is still wrong with my comparison logic. I'm getting confused. Let's re-read the numbers again.
Correctly read and compare tenths digits:
9.9 - Tenths digit is 9.
9.11 - Tenths digit is 1. This is still wrong! I am consistently misreading the numbers. Let's go back to the start."
----
This shows steps 5 to 9. It went to step 35 before determining that 9.9 > 9.11. This was after it decided to convert to money, $9.11 vs $9.90.
Wonder if the same thing would have happened with 9.12.
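For reference, the comparison is mechanical once both fractional parts are padded to the same width, which is effectively what the money conversion does. A minimal Python sketch (the helper name `pad_fraction` is made up for illustration, not taken from the model output):

```python
from decimal import Decimal


def pad_fraction(a: str, b: str) -> tuple[str, str]:
    """Pad the fractional parts with zeros so both have the same width."""
    int_a, _, frac_a = a.partition(".")
    int_b, _, frac_b = b.partition(".")
    width = max(len(frac_a), len(frac_b))
    return (
        f"{int_a}.{frac_a.ljust(width, '0')}",
        f"{int_b}.{frac_b.ljust(width, '0')}",
    )


print(pad_fraction("9.9", "9.11"))       # ('9.90', '9.11')
print(Decimal("9.9") > Decimal("9.11"))  # True
print(Decimal("9.9") > Decimal("9.12"))  # True
```

`Decimal` is used instead of `float` so the strings are compared as exact decimal values.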
What was your original prompt?
> Think of a place in America. Two words, 10 letters altogether. The first five letters read the same forward and backward. The last five letters spell something found in the body. What place is this?
R1 fixates on answers of the form "CITY, STATE" and eventually returns some confidently wrong nonsense. It doesn't try to explore answers that don't fit the "CITY, STATE" template.
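Verifying a candidate against the puzzle's constraints is easy to write down. A rough Python sketch, assuming a small hand-written `BODY_PARTS` set (illustrative only, nowhere near complete) and a made-up function name `fits_puzzle`:

```python
# Assumption: a tiny illustrative list; a real check would need a fuller vocabulary.
BODY_PARTS = {"gland", "heart", "liver", "spine", "wrist", "elbow", "ankle"}


def fits_puzzle(place: str) -> bool:
    """Two words, 10 letters, first five palindromic, last five a body part."""
    words = place.split()
    letters = "".join(ch for ch in place.lower() if ch.isalpha())
    if len(words) != 2 or len(letters) != 10:
        return False
    head, tail = letters[:5], letters[5:]
    return head == head[::-1] and tail in BODY_PARTS


print(fits_puzzle("New England"))   # True
print(fits_puzzle("Reno, Nevada"))  # False
print(fits_puzzle("Miami Beach"))   # False
```

"New England" is one candidate outside the "CITY, STATE" template that passes this check ("newen" reads the same both ways, and "gland" is in the list).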
For example, the “Humanity's Last Exam” AI evaluation dataset has this question: “Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.”
That’s the kind of “PhD knowledge” which this evaluation does not require.
Many of the exams described on this page are personalized and only require mastery of a handful of papers that really matter to the individual student, and knowledge of related work:
https://www.cs.stanford.edu/phd-program-requirements-qualify...
I've been through something that sounds similar to these, and I would not have called it an exam.
As a human, you'd expect to fail either because you didn't know a category member (e.g. as a non-American I have no idea WTF "Citgo" is; I could never get the answer to the first question because I have never seen that name before in my life) or because you weren't able to bring it to mind; the mental act of looping over all members of a category is quite challenging for a human.
Admittedly this is something an AI system could in principle be REALLY good at, and it's interesting to test and see that current ones are not! But it seems weird to me to call what's being tested "reasoning" when it's so heavily focused on memory recall (and evaluating whether a candidate answer works or not is trivial once you've brought it to mind and doesn't really require any intelligent thought).
(If the questions were multiple-choice, eliminating the challenge of bringing candidate answers to mind that is the main challenge for a human, then I'd agree it was a "reasoning" test.)
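To make the "loop over the category, then check" structure concrete, here is a toy Python sketch for the brand-name question. Everything in it is an assumption for illustration: the `SOUNDS_LIKE` and `OPPOSITES` tables are hand-written stand-ins for a pronunciation dictionary and an antonym list, the candidate list is arbitrary, and it tries every split point rather than real syllable boundaries.

```python
# Hypothetical lookup tables, written by hand for this example only.
SOUNDS_LIKE = {"cit": "sit", "ex": "ex", "pep": "pep"}
OPPOSITES = {("sit", "go"), ("up", "down"), ("on", "off")}


def solve(brands: list[str]) -> list[str]:
    """Return five-letter brands whose first part sounds like the opposite of the rest."""
    hits = []
    for brand in brands:
        name = brand.lower()
        if len(name) != 5:
            continue
        for cut in range(1, 5):
            first, second = name[:cut], name[cut:]
            spoken = SOUNDS_LIKE.get(first)
            if spoken and ((spoken, second) in OPPOSITES or (second, spoken) in OPPOSITES):
                hits.append(brand)
                break  # one matching split is enough
    return hits


print(solve(["Exxon", "Shell", "Mobil", "FedEx", "Pepsi", "Honda", "Citgo"]))  # ['Citgo']
```

The check is a few lines; the hard part is having "Citgo" in the candidate list at all.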
But I do think this is reasoning. It requires recall, but so does anything other than a pure logic puzzle. For example, on a competition math problem or a programming problem, no person or LLM is inventing well-known lemmas and algorithms from first principles.
Like counting the number of R's in strawberry, many of these are character-counting or character manipulation problems which tokenization is not well-suited for.
I'm sure an engineer could come up with a clever way to train for this, but that seems like optimizing for the wrong thing.
IMO these questions go in the wrong direction. Character permutation is a problem for "Software 1.0", not LLMs. You wouldn't use an LLM to multiply two large numbers; you'd use a calculator.
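As a point of comparison, the "Software 1.0" version of these character-level tasks is a couple of lines of Python. This is only a sketch; the helper names `count_letter` and `letter_count` are made up here:

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


def letter_count(name: str) -> int:
    """Count alphabetic characters, ignoring spaces, digits, and punctuation."""
    return sum(ch.isalpha() for ch in name)


print(count_letter("strawberry", "r"))  # 3
print(letter_count("FedEx"))            # 5
print(letter_count("U-Haul"))           # 5
```

Python integers are arbitrary precision, so multiplying two large numbers exactly is just `a * b`.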
Imagine a model that isn't sure if 9.11 is greater than 9.9 - which is difficult to reason about, because tokens.
Could such a model coach kids in math? Could it proofread a paper, or sense-check a business plan? Could it summarise a long document about carbon emissions? Could it generate a GUI? Could it spot mistakes in an OCRed document? Spot an off-by-one error or divide-by-zero in computer code?
In fact, your final statement, that these are tasks software should do rather than LLMs, is only proven to more people and made clearer by the prominence of these "gotchas".
- The paper has an example where the model reasons "I'm frustrated" and then produces an answer that it "knows is wrong". You wouldn't know it if you didn't examine the reasoning tokens.
- There are two examples where R1 often gets stuck "thinking forever".
If these failures happen on these questions, where else can they happen? We'll start to find out soon enough.
"Here are a variety of personal documents about John Doe. Fill out the McDonalds job application with information retrieved from the document set."
I know this is a rant, sorry, just so tired of the stupidity.
There are far fewer tenure-track positions than PhDs, so most of them not ending up with tenure is tautologically guaranteed.
1. my ability to reason and understand the subject after a lot of experience working in it, and
2. the skills that I picked up during my PhD and afterwards.
I can really only "know" so much, but the skills that I developed help me get the right information without having to "know" it directly --- either by searching the literature, running a calculation or simulation, or performing an experiment. There are a lot of things that I have difficulty remembering but I only "know" them because I remember how to derive them and where they come from, etc. That's one of the reasons that I don't like the assertion that this model has ingested all of this knowledge and it is now an expert, because expertise goes beyond having knowledge. It's about developing a deeper understanding of a subject that only comes from actively engaging with it for a long time.
(That one just caught my attention because I was curious what challenge o1-mini got correct that o1 did not.)
Take a familiar brand name, seen along roads and highways in the United States. It has five letters, two syllables. The first syllable, phonetically, is a word that is the opposite of the word spelled by the second syllable. What brand name is it?
<think> Okay, let me try to figure out this riddle. The question is about a brand name that's familiar on roads and highways in the US. It has five letters and two syllables. The key part is that the first syllable phonetically is the opposite of the second syllable spelled out. Hmm.
First, I need to think of common brands seen on highways. Gas stations, fast food, hotels, maybe car brands. Five letters, two syllables. Let's list some possibilities: Exxon, Shell, Chevron, Mobil, BP (but BP is two letters, so probably not). Maybe food places like Subway, but that's six letters. McDonald's is too long. Burger King? No. Maybe hotels: Motel 6? No, that's a number. Holiday Inn? Too long. Maybe Best Western? No, too long.
Wait, five letters. Let's think of two-syllable brands. Exxon (E-xxon), but Exxon is five letters. First syllable is "Ex", second is "xon". But what's the opposite of "xon"? That doesn't make sense. Maybe the first syllable is a word, and the second syllable when spelled is the opposite. So, for example, if the first syllable is "up", the second syllable's letters spell "down". But "updown" isn't a brand. Wait, the first syllable phonetically is a word, and the second syllable's spelling is the opposite. So when you say the first syllable, it's a word, and the letters of the second syllable spell its opposite.
So, for example, if the brand is "Updn" (made up), where "up" is the first syllable, and "dn" spells "down". But that's not a real brand. Let's think of real brands. Maybe "Comcast"? No, too long. "Verizon"? Also longer. "Sprint"? Five letters. Sprint. S-print. But "S" doesn't mean anything opposite of "print". Hmm. Maybe "Exxon" again. "Ex" as in former, and the opposite would be "xon" spelled as... no.
Wait, maybe the first syllable is a word that's the opposite of the word formed by the second syllable's letters. For instance, "Sunoco" is a gas station, but that's six letters. "Shell" is five letters but one syllable. "Hertz" is car rental. Five letters, one syllable. "Avis"? Four letters. "FedEx"? Five letters? F-E-D-E-X? No, that's six. Wait, FedEx is two syllables: Fed-Ex. Five letters? F-E-D-E-X? That's five letters? F, E, D, E, X. Yes, five letters. FedEx. The first syllable is "Fed", the second is "Ex". Now, "Fed" as in past tense of feed, but maybe "Fed" as in federal. The opposite of "Ex" could be "current" or "present". But that doesn't seem right. Alternatively, "Ex" as in former, so the opposite would be "current" or "new". Hmm, not sure. Maybe "FedEx" isn't the answer.
Another thought: "Amtrak" is a brand on highways, but that's six letters. "Greyhound"? Too long. "Dollar" car rental? Five letters? No, six. "Thrifty" is seven. Wait, maybe "Lyft"? No, four letters. "U-Haul" is hyphenated.
Wait, let's think of opposites. The first syllable's sound is a word, and the second syllable's spelling is its opposite. So, for example, if the brand is "Inote", where "I" (pronounced "eye") and "note", but "eye" and "note" don't have opposites. Alternatively, "Hi-Lo", but that's hyphenated. Maybe "Dayton" (a tire company?), but "Day" vs "ton" – not opposites.
Alternatively, maybe "Duracell" – no, too long. "Energizer"? No. "Eveready"? No. Wait, maybe "Goodyear". That's a tire company. Good-year. "Good" and "year". Not opposites.
Wait, the key is that the first syllable phonetically is a word, and the second syllable's spelling is the opposite. So, for instance, if the first syllable is "pro", the second syllable's letters spell "con". But "procon" isn't a brand. Or "up" and "down". Maybe "Updown", but not a real brand.
Wait, the brand "Exxon" – "Ex" and "xon". If "Ex" is the opposite of "xon" spelled as "xon" – no. Maybe "Ex" (as in former) and "xon" – but "xon" isn't a word. Alternatively, "Exxon" could be broken into "Ex" and "on". Wait, "Ex" (as in former) and "on" (the opposite being "off"). But "Exxon" is E-X-X-O-N, so the second syllable is "xon", not "on". Hmm.
Another approach: think of opposites like "yes/no", "hot/cold", "fast/slow", "high/low", "in/out", "on/off", "start/stop", "day/night". Maybe the brand incorporates one of these.
"Hi-Lo" is hyphenated. "Stop & Go" – no. "Speedpass" – not five letters. "7-Eleven" – no.
Wait, "Loews" hotels? Five letters. "Lo-ews". "Lo" as in low, and "ews" – not an opposite. Maybe not.
Wait, "Lucent" – a telecom company, but that's six letters. "Verizon" – seven letters.
Wait, "Motel 6" – no, includes a number. "Super 8" – same.
Wait, "Shell" – five letters but one syllable. "Chevron" – two syllables but seven letters.
Wait, "Pepsi" – five letters, two syllables. "Pep" and "si". "Pep" as in energy, and "si" – not an opposite.
"Cola" – not a brand. "Coke" – one syllable.
Wait, "Adidas" – three syllables. "Nike" – two syllables but four letters.
Wait, "Reebok" – six letters. "Fila" – four.
Wait, maybe a car brand. "Ford" – one syllable. "Chevy" – two syllables but five letters? C-H-E-V-Y. Five letters, "Chev" and "y"? Not opposites.
Wait, "Honda" – five letters, two syllables: Hon-da. Not opposites. "Toyota" – too long. "Lexus" – two syllables but five letters. "Le-xus". Not opposites.
Wait, "Nissan" – two syllables, six letters. "Mazda" – two syllables, five letters. M-A-Z-D-A. "Maz" and "da". Doesn't fit.
Wait, "Dodge" – one syllable. "Jeep" – one.
Wait, "Subaru" – three syllables.
Hmm. Let's think of gas stations again. "Shell", "Exxon", "Mobil", "Texaco", "Chevron", "BP", "Valero", "Citgo".
Citgo: C-I-T-G-O. Two syllables: Cit-go. "Cit" (like "sit") and "go". Wait, "sit" and "go" – those are opposites in a way. If you sit, you're not moving; if you go, you're moving. So "Citgo" – "Cit" phonetically is "sit", and "go" is spelled as "go". So the first syllable sounds like "sit", which is the opposite of "go". That might be it!
Citgo is a gas station brand, seen on highways, five letters (C-I-T-G-O), two syllables: Cit-go. Phonetically, "Cit" sounds like "sit", and "go" is the opposite. So the first syllable (phonetically "sit") is the opposite of the second syllable spelled "go". That seems to fit.
Let me double-check. The brand is Citgo. Five letters, two syllables. First syllable "Cit" pronounced like "sit", which is the opposite of "go". Yes, that makes sense. I think that's the answer. </think>
The brand name is *Citgo*.
Citgo is a well-known gas station brand commonly seen along U.S. roads and highways. It has five letters and two syllables: *Cit* (pronounced like "sit") and *go*. The first syllable phonetically represents "sit," which is the opposite of the second syllable spelled as "go." This clever play on opposites fits the riddle's criteria perfectly.
Were you able to find a substantial number of questions that do not fall into the letter-counting or word-shuffling domain, that is, problems that are clearly unrelated to the fundamental tokenizer issue of modern LLMs? Otherwise, I would argue that your paper simply proves that the issue still exists.
"FedEx"? Five letters? F-E-D-E-X? No, that's six. Wait, FedEx is two syllables: Fed-Ex. Five letters? F-E-D-E-X? That's five letters? F, E, D, E, X. Yes, five letters. FedEx.
There are, however, new problems released every week, so we can safely assume the latest problems are decontaminated. It remains to be seen if and how performance drops on the problems released in 2025. (Not enough problems yet to tell.)
LLM Confabulation (Hallucination): https://github.com/lechmazur/confabulations/
LLM Step Game: https://github.com/lechmazur/step_game
LLM Thematic Generalization Benchmark: https://github.com/lechmazur/generalization
LLM Creative Story-Writing Benchmark: https://github.com/lechmazur/writing
Extended NYT Connections LLM Benchmark: https://github.com/lechmazur/nyt-connections/
and a couple more that I haven't updated very recently.
1. Can you apply an existing model to a problem? For example: you're told how to multiply numbers and asked to multiply AHFG by VRBD in base-26 system.
2. Can you come up with a model that explains the given examples? For example: you're given 10 triples like AxB=C and asked to explain what they have in common.
Simply imitating answers won't get you very far.
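For the first kind of task, "applying an existing model" looks roughly like the Python sketch below. The digit assignment (A=0, B=1, ..., Z=25) is my assumption, since the example above does not specify one, and the function names are made up for illustration:

```python
import string

ALPHABET = string.ascii_uppercase  # assumption: A=0, B=1, ..., Z=25


def from_base26(word: str) -> int:
    """Interpret a string of letters as a base-26 number."""
    value = 0
    for ch in word.upper():
        value = value * 26 + ALPHABET.index(ch)
    return value


def to_base26(value: int) -> str:
    """Render a non-negative integer as base-26 letters, without leading zeros."""
    if value == 0:
        return "A"
    digits = []
    while value > 0:
        value, rem = divmod(value, 26)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def multiply_base26(x: str, y: str) -> str:
    """Multiply two base-26 numbers given as letter strings."""
    return to_base26(from_base26(x) * from_base26(y))


print(multiply_base26("AHFG", "VRBD"))
```

Under this assignment the leading A in AHFG is a leading zero, so it contributes nothing to the value.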