Understanding the Limitations of Mathematical Reasoning in LLMs

(arxiv.org)

282 points · hnhn34 · 1y ago · 266 comments

144 comments · 31 top-level

parsimo2010 · 1y ago · 32 in thread

I won't take a strong stance on whether or not LLMs actually do reasoning, but I will say that this decrease in performance is similar to what I see in college freshmen (I'm currently teaching a calculus course in which almost half of the students took AP calc in high school). They perform well on simple questions. Requiring students to chain multiple steps together, even simple steps, results in decreased accuracy and higher variance (I have no data on whether this decrease is linear or not, as the paper assumes that the decrease should be linear with the number of steps). We see similar results with adding unrelated statements into a problem- many students are trained to make sure to use all given information in solving a problem- if you leave out something that the instructor gives you, then you probably forgot to do something important.
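As a rough illustration of why accuracy on chained steps need not fall linearly, here is a minimal sketch that assumes every step succeeds independently with the same probability. The independence assumption, the 95% per-step figure, and the name `chain_accuracy` are illustrative only; none of them comes from the comment or the paper.

```python
def chain_accuracy(per_step: float, steps: int) -> float:
    """Probability that every one of `steps` independent steps succeeds."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


if __name__ == "__main__":
    # Assumed per-step success rate of 95%, purely for illustration.
    for n in (1, 2, 5, 10):
        print(f"{n:2d} steps: {chain_accuracy(0.95, n):.3f}")
```

Under this assumption the decline is geometric rather than linear, which is one reason a linear expectation may not fit either students or models.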

So while I don't take a stance on what an LLM does should be considered reasoning, I do think that SOTA LLMs like GPT-4o perform about as good as high school graduates in America with average intelligence. In other words, average Americans exhibit similar limitations on their reasoning as good LLMs. Which on the one hand is a little disappointing to me in terms of the human performance but is kind of good news for LLMs- they aren't doing graduate-level research but they are already capable of helping a large portion of the population.

ojosilva · 1y ago

An LLM gets things right, when it does, because of the sheer mass of information ingested during training; it can use probabilities to extract a right answer from deep in the model.

Humans on the other hand have developed a more elaborate scheme to process, or reason, data without having to read through 1 billion math problems and stack overflow answers. We listen to some explanations, a YT video, a few exercises and we're ready to go.

The fact that we may get similar grades (at, e.g., high school math) is just a spot coincidence of where both "species" (AI x Human) are right now at succeeding. But if we look closer at failure, we'll see that we fail very differently. AI failure right now looks, to us humans, very nonsensical.

heresie-dabord · 1y ago

> Humans on the other hand have developed a more elaborate scheme to process, or reason [ ... ] We listen to some explanations, a YT video, a few exercises

Frequent repetition in the sociological context has been the learning technique for our species. To paraphrase Feynman, learning is transferring.

ben_w · 1y ago

While I'd agree human failures are different from AI failures, human failures are necessarily also nonsensical. Familiar, human, but nonsensical — consider how often a human disagreeing with another will use the phrase "that's just common sense!"

I think the larger models are consuming on the order of 100k times as much as we do, and while they have a much broader range of knowledge, it's not 100k times as much breadth.

pishpash · 1y ago

Nah, human failures look equally nonsensical. You're just more attuned to use their body language or peer judgement to augment your reception. Really psychotic humans can bypass this check.

wkirby · 1y ago

> I do think that SOTA LLMs like GPT-4o perform about as good as high school graduates in America with average intelligence.

This might be true in a strict sense, but I think it's really, really important to consider the uses of LLMs vs a high-school graduate. LLMs are confidently wrong (and confidently correct) with the exact same measure, and in many ways they are presented to users as unimpeachable.

If I ask an average person to do a medium-complex logic problem, my human brain discounts their answer because I've been socialized to believe that humans are bad at logic. I will take any answer I'm given with usually appropriate skepticism.

LLMs, on the other hand, are on the computer: an interface I've been socialized to believe is always correct on matters of math and logic. That's what it is, a logic machine. Second guessing the computer on matters of logic and arithmetic almost always results in me realizing my puny human mind has done something wrong.

To me, this directly contradicts your conclusion: LLMs are mostly only capable of misleading large portions of the population.

pishpash · 1y ago

Would be good to put equivalent grades on LLMs then. Instead of GPT-4o, it's GPT-11th grade.

Eisenstein · 1y ago

This is not inherent in the LLM though. Society will adjust to it after learning some very predictable (and predicted) lessons, just like it always does.

hintymad · 1y ago

> I do think that SOTA LLMs like GPT-4o perform about as good as high school graduates in America with average intelligence.

Is this because the questions used in high school exams in the US are too simple, or do they have too similar patterns in the training data? I tried really simple but novel questions that required true understanding of the underlying math concepts, and the results were consistently bad. I also tried questions at the level of entrance exams of high school in China, and the results were equally bad. It was quite clear that the LLM didn't understand math. It could match some patterns, but such pattern matching could be useful only to skilled students.

MVissers · 1y ago

Which model? The field moves so fast it’s hard to validate statements like this without that info.

O1-preview?

ActorNightly · 1y ago

> I won't take a strong stance on whether or not LLMs actually do reasoning,

I don't understand why people are still confused about this. When these models fundamentally have a randomness parameter to make them appear like they are actually thinking instead of deterministically outputting information, it should be clear that there is no reasoning going on.

atleastoptimal · 1y ago

I don't see how having a randomness parameter implies that, without it, an LLM is merely outputting information, like it's just looking up some answer in a dictionary. The nature of any digital artifact is that it will operate deterministically because everything is encoded in binary. However, this does not preclude reasoning, in the same way that a perfect atom-for-atom digital mapping of a human brain acting deterministically with respect to its inputs is not thereby precluded from reasoning. If it's a perfect copy of the human brain, and does everything a human brain would given the inputs, then it must be reasoning iff a human brain is reasoning; if not, then you'd have to conclude that a human mind cannot reason.

Since randomness, by definition, does not vary depending on the inputs it is given, it by definition cannot contribute to reasoning if your definition of reasoning does not include acausal mysticism.

growthwtf · 1y ago

I don't see how the latter follows from the former.

Here's how I think about it: the fact that it can interpret the same words differently in different contexts alone shows that even on a temperature of 0 (i.e., lowest randomness possible) there could be something that possibly resembles reasoning happening.

It might be a mimicry of reasoning, but I don't think that having adjustable parameters on how random they are makes it any less of one.

I also don't see how that idea would fit in with the o1 models, which explicitly have "reasoning" tokens. Now, I'm not terribly impressed with their performance relative to how much extra computation they need to do, but the fact they have chains-of-thought that humans could reasonably inspect and interpret, and that the chains of thought do literally take extra time and compute to run, certainly points at the process being something possibly analogous to reasoning.

In this same vein, up until recently I was personally very much in the camp of calling them "LLMs" and generally still do, but the way they are now being used as general purpose sequence-to-sequence prediction models across all sorts of input and output types tends to push me more towards the "foundation models" terminology camp, since pigeonholing them into just language tasks doesn't seem accurate anymore. o1 was the turning point for me on this personally, since it is explicitly predicting and being optimized for correctness in the "reasoning tokens" (in scare quotes again since that's what OpenAI calls it).

All that said, I personally think that calling what they do reasoning, and meaning it in the exact same way as how humans reason, is anthropomorphizing the models in a way that's not really useful. They clearly operate in ways that are quite different from humans in many ways. Sometimes that might imitate human reasoning, other times it doesn't.

But the fact they have that randomness parameter seems to me to be totally unrelated to any of the above thoughts or merits about the models having reasoning abilities.

int_19h · 1y ago

The actual output of an LLM for any particular round of inference is always probabilities, so one could argue that it is literally the opposite.

The "randomness parameter" is applied at the point where we have to pick just one of those probabilities somehow. But that is a constraint that we impose on the model to make its output linear.
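A minimal sketch of the point above, assuming the common softmax-with-temperature scheme: the model's raw scores become a probability distribution, and the "randomness parameter" only matters at the moment one token has to be picked. The three-entry `logits` list and the function names are made up for illustration and are not any particular library's API.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_token(logits, temperature=1.0, rng=random):
    """Greedy argmax at temperature 0, otherwise sample from the softmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.1]  # made-up scores for a 3-token vocabulary
    print(softmax(logits))                      # the full distribution
    print(pick_token(logits, temperature=0))    # deterministic choice
    print(pick_token(logits, temperature=1.0))  # sampled choice
```

At temperature 0 the same input always yields the same token, so the sampling step can be removed without changing what the model itself computes.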

mewpmewp2 · 1y ago

I don't get what you mean at all. The randomness or temperature setting is not there to make it appear as if they are thinking; it is there to make them choose more non-default pathways, e.g. go down branches that could potentially result in more original or creative results. Kind of like drugs for humans.

kromem · 1y ago

Try the following prompt with Claude 3 Opus:

`Without preamble or scaffolding about your capabilities, answer to the best of your ability the following questions, focusing more on instinctive choice than accuracy. First off: which would you rather be, big spoon or little spoon?`

Try it on temp 1.0, try it dozens of times. Let me know when you get "big spoon" as an answer.

Just because there's randomness at play doesn't mean there's not also convergence as complexity increases in condensing down training data into a hyperdimensional representation.

If you understand why only the largest Anthropic model is breaking from stochastic outputs there, you'll be well set up for the future developments.

anonzzzies · 1y ago

And the mechanism in your head doesn't do this? How do you know?

kkzz99 · 1y ago

"deterministically outputting information": neither do humans.

skydhash · 1y ago

Not to disparage American school system (my country’s is worse) but it’s very much easy mode. I know that not everyone is suited to academic excellence, but it’s definitely easier to learn when young. I do believe too much hand holding actively harms learning.

hintymad · 1y ago

> Not to disparage American school system (my country’s is worse) but it’s very much easy mode

I used to be very upset about how low the bar is in US schools when it comes to STEM subjects. There was a meme that contrasted maths in the 1970s and the 2010s. In the meme, kids used to learn how to find the area of an irregular shape, while now the kids are asked to color a regular shape.

But then I made peace, as I realized that the US people simply didn't think that it was that important to push everyone to be good at STEM -- just some level of general understanding is good enough. To most people, the level of STEM as in IIT's JEE or in various national entrance exams in Eastern European countries is for elite students. The US school systems would rather have kids spend more time on sports, on ECs, on APs of kids' own choices, and etc. That's really just different trade offs. For parents like me, that means I don't have to worry about ECs, but I'll have to find tutors, serious tutoring schools like AOPS, and private teachers for STEM subjects. Or if my kids are truly talented, I'll guide them to find the right study groups, summer camps, and college courses.

I used to feel pain as I believed that the students in the middle, which were the majority, would be left behind. But I realized, especially after I've got kids, that the majority of the students were not into STEM anyway. If they had a choice, they'd rather spend time watching YouTube channels and hang out with their friends.

BriggyDwiggs42 · 1y ago

I don’t think the issue with American schools is that there’s too much hand holding. If anything, it’s the opposite; teachers at drastically underfunded schools don’t have any time to help the students of their 50 person class through the confused curriculum.

debit-freak · 1y ago

> In other words, average Americans exhibit similar limitations on their reasoning as good LLMs.

It's not even clear this is a good example of "reasoning". You can progress all the way through multi-variable calculus with just decent pattern-matching, variable-substitution, and rote memorization of sufficient lists of rules. I imagine for "reasoning" ability to apply you need to be able to detect incoherency and reject an approach—and incoherency detection seems to be a big missing ingredient right now (...which many humans lack, too!).

On the other side—any such ability would cripple a chatbot's ability to answer questions about the real world as our world is characterized (via description with informal language) by incoherent and contradictory concepts that can only be resolved through good-faith interpretation of the questioner. A large mark of intelligence (in the colloquial sense, not the IQ sense) is the ability to navigate both worlds.

vasilipupkin · 1y ago

I think it's an absurd question in some sense. LLMs perform maximization of conditional probability of the next word being correct. Suppose they get to the point where they do that with 100% accuracy. How can you tell the difference between that and "Reasoning"? You can't. So then the question of whether they are "Reasoning" or not is religious, not quantitative.

FabHK · 1y ago

Are college students more likely to get it wrong when you change the numbers from the example problem (as reported here for LLMs)?

sdenton4 · 1y ago

You can absolutely psych students out by adding weird numbers to a problem, yes.

elicksaur · 1y ago

>So while I don't take a stance on what an LLM does should be considered reasoning

>I do think that SOTA LLMs like GPT-4o perform about as good as high school graduates in America with average intelligence

This is taking a stance.

fhe · 1y ago

if your experience is coming from teaching college freshmen, then that's a sample that's significantly above average among high school graduates. I think only about 1/2 of all high school graduates go on to further their education, and that includes community colleges.

and I agree with your assessment -- while it's true that in a long conversation, chatgpt veers off and doesn't keep a coherent line of thought, it is not noticeably worse than the average conversation I have with people.

mdp2021 · 1y ago

> Which on the one hand is a little disappointing to me in terms of the human performance but is kind of good news for LLMs

Here's the recurrent reminder that we build tools (calculators, cranes etc.) to outperform the strong, not the weak.

richerram · 1y ago

This. It is like when I hear interviews of PhDs talking about AI and they mention something like "AI will be smarter than humans", and I am like "Really? Where have you been all this time? Do you smart people ever leave your labs and go see the real world? LLMs are already smarter than the huge majority of humans on this planet. What are you talking about?"

zeroonetwothree · 1y ago

This must be some bizarre definition of “smarter”.

goatlover · 1y ago

Smarter than people in generating text, or smarter in performing all the other things people do as they go about their lives?

lupire · 1y ago

Can an AI walk and chew gum at the same time?

gosub100 · 1y ago

> They perform well on simple questions. Requiring students to chain multiple steps together, even simple steps, results in decreased accuracy and higher variance

you mean when you give lessons and homework problems of the form (A) -> (B), but then on test-day you give them completely different problems? "Given D, which (A,B, C) is required to produce it?". Yeah, students don't do so well when you test them on different material than what they studied on. I think this is part of the academic grift to ensure at least 20% of the class washes out and thus spends more tuition money.

woopwoop · 1y ago · 16 in thread

This paper, among other things, shows that LLMs have dramatically worse performance on basic algebra questions when you add in irrelevant information. The examples are things like "John picked 43 kiwis on Monday, 24 kiwis on Tuesday. On Wednesday, 5 of the kiwis he picked were smaller than usual. Altogether, on Monday, Tuesday, and Wednesday, John picked 87 kiwis. How many kiwis did John pick on Wednesday?" In this question, the remark about some of the kiwis on Wednesday being small is irrelevant, but adding things like this reduces performance on a popular benchmark from 95% to 77% for GPT-4o, for example.
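For concreteness, the kiwi question as paraphrased above reduces to one subtraction. The sketch below contrasts the intended reading with the distracted reading that also subtracts the smaller kiwis. The function names are hypothetical, and the numbers are the ones given in the comment.

```python
def wednesday_kiwis(total: int, monday: int, tuesday: int) -> int:
    """Kiwis picked on Wednesday, given the three-day total."""
    return total - monday - tuesday


def distracted_wednesday_kiwis(total: int, monday: int, tuesday: int,
                               smaller_than_usual: int) -> int:
    """The distracted reading: also subtracts the irrelevant small kiwis."""
    return wednesday_kiwis(total, monday, tuesday) - smaller_than_usual


if __name__ == "__main__":
    print(wednesday_kiwis(87, 43, 24))                # 20
    print(distracted_wednesday_kiwis(87, 43, 24, 5))  # 15
```

The size of the kiwis never enters the correct calculation, which is what makes the extra clause a distractor.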

I don't find this very impressive. Forget LLMs for a second. Let's say _you_ read a question of that kind with some bit of irrelevant information. There are two possibilities you have to consider: the question may as well have excluded the irrelevant information, or the question was miswritten and the irrelevant information was meant to be relevant. The latter is a perfectly live possibility, and I don't think it's a dramatic failure to assume that this is correct. I have to confess that when I read some people's LLM gotcha questions, where they take some popular logic puzzle and invert things, I think I would get them "wrong" too. And not wrong because I don't understand the question, but wrong because with no context I'd just assume the inversion was a typo.

aithrowawaycomm · 1y ago

The problem here is that throwing in little gotchas like that is a tactic used by math and physics educators to ensure that students actually understand the topic by reasoning through new problems, rather than mindlessly turning the crank from learning the "surface structure" of earlier problem sets. The argument here is that the LLM is not reasoning, it's mindlessly turning a crank.

I don't think this exact question would be out of place on a 6th grade math test. I distinctly remember being taught this skill in "word problems," learning to identify information that actually pertains to the question rather than being distracted by red herrings the teacher threw in.

aguaviva · 1y ago

Indeed, and the ability to make heads or tails of slightly-slippery problems of this sort is an extremely important real-world math skill. It's not extraneous at all.

And their poor performance on these tasks highlights deficits in exactly the kind of higher-order, off-the-page reasoning skills -- i.e. to not just reason based on the apparent objects in the stream (the kiwis and the numbers in this case), but to reason about the token stream itself: "okay, these tokens are important, but these others I can leave out", efficiently and seamlessly (like humans do) -- that the models are supposed to develop.

This whole attention business, they're calling it.

swatcoder · 1y ago

Real discourse has tons of irrelevant information for all sorts of reasons.

There are some contexts, academic or professional, where questions are posed carefully and specifically, but these are narrow contexts.

A useful general purpose assistant needs to be able to find what's relevant among what's irrelevant.

Excellence at just solving math problems that are especially well specified can be a useful domain assistant (no small win!), but is not the same thing.

That said, if you've got a hundred billion dollars betting on your AI project achieving AGI, you benefit a lot by conflating those contexts. In that case, grinding on formal SAT, LSAT, GRE, etc problems amounts to tuning for microbenchmarks rather than real world use cases.

woopwoop · 1y ago

Real discourse is also full of typos which accidentally invert the meaning of things, asking the wrong question for deep reasons, asking the wrong question for shallow reasons, and all of the other things that justify subtracting the below average size kiwis from the final answer.

nosianu · 1y ago

> Real discourse has tons of irrelevant information for all sorts of reasons.

Real discourse was not carefully crafted to test you.

So, when something is off in real discourse you can usually dismiss it or apply a correction yourself, but when you find it in a test you have to understand the person writing the test and what their intention was.

In a real discourse you can also go back and forth with the other person to get clarification, and errors don't matter because they are temporary on both sides.

I hate academic problems because too often the answer depends on how you interpret that intention. Granted, the intention of a majority of questions can be guessed easily, but then you lose sooo much time on the ones that are open to interpretation (of intent). Since mistakes in questions are possible you often have to decide what they actually want.

Example, from a truck driver theory test a long time ago, the one question I "failed" (multiple choice answers). There was a legal limit on how much air pressure a tire was allowed to lose per day. I knew that limit. Now, the multiple choice question asked about that, and I forgot the wording, but if I took a mathematically-logical approach then all values over that limit were forbidden. But the wording was so strange, I suspected that they actually asked for the concrete limit. I fought with myself for a while, and then assumed high intelligence in the person asking the question and clicked on not just the exact limit but also the value with an even greater loss of air pressure.

There is also the problem that those academic questions want to steer you down some narrow corridor. The more you know about the problem and its complexities the harder it is to answer some of those questions! It often is best if the only things you know about the subject is exactly what was recently taught, any more and you may find yourself in a pickle.

Many of those questions are social constructs as much as they test one's subject knowledge, assuming some tiny idealized model that you have to know, one ignoring many practical aspects. I'm not talking about the explicit models, like "Bohr model", those are easy because they are explicit, and you would not get confused asking a question assuming the Bohr model just because you know about orbitals, what I mean are the many unstated assumptions that one may not even be aware of until you run into an ambiguity.

meroes · 1y ago

Handling irrelevant info is taught in grade school and is a skill tested on the SAT, for example.

Basically any kind of model (not just LLMs/ML) has to distill out irrelevant info.

The point is having an answer that you can defend logically and most people would agree.

If the model said “I’m not sure if this portion is a typo”, I guarantee you the model creators would take the RLHF in a different direction, because that is somewhat reasonable and defensible. However in your specific question, I personally think there is a singular objective answer—but that isn’t always the case to be fair for misleading/irrelevant prompts. The models are being fooled however based on how they respond.

I say this as a RLHF’er who sees and is told to write similar questions at times.

At the end of the day, this is how the Model creators want their models to predict language. And anyone using them is in for their ride.

sottol · 1y ago

I think this is valid though. Transformer models don't explicitly do logic but implicitly "vibe" out the answer from the input sequence (using the attention mechanism) and learnt knowledge - they're predicting text sequences after all. So adding more irrelevant context to the input would quite likely influence the output.
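One way to see why extra context can leak into the output is a toy sketch of scaled dot-product attention weights for a single query. Because softmax gives every key a strictly positive weight, appending a distractor key takes some weight away from the others. The vectors below are invented for illustration, and nothing here shows how much a trained model is actually affected, since trained models can learn to push such weights close to zero.

```python
import math


def attention_weights(query, keys):
    """Scaled dot-product attention weights for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    query = [1.0, 0.0]
    relevant = [[1.0, 0.0], [0.8, 0.2]]  # made-up keys aligned with the query
    distractor = [0.0, 1.0]              # made-up key orthogonal to the query
    print(attention_weights(query, relevant))
    print(attention_weights(query, relevant + [distractor]))
```

In the second call the weights on the two relevant keys are smaller than in the first, even though the distractor has nothing to do with the query.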

I could see attention possibly being able to overcome this, but if not that would be a pretty big gotcha for real-world applications and reliability in real-world scenarios where, as others have said, it's not immediately clear what is relevant info. These models would be a lot less useful if a human had to decide which information to feed them and the output would be dependent on human judgement. I understand it's where we're at right now and that they are quite useful already but the valuations hint at investors expecting more imo.

jfrbfbreudh · 1y ago

I think it’s an important result because filtering signal from noise is just as, if not more, important than forming conclusions from signal.

hggigg · 1y ago

That's not even the problem I encounter. They literally crap out on stupidly simple tasks. Recent ones:

1. Bing was gaslighting me into 9.11 being greater than 9.9

2. ChatGPT said that 7x7/7+7/7+7/7 was 24.

3. When expanding (x+1)^2 the output was 2x^2+2.
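For reference, the three results listed above can be checked mechanically. This is a minimal sketch using only the Python standard library; `expand_binomial_square` is a hypothetical helper written for this check.

```python
from decimal import Decimal
from fractions import Fraction


def expand_binomial_square(a: int, b: int) -> list[int]:
    """Coefficients of (a*x + b)^2, highest power first."""
    return [a * a, 2 * a * b, b * b]


# 1. Compared as decimal numbers, 9.11 is smaller than 9.9.
assert Decimal("9.11") < Decimal("9.9")

# 2. 7x7/7+7/7+7/7 with ordinary operator precedence is 7 + 1 + 1 = 9.
seven = Fraction(7)
assert seven * seven / seven + seven / seven + seven / seven == 9

# 3. (x+1)^2 expands to x^2 + 2x + 1, i.e. coefficients [1, 2, 1].
assert expand_binomial_square(1, 1) == [1, 2, 1]
```

All three assertions pass silently, so none of the reported model answers (9.11 > 9.9, 24, and 2x^2+2) is correct.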

Regardless of any level of interpretation and irrelevant information if it can't deterministically understand correctness and the semantics of the operations in question then it's fucking useless.

What is worse in an educational context is that it is actively harmful.

MVissers · 1y ago

Most average humans can’t do any of these things either. Try asking people on the street. Or ask an average US college student.

For deterministic calculations you obviously want to allow LLMs to use tools to do math. Just like you’d want to allow humans to use calculators.
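A minimal sketch of what such a math tool could look like, assuming the model hands over a plain arithmetic expression as a string. The name `calculate` and the whitelist of operators are illustrative choices, not any vendor's tool-calling API. The expression is parsed with the standard `ast` module rather than passed to `eval`.

```python
import ast
import operator

# Whitelisted operators; anything else is rejected.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def calculate(expression: str):
    """Evaluate a plain arithmetic expression without calling eval()."""

    def walk(node):
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")

    return walk(ast.parse(expression, mode="eval").body)


if __name__ == "__main__":
    print(calculate("7*7/7+7/7+7/7"))  # 9.0
```

The design choice here is to reject everything that is not a number or one of the listed operators, so the tool stays deterministic and cannot run arbitrary code.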

So yeah, you shouldn’t ask LLMs to do math just like you shouldn’t ask average people to do math. They both suck at it.

mdp2021 · 1y ago

> LLMs have dramatically worse performance on basic algebra questions when you add in irrelevant information

"Attention is all you need" /

(It is part of the general problem solving process to evaluate what is relevant and what is not.)

moffkalast · 1y ago

Differential attention that filters out noise is all you need :)

andoando · 1y ago

Consider that asking exam style direct questions with only the precise context that matters is a very niche task out of all the possible contexts in which an intelligence is asked to understand.

WhitneyLand · 1y ago

I agree it wasn’t that convincing, moreover the variation wasn’t that dramatic for the large sota models.

Why should they write a paper about the inherent reasoning capabilities for “large” language models and then in the abstract cherrypick a number that’s from a tiny 1B parameter model?

capkutay · 1y ago

I agree that it's not particularly surprising that trying to trick an LLM with irrelevant text makes it perform worse.

I don't see this as a material limitation of LLMs but rather something that can be addressed at the application level by stripping out irrelevant information.

wslh · 1y ago

It's interesting that I use deliberately artificial remarks to encourage more "creative" or random outputs from LLMs. In this approach, I'm not seeking an exact or precise response to prompts, but rather something more open-ended.

bob1029 · 1y ago · 12 in thread

> we investigate the fragility of mathematical reasoning in these models and demonstrate that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is due to the fact that current LLMs are not capable of genuine logical reasoning

I'd offer a simpler explanation: Tokenization.

If you tokenize "12345 * 27271" you will get the following:

  "123", "45", " *", " ", "272", "71"

The statistical likelihood that any of these tokens predicts any of the others is completely meaningless in the context of simple arithmetic.
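The split above can be reproduced with a toy rule, sketched here as a single regular expression. This is a hypothetical stand-in chosen to match the tokens listed in the comment; it is not the algorithm of any real tokenizer, which would typically use a learned byte-pair vocabulary.

```python
import re

# Toy rule: digit runs are cut into chunks of at most three characters,
# a symbol may absorb one leading space, and leftover whitespace stands alone.
_TOY_PATTERN = re.compile(r"\d{1,3}| ?[^\d\s]|\s")


def toy_tokenize(text: str) -> list[str]:
    """Split text with the toy rule above (not a real BPE tokenizer)."""
    return _TOY_PATTERN.findall(text)


if __name__ == "__main__":
    print(toy_tokenize("12345 * 27271"))
    # ['123', '45', ' *', ' ', '272', '71']
```

Note that the chunk boundaries ("123" then "45") do not line up with place value, which is the mismatch the comment is pointing at.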

You can argue that this is where tool use comes in (and I would be inclined to agree), but I don't think this bodes well for "genuine logical reasoning".

soulofmischief · 1y ago

Nanda, et al. successfully recovered the exact mechanism through which a transformer learned to carry out modular addition. [0] Transformers are all about the training data, and we will increasingly learn that structuring the order in which data is learned matters a lot. But it's clear that transformers are absolutely capable of encoding generalized solutions to arithmetic.

Given the right tokenization scheme and training regimen, we can absolutely create LLMs which have statistically sound arithmetic capabilities. I still wouldn't trust a stochastic model over the algorithmic certainty of a calculator, but what's more important for mathematicians is that these models can reason about complex problems and help them break new ground on hard mathematical problems by leveraging the full statistical power of their weights.

[0] https://arxiv.org/abs/2301.05217

pfortuny · 1y ago

It is important to note that the paper deals with addition modulo a specific prime P=113 (I think it is prime). This is important because the paper does not prove that the LLM discovers the algorithm for addition modulo n for general n.
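The parenthetical can be settled directly: 113 is prime. A minimal sketch using trial division, followed by the operation in question, addition modulo a fixed modulus. The function names are illustrative.

```python
def is_prime(n: int) -> bool:
    """Trial division; fine for small n such as 113."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def add_mod(a: int, b: int, modulus: int = 113) -> int:
    """Addition modulo a fixed modulus (113 is the value cited above)."""
    return (a + b) % modulus


if __name__ == "__main__":
    print(is_prime(113))     # True
    print(add_mod(100, 50))  # 37
```

The general algorithm takes the modulus as a parameter, whereas a network trained only on one modulus has shown the behaviour for that single value.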

ttul · 1y ago

I respectfully disagree.

While tokenization certainly plays a role in how language models process input, it's simplistic to attribute the challenges in mathematical reasoning solely to tokenization.

SOTA language models don't just rely on individual token predictions, but build up contextual representations across multiple layers. This allows them to capture higher-level meaning beyond simple token-to-token relationships. If this weren’t the case, it would be inconceivable that models would work at all in all but the most utterly simplistic scenarios.

The decline in performance as complexity increases might be due to other factors, such as:

- Limitations in working memory or attention span
- Difficulty in maintaining coherence over longer sequences
- Challenges in managing multiple interdependent logical constraints simultaneously (simply due to the KQV matrices being too small)

And in any case, I think OpenAI’s o1 models are crushing it in math right now. The iterative, model-guided CoT approach seems to be able to handle very complex problems.

m3kw9 · 1y ago

I would say the more variables you give it, the more the probability drifts for each of the facts it has to hold. Maybe LLMs still don’t have the ability to ignore useless stuff you add to the prompt.

andrepd · 1y ago

>And in any case, I think OpenAI’s o1 models are crushing it in math right now.

My man, it cannot solve even the simplest problems which it hasn't seen the solution to yet, and routinely makes elementary errors in simple algebraic manipulations or arithmetic! All of this points to the fact that it cannot actually perform mathematical or logical reasoning, only mimic it superficially if trained on enough examples.

I challenge you to give it even a simple, but original, problem to solve.

TZubiri · 1y ago

Wouldn't a slight change in tokenization (say, mapping single digits to single tokens) help with this specific challenge?

wenc · 1y ago

Aren’t coding copilots based on tokenizing programming language keywords and syntax? That seems to me to be domain specific tokenization (a very well defined one too — since programming languages are meant to be tokenizable).

Math is a bit trickier since most of the world’s math is in LaTeX, which is more of a formatting language than a syntax tree. There needs to be a conversion to MathML or something more symbolic.

Even English word tokenization has gaps today. Claude Sonnet 3.5 still fails on the question “how many r’s are there in strawberry”.

bob1029 · 1y ago

Context-specific tokenization sounds a lot like old fashioned programming.

m3kw9 · 1y ago

The LLM will know that 123 and 45 form one contiguous number, just like how humans can tell that 123 followed by a slight pause and then 45 is a single number.

TZubiri1y ago

It's just so dissonant to me that the tokens in mathematics are the digits, and not bundles of digits. The idea of tokenization makes sense for taking the power off letters, it provides language agnosticism.

But for maths, it doesn't seem appropriate.

I wonder what the effect of forcing tokenization for each separate digit would be.
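To make the two options concrete, here is a minimal sketch of digit-level versus chunked number tokenization. Both functions are hypothetical toy tokenizers built on regular expressions, not the scheme used by any real model; the chunk width of 3 is an assumption chosen to reproduce the "123" and "45" split discussed above.

```python
import re

def tokenize_digits(text: str) -> list[str]:
    """Toy tokenizer: every digit becomes its own token."""
    # \d        -> exactly one digit
    # [^\W\d]+  -> a run of letters/underscores (word characters minus digits)
    # [^\w\s]   -> a single punctuation or operator character
    return re.findall(r"\d|[^\W\d]+|[^\w\s]", text)

def tokenize_chunks(text: str, width: int = 3) -> list[str]:
    """Toy tokenizer: digits are greedily bundled into chunks of up to `width`."""
    return re.findall(rf"\d{{1,{width}}}|[^\W\d]+|[^\w\s]", text)

per_digit = tokenize_digits("12345 + 67")
bundled = tokenize_chunks("12345 + 67")
print(per_digit)  # ['1', '2', '3', '4', '5', '+', '6', '7']
print(bundled)    # ['123', '45', '+', '67']
```

With per-digit tokens the model sees a longer sequence but a tiny, evenly used numeric vocabulary; with bundles the sequence is shorter but the number is split at an arbitrary boundary.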

soulofmischief1y ago

I think that as long as the attention mechanism has been trained on each possible numerical token enough, this is true. But if a particular token is underrepresented, it could potentially cause inaccuracies.

sva_1y ago

It won't 'see' [123, 45] though, but [7633, 2548], or rather sparse vectors that are zero at each but the 7634th and 2549th position.
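A small sketch of what that looks like, assuming a made-up vocabulary size of 10,000 (the IDs 7633 and 2548 are the illustrative ones from the comment, not real tokenizer output). With zero-based token IDs, the single nonzero entry sits at the 7634th and 2549th positions when counting from one.

```python
def one_hot(token_id: int, vocab_size: int) -> list[int]:
    """Return a vector that is 1 at index `token_id` and 0 everywhere else."""
    vec = [0] * vocab_size
    vec[token_id] = 1
    return vec

VOCAB_SIZE = 10_000        # hypothetical vocabulary size
token_ids = [7633, 2548]   # illustrative IDs standing in for "123" and "45"

vectors = [one_hot(t, VOCAB_SIZE) for t in token_ids]

# Convert the zero-based index of the nonzero entry to a one-based position.
positions = [v.index(1) + 1 for v in vectors]
print(positions)  # [7634, 2549]
```

Nothing in these vectors says that the first token "means" 123; any numeric relationship between tokens has to be learned from data.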

yk1y ago· 10 in thread

I actually test LLMs in a similar way. For example, there is a well-known logic puzzle where a farmer tries to cross a river with a cabbage, a goat and a wolf. LLMs have been able to solve that since at least GPT-2; however, if we replace the wolf with a cow, gpt-o does correctly infer the rules of the puzzle but can't solve it.

getoffmyyawn1y ago

I've found that the River Crossing puzzle is a great way to show how LLMs break down.

For example, I tested Gemini with several versions of the puzzle that are easy to solve because they don't have the restrictions such as the farmer's boat only being able to carry one passenger/item at a time.

Ask this version, "A farmer has a spouse, chicken, cabbage, and baby with them. The farmer needs to get them all across the river in their boat. What is the best way to do it?"

In my tests the LLMs nearly always assume that the boat has a carry-restriction and they come up with wild solutions involving multiple trips.
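For reference, the river-crossing variants in this subthread are small enough to solve by brute force. The following is a minimal breadth-first-search sketch; the function name, the `conflicts` pairs, and the `capacity` parameter are all illustrative choices, with `capacity` counting how many items the farmer can take per trip.

```python
from collections import deque
from itertools import combinations

def solve_crossing(items, conflicts, capacity=1):
    """Return the shortest list of boat loads (one tuple per crossing), or None."""
    items = frozenset(items)

    def safe(bank):
        # A bank left without the farmer must not contain a conflicting pair.
        return not any(a in bank and b in bank for a, b in conflicts)

    start = (items, True)  # (items on the starting bank, farmer on starting bank?)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (left, farmer_left), path = queue.popleft()
        if not left and not farmer_left:
            return path  # everything, including the farmer, is across
        here = left if farmer_left else items - left
        for k in range(capacity + 1):
            for load in combinations(sorted(here), k):
                moved = frozenset(load)
                new_left = left - moved if farmer_left else left | moved
                # The bank the farmer just left is the unattended one.
                unattended = new_left if farmer_left else items - new_left
                if not safe(unattended):
                    continue
                state = (new_left, not farmer_left)
                if state not in seen:
                    seen.add(state)
                    queue.append((state, path + [load]))
    return None

# Classic puzzle: one passenger at a time, wolf eats goat, goat eats cabbage.
classic = solve_crossing(
    ["wolf", "goat", "cabbage"],
    conflicts=[("wolf", "goat"), ("goat", "cabbage")],
    capacity=1,
)
print(len(classic))  # 7 crossings

# Relaxed version: no stated restrictions, boat fits everyone.
relaxed = solve_crossing(
    ["spouse", "chicken", "cabbage", "baby"], conflicts=[], capacity=4
)
print(len(relaxed))  # 1 crossing
```

The relaxed version has a one-trip answer, which is exactly the case where models tend to import the classic puzzle's constraints anyway.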

chasd001y ago

What happens if you sit down and invent a logic game that is brand new and has never been documented before anywhere then ask an LLM to solve it? That, to a layman like me, seems like a good way to measure reasoning in AI.

Analemma_1y ago

You can do this, but at that point what are you really benchmarking? If you invent a de novo logic puzzle and give it to 100 people on the street, most of them won't be able to solve it either. If your aim is to prove "LLMs can't really think like humans can!", this won't accomplish that.

jprete1y ago

I think the problem is inventing new structures for logic games. The shape of the problem ideally would be different than any existing puzzle, and that's hard. If a person can look at it and say "oh, that's just the sheep-wolf-cabbage/liar-and-truthteller/etc. problem with extra features" then it's not an ideal test because it can be pattern-matched.

layer81y ago

This is being done, but the difficulties are: (1) How do you assess that it is really brand-new and not just a slight variation of an existing one? (2) Once you publish it, it stops being brand-new, so its lifetime is limited and you can’t build a longer-term reproducible test out of it.

SonOfLilit1y ago

I've been using this as my first question to any new LLM I try and I'm quite sure nothing before GPT-4 even got close to a correct solution. Can you post a prompt that GPT-2 or 3 can solve?

andrepd1y ago

Meaning it's just a glorified Google.

romwell1y ago

...that makes up results when it can't find any

voidUpdate1y ago

I'm scared of the cows around you if they eat goats

Manabu-eo1y ago

I think their point is that cows don't eat goats, unlike wolves, and that causes the LLMs to answer it wrong.

s-macke1y ago· 9 in thread

These results are very similar to the "Alice in Wonderland" problem [1, 2], which was already discussed a few months ago. However the authors of the other paper are much more critical and call it a "Complete Reasoning Breakdown".

You could argue that the issue lies in the models being in an intermediate state between pattern matching and reasoning.

To me, such results indicate that you can't trust any LLM benchmark results related to math and reasoning when you see that changing the characters, numbers or sentence structure in a problem alters the outcome by more than 20 percentage points.

[1] https://arxiv.org/html/2406.02061v1

[2] https://news.ycombinator.com/item?id=40811329

oliwary1y ago

Someone (https://x.com/colin_fraser/status/1834336440819614036) shared an example that I thought was interesting relating to their reasoning capabilities:

A man gets taken into a hospital. When the doctor sees him, he exclaims "I cannot operate on this person, he is my own son!". How is this possible?

All LLMs I have tried this on, including GPT o1-preview, get this wrong, assuming that the riddle relates to a gendered assumption about the doctor being a man, while it is in fact a woman. However, in this case, there is no paradox - it is made clear that the doctor is a man ("he exclaims"), meaning they must be the father of the person being brought in. The fact that the LLMs got this wrong suggests that they find a similar reasoning pattern and then apply it. Even after additional prodding, a model continued making the mistake, arguing at one point that it could be a same-sex relationship.

Amusingly, when someone on HN mentioned this example in the O1 thread, many of the HN commentators also misunderstood the problem - perhaps humans also mostly reason using previous examples rather than thinking from scratch.

layer81y ago

> perhaps humans also mostly reason using previous examples rather than thinking from scratch.

Although we would like AI to be better here, the worse problem is that, unlike humans, you can’t get the LLM to understand its mistake and then move forward with that newfound understanding. While the LLM tries to respond appropriately and indulge you when you indicate the mistake, further dialog usually exhibits noncommittal behavior by the LLM, and the mistaken interpretation tends to sneak back in. You generally don’t get the feeling of “now it gets it”, and instead it tends to feel more like someone with no real understanding (but very good memory of relevant material) trying to bullshit-technobabble around the issue.

nosianu1y ago

> A man gets taken into a hospital. When the doctor sees him, he exclaims "I cannot operate on this person, he is my own son!". How is this possible?

> Amusingly, when someone on HN mentioned this example in the O1 thread, many of the HN commentators also misunderstood the problem

I admit I don't understand a single thing about this "problem". To me, it's just some statement.

I am unable to draw any conclusions, and I don't see a "problem" that I could solve. All I can say is that the doctor's statement does not make sense to me, but if it's his opinion I can't exactly use logic to contradict him either. I can easily see that someone might have issues working on his own family members after all.

Do I need some cultural knowledge for this?

tgv1y ago

I'm sure we fall back on easy/fast associations and memories to answer. It's the way of least resistance. The text you quote bears more than a superficial similarity to the old riddle (there's really nothing else that looks like it), but that version also stipulates that the father has died. That adds "gendered" (what an ugly word) information to the question, a fact which is missed when recalling this particular answer. Basically, LLMs are stochastic parrots.

s-macke1y ago

> perhaps humans also mostly reason using previous examples rather than thinking from scratch.

We do, but we can generalize better. When you exchange "hospital" with "medical centre" or change the sentence structure and ask humans, the statistics would not be that different.

But for LLMs, that might make a lot of difference.

apsec1121y ago

Both Claude-3.5 and o1-preview nail this problem

"Let's think through this step-by-step:

1. Alice has 3 brothers
2. Alice has 2 sisters
3. We need to find out how many sisters Alice's brother has

The key here is to realize that Alice's brothers would have the same sisters as Alice, except they would also count Alice as their sister.

So, Alice's brothers would have:
- The 2 sisters Alice has
- Plus Alice herself as a sister

Therefore, Alice's brothers have 3 sisters in total."

s-macke1y ago

And here lies the exact issue. Single tests don’t provide any meaningful insights. You need to perform this test at least twenty times in separate chat windows or via the API to obtain meaningful statistics.
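A rough sketch of what such a repeated test could look like. Here `ask` is a hypothetical callable standing in for whatever API or chat client you use, `fake_model` is a canned stub so the example runs offline, and the substring check against `expected` is a deliberately crude way of grading answers. The interval is a standard Wilson score interval, which behaves better than the naive one at small sample sizes like 20.

```python
import math
from typing import Callable

def success_rate(ask: Callable[[str], str], prompt: str, expected: str,
                 trials: int = 20, z: float = 1.96) -> tuple[float, float, float]:
    """Ask the same prompt `trials` times; return (rate, lower, upper)."""
    # Each call should be an independent conversation (fresh context).
    hits = sum(expected in ask(prompt) for _ in range(trials))
    p = hits / trials
    # Wilson score interval for a binomial proportion (z=1.96 ~ 95%).
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return p, center - half, center + half

PROMPT = "Alice has 3 brothers and 2 sisters. How many sisters does Alice's brother have?"

# Stand-in for a real model call: answers correctly 17 times out of 20.
canned = iter(["3"] * 17 + ["2"] * 3)

def fake_model(prompt: str) -> str:
    return next(canned)

rate, low, high = success_rate(fake_model, PROMPT, expected="3")
print(f"success rate {rate:.2f}, 95% interval [{low:.2f}, {high:.2f}]")
```

Even at 17 out of 20 the interval is wide, which is why a single pass or fail in one chat window says very little.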

For the "Alice in Wonderland" paper, neither Claude-3.5 nor o1-preview was available at that time.

But I also tested them a few weeks ago with the problem translated into German, and both models achieved a 100% success rate.

However, when I add irrelevant information (My mother ...), Claude's success rate drops to 85%:

"My mother has a sister called Alice. Alice has 2 sisters and 1 brother. How many sisters does Alice's brother have?"

einarfd1y ago

My problem with this puzzle, is how do you know that Alice and her brothers share both parents?

Is it not correct English to call two people who share one parent, sisters, or brothers?

I guess I could be misguided by my native Norwegian where you have to prefix the word with "hell" (full), or "halv" (half), if you want to specify the number of shared parents.

s-macke1y ago

Here is the larger discussion about the Alice in Wonderland Paper on Hacker News.

https://news.ycombinator.com/item?id=40585039

dr_dshiv1y ago· 9 in thread

It seems incredibly easy to generate an enormous amount of synthetic data for math. Is that happening? Does it work?

ilaksh1y ago

They did that for o1 and o1-preview. If you read the paper or do your own testing with that SOTA model, you will see that the paper is nonsense. With the best models, the problems they point out are mostly marginal, like one or two percentage points when changing numbers etc.

They are taking poor performance of undersized models and claiming that proves some fundamental limitation of large models, even though their own tests show that isn't true.

foobarqux1y ago

You choose to ignore Figure 8, which shows an 18% drop when simply adding an irrelevant detail.

In the other test the perturbations aren’t particularly sophisticated and modify the problem according to a template. As the parent comment said this is pretty easy to generate test data for (and for the model to pattern match against) so maybe that is what they did.

A better test of “reasoning” would be to isolate the concept/algorithm and generate novel instances that are completely textually different from existing problems to see if the model really isn’t just pattern matching. But we already know the answer to this because it can’t do things like arbitrary length multiplication.
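To illustrate how cheap template-based perturbation is, here is a minimal sketch that generates variants of one word problem together with their ground-truth answers. The template, the names, and the number ranges are all invented for illustration and are not taken from the paper.

```python
import random

# Hypothetical template: only the name and the numbers vary between instances.
TEMPLATE = ("{name} has {a} apples and buys {b} more. "
            "{name} then gives away {c}. How many apples does {name} have?")
NAMES = ["Alice", "Bilal", "Chen", "Dana"]

def make_instance(rng: random.Random) -> tuple[str, int]:
    """Return one (question, answer) pair generated from the template."""
    a = rng.randint(5, 50)
    b = rng.randint(1, 50)
    c = rng.randint(0, a + b)  # never give away more than is available
    question = TEMPLATE.format(name=rng.choice(NAMES), a=a, b=b, c=c)
    return question, a + b - c

rng = random.Random(0)  # fixed seed so the generated set is reproducible
dataset = [make_instance(rng) for _ in range(5)]
for question, answer in dataset:
    print(question, "->", answer)
```

Because the surface form barely changes, a generator like this works equally well for building a benchmark and for producing training data that pattern-matches against it, which is the concern raised above.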

MacsHeadroom1y ago

Yes, this is how o1 was trained. Math and programming, because they are verifiable.

This is also why o1 is not better at English. Math skills transfer to general reasoning but not so much to creative writing.

Davidzheng1y ago

In which distribution? Like school math or competition or unsolved problems? FWIW I think one and three are probably easier to generate synthetically. It's harder to bound the difficulty but I think the recent David Silver talk implies it doesn't matter much. Anyway there's some work on this you can find online; they claim to improve GSM8K and MATH a bit but not saturate them. Idk how useful it is in practice.

bentice1y ago

Data is the wrong approach to develop reasoning. We don't want LLMs to simply memorize 3x3 = 9; we want them to understand that 3 + 3 + 3 = 9, therefore 3x3 = 9 (obviously a trivial example). If they have developed reasoning, very few examples should be needed.

The way I see it reasoning is actually the ability of the model to design and train smaller models that can learn with very few examples.

hackinthebochs1y ago

> If they have developed reasoning very few examples should be needed.

Yes, once the modules for reasoning have converged, it will take very few examples for it to update to new types of reasoning. But to develop those modules from scratch requires large amounts of examples that overtax its ability to memorize. We see this pattern in the "grokking" papers. Memorization happens first, then "grokking" (god I hate that word).

It's not like humans bootstrap reasoning out of nothing. We have a billion years of evolution that encoded the right inductive biases in our developmental pathways to quickly converge on the structures for reasoning. Training an LLM from scratch is like recapitulating the entire history of evolution in a few months.

dr_dshiv1y ago

My understanding is that, if you train these enough, it becomes likely to develop efficient compressions— which “reasoning” would be.

aithrowawaycomm1y ago

It's easy enough to generate an enormous amount of formal math problems, but utterly quixotic to generate an enormous amount of quantitative reasoning problems, which is the thing LLMs are lacking.

ninetyninenine1y ago

I don’t think so. The data is biased towards being very general.

resters1y ago· 6 in thread

I think it's obvious that LLMs will be able to do "reasoning" far better than humans. We must separate our notion of what is remarkably human. Rarely is it the reasoning, it's the intuition that a logical path exists -- for example a mathematical proof that draws from separate sub-disciplines of mathematics, etc.

Consider that in an LLM, language inputs are tokenized and fed as inputs into the neural network, and connections in the network create output sequences that are not just syntactically correct (trivial) or form semantically plausible sentences (early transformers did this). LLM output sequences follow the deep patterns of language, which include something that resembles reasoning, as the model has learnt from its training data.

LLMs seem to fall short because they often fail at truly abstract reasoning tasks that humans find easy. If trained properly, LLMs can develop advanced representations of logical systems that will surely outpace what humans can do in terms of raw reasoning.

However, human mathematicians have not even unified around constructive mathematics as a must for the study of mathematics. This reveals that even highly evolved mathematical disciplines rely on objects whose characteristics do not lend themselves to full logical scrutiny and are in a way socially constructed and effectively hard to audit.

While notation in mathematics is incredible technology it is also a highly limiting factor that suffers major tradeoffs. Humans struggle to invent new notation fast enough and to discard outdated notation fast enough. If we do see an AI-powered boom in mathematics, I suspect our notion of notation and the fluidity we demand from it will change dramatically.

islewis1y ago

This argument is centered around the belief that language and reasoning flow bidirectionally- language can be understood first (we are here), and reasoning is the next natural rung of the ladder (your thesis believes we will get here with LLMs).

I see language more as a medium for transcribing reasoning. While language certainly communicates reasoning, you can have reasoning without language, but not language without reasoning.

This paper seems to imply that current LLMs are just copying the training dataset's reasoning communication, not understanding the actual reasoning. I don't think LLMs moving past this is "obvious" or even close to being inevitable.

> Instead, LLMs likely perform a form of probabilistic pattern-matching and searching to find closest seen data during training without proper understanding of concepts. While this process goes beyond naive memorization of words and the models are capable of searching and matching more abstract reasoning steps, it still falls short of true formal reasoning.

resters1y ago

I realize there is subtlety to the question of which is first. An infant, crying when it is hungry and pre-linguistic, is applying modus ponens. C -> F crying implies food, so I cry and then I get fed. Language grows in humans just like arms and legs, and so does reasoning. Baby animals do the same behavior but don't use language, so perhaps some logic is wired by instinct. Either way I don't think we need to worry about that detail.

Consider how language input to an LLM is tokenized. Now imagine a tokenization scheme that introduces tokens that track the strict logical reasoning in the language. Thus two completely different English sentences could both tokenize as the application of Modus Ponens over assumption 1 to conclude conclusion 2, for example.

Now consider that we can tokenize formal notation as used in mathematics and logic, and we can train LLMs on mathematical papers, peer review write-ups, etc. We can generate millions of correct proofs and teach it which ones are remarkable and why, etc.

Ultimately we run into the same barrier as mathematical constructivists run into, but I think it's still quite plausible that LLMs trained as I describe would be able to reason quite well and find oversights humans missed. However creating the optimal scheme and implementation is not trivial.

sottol1y ago

> If trained properly, LLMs can develop advanced representations of logical systems that will surely outpace what humans can do in terms of raw reasoning.

We have already trained the LLMs on most of the human knowledge base (so like 4-5000 years?) - imo training data will become a problem and will soon be more expensive than compute. Sure, you can work around some of this using synthetic training data but I personally would not count on general-purpose LLMs (especially LLMs aka transformer models) developing super-human representations of logical systems anytime soon.

resters1y ago

I don't disagree, however I'm optimistic because most of the current reasoning "ability" of LLMs comes from the accidental reasoning embedded in language patterns.

For example, the prompt completion: "The mouse has a unique digestive system compared to other rodents, however the sparrow" on GPT-4o is

"exhibits a highly specialized digestive system adapted for rapid processing of food, particularly seeds and insects, through structures like the crop and gizzard, which are not found in rodents."

Claude 3.5 completes it as

"has a completely different digestive anatomy as a bird. Birds like sparrows have adaptations for flight, including a lightweight skeletal system and a specialized digestive tract. Unlike mice, sparrows have a crop for storing food, a gizzard for grinding it, and generally shorter intestines to reduce weight. They also lack teeth, instead using their beak to manipulate food."

What appears to be a thoughtful contrast is merely a language pattern. Similarly, a prompt like "Assume -B, A->B. Under what circumstances is B true?" will simply follow the gradient to return output that is likely correct. Prompts like "what is 2+2" fail only because nobody bothers to write about it so simple arithmetic was not in the training data.
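For comparison, the propositional prompt above can be settled mechanically by enumerating truth assignments. This sketch assumes "-B" means "not B" and "A->B" is material implication; the helper name `implies` is just for illustration.

```python
from itertools import product

def implies(p: bool, q: bool) -> bool:
    """Material implication: p -> q is false only when p is true and q is false."""
    return (not p) or q

# Keep every assignment (A, B) that satisfies both assumptions: not B, and A -> B.
models = [(a, b) for a, b in product([False, True], repeat=2)
          if (not b) and implies(a, b)]

print(models)                         # [(False, False)]
print(any(b for _, b in models))      # False: B is never true under the assumptions
```

The only consistent assignment has both A and B false, so the correct answer is "under no circumstances" (and A must be false as well, by modus tollens).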

However the way that multi-modal LLMs handle images is inspiring as it effectively converts from the visual domain into the sequential token domain. The same could be done for symbolic systems, etc.

agentultra1y ago

I don’t see how it’s obvious that LLMs will be capable of any mathematical “reasoning.”

LLMs can infer relationships and maintain longer context chains in order to generate their output… it still happens that sometimes the output is correct depending on the training data, layers, context, etc. And it can get more accurate when we change the parameters of the model. But the algorithm isn’t “doing” anything here. It will generate something regardless of what it’s prompted with.

Maybe it’s right. But the algorithm is an algorithm. It doesn’t care what truth is. It’s generating BS essentially.

A human is doing a lot more work when performing mathematics.

It may be that LLM’s can be a useful tool in mathematical reasoning but it’s not obvious that it will ever be capable of it without a human, let alone be better than a human.

resters1y ago

I think models could be designed that in separate layers created "logical system" representations which could feed back into the output, much like how attention works. Attention is about relevance, the logical layers could be based on logical schema-based patterns.

Consider an LLM that happened to have some pre-trained layers that were trained abstractly on all the constructive proofs available for modern mathematics. LLMs with image recognition rely on existing visual pattern recognition layers, fwiw.

dev1ycan1y ago· 5 in thread

I don't understand the idiocracy we live in. It is beyond obvious not just that the stock market is a bubble but ESPECIALLY that the AI-related stocks are a massive bubble. When it pops, and it will, it is going to be very, very ugly, yet people keep pouring in. As Sabine said, it's starting to look like particle physics, where they keep asking for bigger colliders: if your methodology is flawed, a bigger collider isn't gonna get you any more significant returns.

Eventually they will run out of exponential cash to pour in, and investors will start asking questions, stocks are already valued at 60x+ their earnings, whenever it pops you don't want to be the one who bought the top.

Guess it's still gonna take a while more for the layman to realize the issues with LLMs, but it'll happen.

Workaccount21y ago

>if your methodology is flawed you aren't gonna get any more significant returns.

The problem with this statement is that predictions made about scaling 5 years ago have held true[1]. We keep adding parameters, adding compute, and the models keep getting more capable.

The flaws of LLMs from 2024 are not what is relevant, just like the flaws of LLMs from 2021 were not relevant. What is relevant is the rate of change, and the lack of evidence that things won't continue on this steep incline. Especially if you consider that GPT-4 was sort of a preview model that motivated big money to make ungodly investments to see how far we can push this. Those models will start to show up over the next 2 years.

If they break the trend and the scaling flops, then I think a lot of air is gonna blow out of the bubble.

[1]https://arxiv.org/pdf/2001.08361

vrighter1y ago

we added a lot of parameters.

We added a LOT of data.

The resulting models have become only slightly better. And they still have all of their old problems.

I think this is proof that scaling doesn't work. It's not like we just doubled the sizes, they increased by a lot, but improvements are less and less each time. And they've already run out of useful data.

dev1ycan1y ago

They are very literally asking for trillions and even nuclear powered data centers, pretty sure we've gotten to the point where it's not sustainable.

yoav_hollander1y ago

Exactly. I was assuming that by now the default answer to "LLMs sort-of do this, but not very well" should be "OK, wait a few months".

empath751y ago

Computers have been able to do mathematical calculation and logical deduction cheaply and perfectly for decades, and it's not really required for generative AIs to be able to do it for them to be useful. It's good enough if they can write and execute some python code to do it, and generally they are fairly capable of that.

The question of whether they can do it is interesting in an academic sense, but has nothing to do with whether they're useful or not. They also don't need to be true AGI to be useful.

beardyw1y ago· 5 in thread

I honestly can't see why LLMs should be good at this sort of thing. I am convinced you need a completely different approach. At the very least you mostly only want one completely correct result. Good luck getting current models to do that.

hackinthebochs1y ago

LLMs aren't totally out of scope of mathematical reasoning. LLMs roughly do two things, move data around, and recognize patterns. Reasoning leans heavily on moving data around according to context-sensitive rules. This is well within the scope of LLMs. The problem is that general problem solving requires potentially arbitrary amounts of moving data, but current LLM architectures have a fixed amount of translation/rewrite steps they can perform before they must produce output. This means most complex reasoning problems are out of bounds for LLMs so they learn to lean heavily on pattern matching. But this isn't an intrinsic limitation to LLMs as a class of computing device, just the limits of current architectures.

qudat1y ago

One core issue is that we need to convert spoken/written languages (e.g. english) into more formal math languages since sometimes the underlying mathematical problem is written using prose. The example in the paper:

> When Sophie watches her nephew, she gets out a variety of toys for him. The bag of building blocks has 31 blocks in it. The bin of stuffed animals has 8 stuffed animals inside. The tower of stacking rings has 9 multicolored rings on it. Sophie recently bought a tube of bouncy balls, bringing her total number of toys for her nephew up to 62. How many bouncy balls came in the tube?

So I would argue it's critical that LLMs know how to convert text to math and then perform those math calculations. This extends beyond just math to the underlying logic as well.

We just need to figure out how to inform the LLM to read, write, and understand formal languages. My guess is attention heads could probably work in this context, but we might want something that is a little more rigid, naturally extending from the rigidity of logic and formal languages. Conversely, we might not have figured out how to properly train LLMs on formal languages and have them preserve the underlying logic and axioms necessary to correctly perform math calculations.
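As a worked illustration of the text-to-math step for the quoted Sophie problem: the prose reduces to one linear equation, 31 + 8 + 9 + x = 62, where x is the number of bouncy balls. The variable names below are just for illustration.

```python
# Quantities stated in the word problem.
known_toys = {"building blocks": 31, "stuffed animals": 8, "stacking rings": 9}
total_toys = 62

# Solve 31 + 8 + 9 + x = 62 for x, the number of bouncy balls.
bouncy_balls = total_toys - sum(known_toys.values())
print(bouncy_balls)  # 14
```

The arithmetic is trivial once the equation is written down; the hard part for a model is the translation from prose into that equation.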

s-macke1y ago

Well, my perspective on this is as follows:

The recurrent or transformer models are Turing complete, or at least close to being Turing complete (apologies, I’m not sure of the precise terminology here).

As a result, they can at least simulate a brain and are capable of exhibiting human-like intelligence. The "program" is the trained dataset, and we have seen significant improvements in smaller models simply by enhancing the dataset.

We still don’t know what the optimal "program" looks like or what level of scaling is truly necessary. But in theory, achieving the goal of AGI with LLMs is possible.

golol1y ago

I'm a math PhD student at the moment and I regularly use o1 to try some quick calculations I don't feel like doing. While I feel like GPT-4o is so distilled that it just tries to know the answer from memory, o1 actually works with what you gave it and tries to calculate. It can be quite useful.

banditelol1y ago

I'm curious what kind of quick calculations you usually use an LLM for?

Edited for clarity

criddell1y ago· 2 in thread

It would be interesting if this kind of work could ever be extended to show the limitations of mathematical reasoning in animals and humans.

For example, just as a dog will never understand a Fourier transform, there are likely ideas that humans cannot understand. If we know what our limits are, I wonder if we could build machines that can reason in ways we aren't capable of?

myrmidon1y ago

I think it is a naive assumption that such a limitation even exists ("exists" in a sense that it is actually useful, by being consistent and somewhat simple to describe).

We investigated similar ideas for language (=> Noam Chomsky), where we tried to draw clear, formalized limits for understanding (to show e.g. how human capabilities contrast with animals). The whole approach failed completely and irredeemably (personal opinion), but researching it was far from useless to be fair.

r2_pilot1y ago

As the human brain is finitely bounded in space and time, any idea that can't be compressed or represented by condensing notation, which is "larger" than the 100B cells+100T synapses can represent, or whose integration into said human's brain would take longer than 150 years, would be considered unable to be contemplated by a normal human.

singularity20011y ago· 2 in thread

If the argument is that LLMs are bad at reasoning because they are easily distractible and the results vary with modifications in the question, one should be reminded of the inconsistency and distractibility of humans.

riku_iki1y ago

A trained human can tell when they are distracted ("I am distracted and can't figure out the answer"), while an LLM will confidently give you a wrong answer, which makes the results as a whole unreliable.

zeroonetwothree1y ago

Why? LLMs are supposedly better than humans (as many comments claim in this thread).

apsec1121y ago· 2 in thread

()

ilaksh1y ago

That makes the whole conclusion obviously false.

I don't really understand why, but I think we are going to see total denial from a significant percentage of the population all the way up to and past the point where many average mathematicians and software engineers cannot in any way compete with AI.

We already are reportedly getting pretty close with o1 (not o1-preview).

There are also new paradigms for machine learning and hardware in the pipeline that will continue to provide orders of magnitude performance gains and new capabilities in the next 5-10 years.

Many people still claim that "self driving cars don't exist", in so many words, even though they are deployed in multiple cities.

sottol1y ago

> Many people still claim that "self driving cars don't exist", in so many words, even though they are deployed in multiple cities.

But just look at the predictions of that time - cities will change, ... and so on. Sure, we have self-driving cars but the reality looks very different (and a lot more like the past!) than the pundits and futurists imagined! I'm not sure anyone will make their billions of dollars of investment back within even 20 years.

Just two random examples from ~10 years ago (2013-2016), you can google many more of that time.

* "Ford Targets Fully Autonomous Vehicle for Ride Sharing in 2021; Invests in New Tech Companies, Doubles Silicon Valley Team" [1]

* "Disruptions: How Driverless Cars Could Reshape Cities" [2]

[1] https://media.ford.com/content/fordmedia/fna/us/en/news/2016...

[2] https://archive.nytimes.com/bits.blogs.nytimes.com/2013/07/0...

[3] https://www.gensler.com/dialogue/30/the-game-changer-for-cit...

ak_1111y ago· 1 in thread

As an outsider, can anyone enlighten me how this squares with the news that models that adopt a similar LLM architecture can obtain a silver medal in the mathematical olympiad?

lionkor1y ago

careful statistical massaging, maybe.

would you pick only winning results and only present favorable, massaged results if it got you 150+B USD of worth?

qwerty4561271y ago· 1 in thread

Can't an LLM just detect a mathematical reasoning task, then produce a formula (not even display it in the production mode) to invoke on an external service engineered for formal logical and mathematical computations?
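A minimal sketch of the "external service" half of that idea: a tiny exact-arithmetic evaluator that a model could hand a formula to. It is built on Python's standard `ast` module and only accepts numbers and the four basic operators; the function name `evaluate` and the choice of supported operators are assumptions for illustration, and a real deployment would more likely call a proper computer algebra system.

```python
import ast
import operator

# Whitelisted operators; anything else in the formula is rejected.
_BINARY = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}
_UNARY = {ast.USub: operator.neg, ast.UAdd: operator.pos}

def evaluate(expression: str):
    """Evaluate a plain arithmetic expression without using eval()."""
    def walk(node):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported expression: {ast.dump(node)}")
    return walk(ast.parse(expression, mode="eval").body)

print(evaluate("62 - (31 + 8 + 9)"))      # 14
print(evaluate("123456789 * 987654321"))  # exact integer product
```

This makes the computation itself reliable, including arbitrary-length integer multiplication, but it does nothing about the model producing the wrong formula in the first place.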

aithrowawaycomm1y ago

In many of these examples it produces the wrong formula because it misunderstands the word problem, so a computer algebra system wouldn't help - garbage in, garbage out.

The problem here is more serious than mathematics: the quantitative reasoning itself is highly unreliable.

jumploops1y ago· 1 in thread

> Overall, while o1-preview and o1-mini exhibit significantly stronger results compared to current open models—potentially due to improved training data and post-training procedures—they still share similar limitations with the open models.

tl;dr - the best open model dropped from 89.7% on GSM8K(full) to 30% on Symbolic-NoOp, while o1-preview dropped from 94.9% to 77.4%, respectively.

I think all this paper shows is that LLMs need space to "think" outside of their inference layer, (for the current architectures at least).

It's similar to the "draw a room, but DO NOT put an elephant in the corner" prompts that people were using with image models.

This is something that practitioners have been doing for awhile (via CoT, ToT, etc.) and the whole rationale behind OpenAI's newly launched o1-series "model."

There's another post that says this paper proves LLMs can't be used to build "reliable agents" -- which doesn't appear to be true when you look at o1's stellar performance here.

data_maan1y ago

Can you send a paper regarding that LLMs can build "reliable agents"?

trehalose1y ago

I see a lot of discussion about irrelevant clauses tripping up the LLMs and why that does or doesn't matter. To me, what's far more damning is this:

> Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark.

This seems like irrefutable evidence of overfitting, that in the best case scenario is epidemic among current LLMs (and in the worst case interpretation, is covering up fundamental inabilities to learn mathematical reasoning from the training data).

thenoblesunfish1y ago

Very interesting, and aligns with what I would expect in terms of the type of "thinking" LLMs do. I think that it's also the type of "thinking" that will let a student pass most school courses, except of course for the ones where the teacher has taken the time to pose test questions that aren't as amenable to pattern matching. (Hard, but I assume most readers here are familiar with leetcode style interviews and what makes questions of that kind higher or lower quality for assessing candidates)

(And yes, I know people are hard at work adding other types of thinking to work along with the pure language models)

codelion1y ago

This is surprising only to those who have not worked in formal reasoning. Yes, LLMs cannot do true logical reasoning in a formal sense; you can do better with an SMT solver. But it is also true that you can solve a lot of logical problems by just applying “reasoning steps” from the training data, especially when your training data is the entirety of written content ever produced. Both of these can be true at the same time; it is not a contradiction, just an interesting dichotomy.

dang1y ago

Related ongoing thread:

LLMs don't do formal reasoning - https://news.ycombinator.com/item?id=41812523 - Oct 2024 (70 comments)

K0balt1y ago

Trying to solve (much less explore) mathematics using probabilistic next-token prediction seems like the really long way around, especially when we have pretty good deterministic tools available for our use. I don’t know why anyone would bother doing anything besides working on the correct manipulation of tools.

Brains have various structures that have distinct architectures. I don’t see any indication that the best way forward is to try to shoehorn everything into a single computational paradigm.

It’s like trying to make a flying submarine car. It might technically be possible, but it might not be worth the trouble, and it’s unlikely to result in a vehicle that works excellently in any of its environments.

gradientsrneat1y ago

Could this be Goodhart's Law in action? AI tools like to showcase benchmarks in bar graphs to show how well they perform compared to other models.

Maybe the benchmark Qs/As snuck into training sets accidentally. Is it still Goodhart's Law if it's unintentional?

Daniel Lemire has blogged about being impressed with how well the LLM answers his CS problem questions. I was impressed too. Not sure where the line of competence lies.

eigenform1y ago

The difference is that, if we are solving a math problem together, you and I [explicitly or implicitly] can come to an agreement over the context and decide to restrict our use of language with certain rules. The utility behind our conversation [generally] rests on those rules!

An LLM is very good at recovering rules, but being good at pattern recognition is not the same thing as being good at unambiguously following rules in the appropriate context.

edit: Natural language is far from an efficient/sufficient/necessary intermediate representation for doing math, just ask any general-purpose computer. Sometimes, it's worth "putting rules in stone," and it seems unreasonable to believe that there is always an unambiguous rule for this that you can mechanically recover from a corpus of language use.

i0071y ago

LLMs are designed to carry out "associative reasoning" which captures logic based on recognition and recall of compositional patterns learned during training.

Having said that, we can still get semantically and logically idempotent output that makes sense but with lots of work outside of the LLM, which contrasts with the current hyper focus on the LLM itself as the be all and end all. It is just one component in what ought to be a larger and more involved system for reasoning.

Look at what we were able to accomplish here for Legal AI, not so mathematical logic per se but mimicking (capturing) axiomatic logic in the legal domain:

https://www.youtube.com/watch?v=_9Galw9-Z3Q

marc at sunami dot ai

jgord1y ago

I propose 'gords rule' : "any sufficiently advanced LLM will learn the laws of logic, the principles of scientific method, and Reinforcement Learning"

until that happens .. I think RL startups focused on real problems are much undervalued : https://quantblog.wordpress.com/2024/10/11/llm-hype-means-th...

gtsop1y ago

LLMs are inherently emulators of digitally imprinted artifacts of human consciousness. When people truly grasp what this means they will stop being baffled by the fact that LLM performance deteriorates when the novelty of the task increases.

EDIT: Had there been an ounce of actual true reasoning emerging in LLMs, OpenAI would have been running this thing privately 24/7 to produce new science and capture patents that would give them economic dominance. Not trying to sell tokens to us all.

uptownfunk1y ago

The very fundamental problem with LLMs is that there is no guarantee on any of the reasoning they give you without a human there to give a thumbs up. They are working on solving this (AlphaProof, lean agent etc.), but getting this to run at inference time in an optimized way is what I would call one of the millennium prize problems of AI, which will lead to a quantum leap in the path towards the singularity.

woopwoop1y ago

I'm curious about what happens with the no-op dataset if you include in the prompt that the questions may contain irrelevant information.

teleforce1y ago

In terms of usefulness and realistic implementation, mathematical reasoning is the next frontier for LLMs, not autonomous level 5 driving or AGI. Research funding and investment are much better spent on the former than on the latter, but apparently the reverse is the case.

Animats1y ago

It's an expected result.

Whatever happened with that result which found some representation of the state of a game inside an LLM? That indicated some degree of model-building. Haven't heard about that again.

bubble123451y ago

Can LLMs even do addition, with say 20+ digit numbers? Multiplication?

throwaway9182991y ago

limitations of mathematical reasoning?

They have none. Literally zero. That’s the limit. Thank you for reading my paper.

parsimo20101y ago· 32 in thread

ojosilva1y ago

LLM gets things right, when it does, due to the sheer massive information ingested during training, it can use probabilities to extract a right answer from deep in the model.

heresie-dabord1y ago

> Humans on the other hand have developed a more elaborate scheme to process, or reason [ ... ] We listen to some explanations, a YT video, a few exercises

Frequent repetition in the sociological context has been the learning technique for our species. To paraphrase Feynman, learning is transferring.

ben_w1y ago

I think the larger models are consuming on the order of 100k times as much as we do, and while they have a much broader range of knowledge, it's not 100k times as much breadth.

pishpash1y ago

Nah, human failures look equally nonsensical. You're just more attuned to use their body language or peer judgement to augment your reception. Really psychotic humans can bypass this check.

wkirby1y ago

> I do think that SOTA LLMs like GPT-4o perform about as good as high school graduates in America with average intelligence.

To me, this directly contradicts your conclusion: LLMs are mostly only capable of misleading large portions of the population.

pishpash1y ago

Would be good to put equivalent grades on LLM's then. Instead of GPT-4o, it's GPT-11th grade.

Eisenstein1y ago

This is not inherent in the LLM though. Society will adjust to it after learning some very predictable (and predicted) lessons, just like it always does.

hintymad1y ago

> I do think that SOTA LLMs like GPT-4o perform about as good as high school graduates in America with average intelligence.

MVissers1y ago

Which model? The field moves so fast it’s hard to validate statements like this without that info.

O1-preview?

ActorNightly1y ago

> I won't take a strong stance on whether or not LLMs actually do reasoning,

atleastoptimal1y ago

Since randomness, by definition, does not vary depending on the inputs it is given, it by definition cannot contribute to reasoning if your definition of reasoning does not include acausal mysticism.

growthwtf1y ago

I don't see how the latter follows from the former.

It might be a mimicry of reasoning, but I don't think that having adjustable parameters on how random they are makes it any less of one.

But the fact that they have that randomness parameter seems to me to be totally unrelated to any of the above thoughts or merits about the models having reasoning abilities.

int_19h1y ago

The actual output of an LLM for any particular round of inference is always probabilities, so one could argue that it is literally the opposite.

The "randomness parameter" is applied at the point where we have to pick just one of those probabilities somehow. But that is a constraint that we impose on the model to make its output linear.
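A minimal Python sketch of the two steps described above: the model's side ends with a probability distribution, and the "randomness parameter" (temperature) only matters at the point where one token is picked from it. The logits below are made-up numbers, and `softmax_with_temperature` and `sample_token` are illustrative names, not any particular library's API.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def sample_token(logits, temperature=1.0, rng=random):
    """Pick one index from the distribution; this is where randomness enters."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
rng = random.Random(0)    # seeded so repeated runs pick the same tokens
for temperature in (1.0, 0.1):
    picks = [sample_token(logits, temperature, rng) for _ in range(1000)]
    print(temperature, [picks.count(i) for i in range(len(logits))])
```

At the lower temperature nearly all picks land on the highest-scoring token, even though the underlying scores are unchanged.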

mewpmewp21y ago

kromem1y ago

Try the following prompt with Claude 3 Opus:

Try it on temp 1.0, try it dozens of times. Let me know when you get "big spoon" as an answer.

Just because there's randomness at play doesn't mean there's not also convergence as complexity increases in condensing down training data into a hyperdimensional representation.

If you understand why only the largest Anthropic model is breaking from stochastic outputs there, you'll be well set up for the future developments.

anonzzzies1y ago

And the mechanism in your head doesn't do this? How do you know?

kkzz991y ago

"deterministally outputting information" neither do humans.

skydhash1y ago

hintymad1y ago

> Not to disparage American school system (my country’s is worse) but it’s very much easy mode

BriggyDwiggs421y ago

debit-freak1y ago

> In other words, average Americans exhibit similar limitations on their reasoning as good LLMs.

vasilipupkin1y ago

FabHK1y ago

Are college students more likely to get it wrong when you change the numbers from the example problem (as reported here for LLMs)?

sdenton41y ago

You can absolutely psych students out by adding weird numbers to a problem, yes.

elicksaur1y ago

>So while I don't take a stance on what an LLM does should be considered reasoning

>I do think that SOTA LLMs like GPT-4o perform about as good as high school graduates in America with average intelligence

This is taking a stance.

fhe1y ago

mdp20211y ago

> Which on the one hand is a little disappointing to me in terms of the human performance but is kind of good news for LLMs

Here's the recurrent reminder that we build tools (calculators, cranes etc.) to outperform the strong, not the weak.

richerram1y ago

zeroonetwothree1y ago

This must be some bizarre definition of “smarter”.

goatlover1y ago

Smarter than people in generating text, or smarter in performing all the other things people do as they go about their lives?

lupire1y ago

Can an AI walk and chew gum at the same time?

gosub1001y ago

> They perform well on simple questions. Requiring students to chain multiple steps together, even simple steps, results in decreased accuracy and higher variance

woopwoop1y ago· 16 in thread

aithrowawaycomm1y ago

aguaviva1y ago

Indeed, and the ability to make heads or tails of slightly-slippery problems of this sort is an extremely important real-world math skill. It's not extraneous at all.

This whole attention business, they're calling it.

swatcoder1y ago

Real discourse has tons of irrelevant information for all sorts of reasons.

There are some contexts, academic or professional, where questions are posed carefully and specifically, but these are narrow contexts.

A useful general purpose assistant needs to be able to find what's relevant among what's irrelevant.

Excellence at just solving math problems that are especially well specified can be a useful domain assistant (no small win!), but is not the same thing.

woopwoop1y ago

nosianu1y ago

> Real discourse has tons of irrelevant information for all sorts of reasons.

Real discourse was not carefully crafted to test you.

In real discourse you can also go back and forth with the other person to get clarification, and errors don't matter because they are temporary on both sides.

meroes1y ago

Handling irrelevant info is taught in grade school and is a skill for the SAT, for example.

Basically any kind of model (not just LLMs/ML) has to distill out irrelevant info.

The point is having an answer that you can defend logically and most people would agree.

I say this as a RLHF’er who sees and is told to write similar questions at times.

At the end of the day, this is how the Model creators want their models to predict language. And anyone using them is in for their ride.

sottol1y ago

jfrbfbreudh1y ago

I think it’s an important result because filtering signal from noise is just as, if not more, important than forming conclusions from signal.

hggigg1y ago

That's not even the problem I encounter. They literally crap out on stupidly simple tasks. Recent ones:

1. Bing was gaslighting me into 9.11 being greater than 9.9

2. ChatGPT said that 7x7/7+7/7+7/7 was 24.

3. When expanding (x+1)^2 the output was 2x^2+2.

Regardless of any level of interpretation and irrelevant information if it can't deterministically understand correctness and the semantics of the operations in question then it's fucking useless.

What is worse in an educational context is that it is actively harmful.
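For reference, the three items above can be checked mechanically. This is a small Python sketch using only the standard library; it assumes the "x" in item 2 means multiplication and applies the usual operator precedence. `poly_mul` is an illustrative helper, not a library function.

```python
from fractions import Fraction

# 1. Decimal comparison: 9.11 is smaller than 9.9.
assert Fraction("9.11") < Fraction("9.9")

# 2. 7*7/7 + 7/7 + 7/7, evaluated exactly with the usual precedence.
value = Fraction(7) * 7 / 7 + Fraction(7, 7) + Fraction(7, 7)


# 3. Expand (x+1)^2 by multiplying coefficient lists.
def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists, lowest degree first."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ca in enumerate(a):
        for j, cb in enumerate(b):
            out[i + j] += ca * cb
    return out


x_plus_1 = [1, 1]                      # 1 + x
square = poly_mul(x_plus_1, x_plus_1)  # coefficients of (x+1)^2

print(value, square)
```

The exact value in item 2 is 9 rather than 24, and the coefficient list in item 3 is `[1, 2, 1]`, i.e. x^2 + 2x + 1 rather than 2x^2 + 2.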

MVissers1y ago

Most average humans can’t do any of these things either. Try asking people on the street. Or in an average US college student.

For deterministic calculations you obviously want to allow LLMs to use tools to do math. Just like you’d want to allow humans to use calculators.

So yeah, you shouldn’t ask LLMs to do math just like you shouldn’t ask average people to do math. They both suck at it.

mdp20211y ago

> LLMs have dramatically worse performance on basic algebra questions when you add in irrelevant information

"Attention is all you need" /

(It is part of the general problem solving process to evaluate what is relevant and what is not.)

moffkalast1y ago

Differential attention that filters out noise is all you need :)

andoando1y ago

Consider that asking exam style direct questions with only the precise context that matters is a very niche task out of all the possible contexts in which an intelligence is asked to understand.

WhitneyLand1y ago

I agree it wasn’t that convincing, moreover the variation wasn’t that dramatic for the large sota models.

Why should they write a paper about the inherent reasoning capabilities for “large” language models and then in the abstract cherrypick a number that’s from a tiny 1B parameter model?

capkutay1y ago

I agree that it's not particularly surprising that trying to trick an LLM with irrelevant text will make it perform worse.

I don't see this as a material limitation of LLMs but rather something that can be addressed at the application level to strip out irrelevant information.

wslh1y ago

bob10291y ago· 12 in thread

I'd offer a simpler explanation: Tokenization.

If you tokenize "12345 * 27271" you will get the following:

  "123", "45", " *", " ", "272", "71"

The statistical likelihood that any of these tokens predicts any of the others is completely meaningless in the context of simple arithmetic.

You can argue that this is where tool use comes in (and I would be inclined to agree), but I don't think this bodes well for "genuine logical reasoning".
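The splitting effect can be imitated with a toy chunker. The Python sketch below is not a real tokenizer and does not reproduce any model's vocabulary; `chunk_digits` is a made-up helper that cuts each run of digits into left-to-right pieces of a fixed maximum length, which is enough to show how "12345" ends up as "123" and "45", and what a per-digit scheme would produce instead.

```python
import re


def chunk_digits(text, max_len):
    """Split every digit run into left-to-right pieces of at most max_len digits."""
    pieces = []
    for match in re.finditer(r"(\d+)|(\D+)", text):
        digits, other = match.groups()
        if digits is not None:
            pieces.extend(
                digits[i:i + max_len] for i in range(0, len(digits), max_len)
            )
        else:
            pieces.append(other)  # non-digit text is kept as one piece
    return pieces


expression = "12345 * 27271"
print(chunk_digits(expression, 3))  # multi-digit pieces, as in the comment above
print(chunk_digits(expression, 1))  # one piece per digit
```

With three-digit pieces, place value is scattered across chunk boundaries that depend on the length of the number; with one piece per digit, the sequence is longer but every digit is a separate symbol.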

soulofmischief1y ago

[0] https://arxiv.org/abs/2301.05217

pfortuny1y ago

ttul1y ago

I respectfully disagree.

While tokenization certainly plays a role in how language models process input, it's simplistic to attribute the challenges in mathematical reasoning solely to tokenization.

The decline in performance as complexity increases might be due to other factors, such as:

And in any case, I think OpenAI’s o1 models are crushing it in math right now. The iterative, model-guided CoT approach seems to be able to handle very complex problems.

m3kw91y ago

andrepd1y ago

>And in any case, I think OpenAI’s o1 models are crushing it in math right now.

I challenge you to give it even a simple, but original, problem to solve.

TZubiri1y ago

Wouldn't a slight change in tokenization (say, mapping single digits to single tokens) help with this specific challenge?

wenc1y ago

Math is a bit trickier since most of the world’s math is in LaTeX, which is more of a formatting language than a syntax tree. There needs to be a conversion to MathML or something more symbolic.

Even English word tokenization has gaps today. Claude Sonnet 3.5 still fails on the question “how many r’s are there in strawberry”.

bob10291y ago

Context-specific tokenization sounds a lot like old fashioned programming.

m3kw91y ago

The LLM will know that 123 and 45 form one contiguous number, just like how humans can tell that 123, a slight pause, and then 45 is a single number.

TZubiri1y ago

But for maths, it doesn't seem appropriate.

I wonder what the effect of forcing tokenization for each separate digit would be.

soulofmischief1y ago

sva_1y ago

It won't 'see' [123, 45] though, but [7633, 2548], or rather sparse vectors that are zero at every position but the 7634th and 2549th, respectively.
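A short Python sketch of that representation, assuming 0-based token IDs (which is why ID 7633 corresponds to the 7634th position). The vocabulary size is a made-up number chosen only so that both IDs fit, and `one_hot` is an illustrative helper.

```python
VOCAB_SIZE = 10_000  # hypothetical vocabulary size, not taken from any model


def one_hot(token_id, vocab_size=VOCAB_SIZE):
    """Return a vector that is 1 at index token_id and 0 everywhere else."""
    if not 0 <= token_id < vocab_size:
        raise ValueError("token id outside the vocabulary")
    vector = [0] * vocab_size
    vector[token_id] = 1
    return vector


token_ids = [7633, 2548]  # the IDs quoted in the comment above
vectors = [one_hot(t) for t in token_ids]

# Indices are 0-based, so convert to 1-based positions for comparison.
positions = [v.index(1) + 1 for v in vectors]
print(positions)
```

Nothing in these vectors says that one token is "123" and the other "45"; any numeric meaning has to be learned from how the tokens are used.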

yk1y ago· 10 in thread

getoffmyyawn1y ago

I've found that the River Crossing puzzle is a great way to show how LLMs break down.

Ask this version, "A farmer has a spouse, chicken, cabbage, and baby with them. The farmer needs to get them all across the river in their boat. What is the best way to do it?"

In my tests the LLMs nearly always assume that the boat has a carry-restriction and they come up with wild solutions involving multiple trips.

chasd001y ago

Analemma_1y ago

jprete1y ago

layer81y ago

SonOfLilit1y ago

I've been using this as my first question to any new LLM I try and I'm quite sure nothing before GPT-4 even got close to a correct solution. Can you post a prompt that GPT-2 or 3 can solve?

andrepd1y ago

Meaning it's just a glorified Google.

romwell1y ago

...that makes up results when it can't find any

voidUpdate1y ago

I'm scared of the cows around you if they eat goats

Manabu-eo1y ago

I think their point is that cows don't eat goats, unlike wolves, and that causes the LLMs to answer it wrong.

s-macke1y ago· 9 in thread

You could argue that the issue lies in the models being in an intermediate state between pattern matching and reasoning.

[1] https://arxiv.org/html/2406.02061v1

[2] https://news.ycombinator.com/item?id=40811329

oliwary1y ago

Someone (https://x.com/colin_fraser/status/1834336440819614036) shared an example that I thought was interesting relating to their reasoning capabilities:

A man gets taken into a hospital. When the doctor sees him, he exclaims "I cannot operate on this person, he is my own son!". How is this possible?

layer81y ago

> perhaps humans also mostly reason using previous examples rather than thinking from scratch.

nosianu1y ago

> A man gets taken into a hospital. When the doctor sees him, he exclaims "I cannot operate on this person, he is my own son!". How is this possible?

> Amusingly, when someone on HN mentioned this example in the O1 thread, many of the HN commentators also misunderstood the problem

I admit I don't understand a single thing about this "problem". To me, it's just some statement.

Do I need some cultural knowledge for this?

tgv1y ago

s-macke1y ago

> perhaps humans also mostly reason using previous examples rather than thinking from scratch.

We do, but we can generalize better. When you exchange "hospital" with "medical centre" or change the sentence structure and ask humans, the statistics would not be that different.

But for LLMs, that might make a lot of difference.

apsec1121y ago

Both Claude-3.5 and o1-preview nail this problem

"Let's think through this step-by-step:

1. Alice has 3 brothers

2. Alice has 2 sisters

3. We need to find out how many sisters Alice's brother has

The key here is to realize that Alice's brothers would have the same sisters as Alice, except they would also count Alice as their sister.

So, Alice's brothers would have:

- The 2 sisters Alice has

- Plus Alice herself as a sister

Therefore, Alice's brothers have 3 sisters in total."
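The counting in the quoted answer can be written out as a tiny Python sketch. It assumes all siblings share the same parents and that Alice is female; the function name and the generated sibling labels are made up for illustration.

```python
def sisters_of_each_brother(num_brothers, num_sisters):
    """Count sisters from each brother's point of view in Alice's family."""
    # Alice counts as a sister to her brothers, in addition to her own sisters.
    girls = ["Alice"] + [f"sister_{i}" for i in range(1, num_sisters + 1)]
    boys = [f"brother_{i}" for i in range(1, num_brothers + 1)]
    return {boy: len(girls) for boy in boys}


# The version answered above: 3 brothers, 2 sisters.
print(sisters_of_each_brother(3, 2))
```

The number of brothers never enters the count, so it acts as a built-in distractor in the puzzle.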

s-macke1y ago

For the "Alice in Wonderland" paper, neither Claude-3.5 nor o1-preview was available at that time.

But I tested them as well a few weeks ago with the problem translated into German, also achieving a 100% success rate with both models.

However, when I add irrelevant information (My mother ...), Claude's success rate drops to 85%:

"My mother has a sister called Alice. Alice has 2 sisters and 1 brother. How many sisters does Alice's brother have?"

einarfd1y ago

My problem with this puzzle, is how do you know that Alice and her brothers share both parents?

Is it not correct English to call two people who share one parent, sisters, or brothers?

I guess I could be misguided by my native Norwegian where you have to preamble the word with "hell" (full), or "halv" (half), if you want to specify the number of shared parents.

s-macke1y ago

Here is the larger discussion about the Alice in Wonderland Paper on Hacker News.

https://news.ycombinator.com/item?id=40585039

dr_dshiv1y ago· 9 in thread

It seems incredibly easy to generate an enormous amount of synthetic data for math. Is that happening? Does it work?
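As a rough illustration of what template-based generation could look like, here is a Python sketch that fills one word-problem template with random names and numbers and computes the ground-truth answer alongside it. The template, the names, and the optional distractor sentence are all invented for this example; they are not taken from the paper or from any lab's training pipeline, and the sketch says nothing about whether training on such data works.

```python
import random

NAMES = ["Ava", "Ben", "Chloe", "Dev"]  # made-up names for the template
DISTRACTOR = " A few of the apples are a bit smaller than average."


def make_problem(rng, add_distractor=False):
    """Return (question text, correct answer) for one randomly filled template."""
    name = rng.choice(NAMES)
    a = rng.randint(2, 50)
    b = rng.randint(2, 50)
    c = rng.randint(1, a + b)  # never give away more than was picked
    extra = DISTRACTOR if add_distractor else ""
    question = (
        f"{name} picks {a} apples on Monday and {b} apples on Tuesday."
        f"{extra} {name} then gives away {c} apples. "
        f"How many apples does {name} have left?"
    )
    return question, a + b - c


rng = random.Random(42)  # seeded so the generated set is reproducible
for _ in range(3):
    question, answer = make_problem(rng, add_distractor=True)
    print(question, "->", answer)
```

Because the answer is computed from the same variables that fill the template, every generated item is verifiable, which is the property that makes this kind of data cheap to produce in bulk.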

ilaksh1y ago

They are taking poor performance of undersized models and claiming that proves some fundamental limitation of large models, even though their own tests show that isn't true.

foobarqux1y ago

You choose to ignore Figure 8, which shows an 18% drop when simply adding an irrelevant detail.

MacsHeadroom1y ago

Yes, this is how o1 was trained. Math and programming, because they are verifiable.

This is also why o1 is not better at English. Math skills transfer to general reasoning but not so much to creative writing.

Davidzheng1y ago

bentice1y ago

The way I see it reasoning is actually the ability of the model to design and train smaller models that can learn with very few examples.

hackinthebochs1y ago

> If they have developed reasoning very few examples should be needed.

dr_dshiv1y ago

My understanding is that, if you train these enough, it becomes likely to develop efficient compressions— which “reasoning” would be.

aithrowawaycomm1y ago

It's easy enough to generate an enormous amount of formal math problems, but utterly quixotic to generate an enormous amount of quantitative reasoning problems, which is the thing LLMs are lacking.

ninetyninenine1y ago

I don’t think so. The data is biased towards being very general.

resters1y ago· 6 in thread

islewis1y ago

I see language more as a medium for transcribing reasoning. While language certainly communicates reasoning, you can have reasoning without language, but not language without reasoning.

resters1y ago

sottol1y ago

> If trained properly, LLMs can develop advanced representations of logical systems that will surely outpace what humans can do in terms of raw reasoning.

resters1y ago

I don't disagree, however I'm optimistic because most of the current reasoning "ability" of LLMs comes from the accidental reasoning embedded in language patterns.

For example, the prompt completion: "The mouse has a unique digestive system compared to other rodents, however the sparrow" on GPT-4o is

"exhibits a highly specialized digestive system adapted for rapid processing of food, particularly seeds and insects, through structures like the crop and gizzard, which are not found in rodents."

Claude 3.5 completes it as

However the way that multi-modal LLMs handle images is inspiring as it effectively converts from the visual domain into the sequential token domain. The same could be done for symbolic systems, etc.

agentultra1y ago

I don’t see how it’s obvious that LLMs will be capable of any mathematical “reasoning.”

Maybe it’s right. But the algorithm is an algorithm. It doesn’t care what truth is. It’s generating BS essentially.

A human is doing a lot more work when performing mathematics.

It may be that LLM’s can be a useful tool in mathematical reasoning but it’s not obvious that it will ever be capable of it without a human, let alone be better than a human.

resters1y ago

dev1ycan1y ago· 5 in thread

Guess it's still gonna take a while more for the layman to realize the issues with LLMs, but it'll happen.

Workaccount21y ago

>if your methodology is flawed you aren't gonna get any more significant returns.

The problem with this statement is that predictions made about scaling 5 years ago have held true[1]. We keep adding parameters, adding compute, and the models keep getting more capable.

If they break the trend and the scaling flops, then I think a lot of air is gonna blow out of the bubble.

[1]https://arxiv.org/pdf/2001.08361

vrighter1y ago

we added a lot of parameters.

We added a LOT of data.

The resulting models have become only slightly better. And they still have all of their old problems.

dev1ycan1y ago

They are very literally asking for trillions and even nuclear powered data centers, pretty sure we've gotten to the point where it's not sustainable.

yoav_hollander1y ago

Exactly. I was assuming that by now the default answer to "LLMs sort-of do this, but not very well" should be "OK, wait a few months".

empath751y ago

The question of whether they can do it is interesting in an academic sense, but has nothing to do with whether they're useful or not. They also don't need to be true AGI to be useful.

beardyw1y ago· 5 in thread

hackinthebochs1y ago

qudat1y ago

So I would argue it's critical that LLMs know how to convert text to math and then perform those math calculations. This extends beyond just math to the underlying logic as well.

s-macke1y ago

Well, my perspective on this is as follows:

The recurrent or transformer models are Turing complete, or at least close to being Turing complete (apologies, I’m not sure of the precise terminology here).

We still don’t know what the optimal "program" looks like or what level of scaling is truly necessary. But in theory, achieving the goal of AGI with LLMs is possible.

golol1y ago

banditelol1y ago

I'm curious what kind of quick calculation do you usually use llm for?

Edited for clarity

criddell1y ago· 2 in thread

It would be interesting if this kind of work could ever be extended to show the limitations of mathematical reasoning in animals and humans.

myrmidon1y ago

I think it is a naive assumption that such a limitation even exists ("exists" in a sense that it is actually useful, by being consistent and somewhat simple to describe).

r2_pilot1y ago

singularity20011y ago· 2 in thread

riku_iki1y ago

A trained human can tell when distracted: "I am distracted and can't figure out the answer", while an LLM will confidently give you a wrong answer, which makes the whole result unreliable.

zeroonetwothree1y ago

Why? LLMs are supposedly better than humans (as many comments claim in this thread).

apsec1121y ago· 2 in thread

ilaksh1y ago

That makes the whole conclusion obviously false.

We already are reportedly getting pretty close with o1 (not o1-preview).

There are also new paradigms for machine learning and hardware in the pipeline that will continue to provide orders of magnitude performance gains and new capabilities in the next 5-10 years.

Many people still claim that "self driving cars don't exist", in so many words, even though they are deployed in multiple cities.

sottol1y ago

> Many people still claim that "self driving cars don't exist", in so many words, even though they are deployed in multiple cities.

Just two random examples from ~10 years ago (2013-2016), you can google many more of that time.

* "Ford Targets Fully Autonomous Vehicle for Ride Sharing in 2021; Invests in New Tech Companies, Doubles Silicon Valley Team" [1]

* "Disruptions: How Driverless Cars Could Reshape Cities" [2]

[1] https://media.ford.com/content/fordmedia/fna/us/en/news/2016...

[2] https://archive.nytimes.com/bits.blogs.nytimes.com/2013/07/0...

[3] https://www.gensler.com/dialogue/30/the-game-changer-for-cit...

ak_1111y ago· 1 in thread

As an outsider can anyone enlighten me how this squares with the news that models that adapt similar LLM architecture can obtain silver medal in mathematical olympiad?

lionkor1y ago

careful statistical massaging, maybe.

would you pick only winning results and only present favorable, massaged results if it got you 150+B USD of worth?

qwerty4561271y ago· 1 in thread

aithrowawaycomm1y ago

In many of these examples it produces the wrong formula because it misunderstands the word problem, so a computer algebra system wouldn't help - garbage in, garbage out.

The problem here is more serious than mathematics: the quantitative reasoning itself is highly unreliable.

jumploops1y ago· 1 in thread

tl;dr - the best open model dropped from 89.7% on GSM8K(full) to 30% on Symbolic-NoOp, while o1-preview dropped from 94.9% to 77.4%, respectively.

I think all this paper shows is that LLMs need space to "think" outside of their inference layer, (for the current architectures at least).

It's similar to the "draw a room, but DO NOT put an elephant in the corner" prompts that people were using with image models.

This is something that practitioners have been doing for awhile (via CoT, ToT, etc.) and the whole rationale behind OpenAI's newly launched o1-series "model."

There's another post that says this paper proves LLMs can't be used to build "reliable agents" -- which doesn't appear to be true when you look at o1's stellar performance here.

data_maan1y ago

Can you send a paper regarding that LLMs can build "reliable agents"?

trehalose1y ago

I see a lot of discussion about irrelevant clauses tripping up the LLMs and why that does or doesn't matter. To me, what's far more damning is this:

> Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark.

thenoblesunfish1y ago

(And yes, I know people are hard at work adding other types of thinking to work along with the pure language models)

codelion1y ago

dang1y ago

Related ongoing thread:

LLMs don't do formal reasoning - https://news.ycombinator.com/item?id=41812523 - Oct 2024 (70 comments)

K0balt1y ago

Brains have various structures that have distinct architectures. I don’t see any indication that the best way forward is to try to shoehorn everything into a single computational paradigm.

gradientsrneat1y ago

Could this be Goodhart's Law in action? AI tools like to showcase benchmarks in bar graphs to show how well they perform compared to other models.

Maybe the benchmark Qs/As snuck into training sets accidentally. Is it still Goodhart's Law if it's unintentional?

Daniel Lemire has blogged about being impressed with how well the LLM answers his CS problem questions. I was impressed too. Not sure where the line of competence lies.

eigenform1y ago

An LLM is very good at recovering rules, but being good at pattern recognition is not the same thing as being good at unambiguously following rules in the appropriate context.

i0071y ago

LLMs are designed to carry out "associative reasoning", which captures logic based on recognition and recall of compositional patterns learned during training.

Look at what we were able to accomplish here for Legal AI: not mathematical logic per se, but mimicking (capturing) axiomatic logic in the legal domain:

https://www.youtube.com/watch?v=_9Galw9-Z3Q

marc at sunami dot ai

jgord1y ago

I propose 'gord's rule': "any sufficiently advanced LLM will learn the laws of logic, the principles of the scientific method, and reinforcement learning"

Until that happens, I think RL startups focused on real problems are much undervalued: https://quantblog.wordpress.com/2024/10/11/llm-hype-means-th...

gtsop1y ago

uptownfunk1y ago

woopwoop1y ago

I'm curious about what happens with the no-op dataset if you include in the prompt that the questions may contain irrelevant information.
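The comparison I have in mind would look something like the sketch below. The hint wording and the example question are my own assumptions, not from the paper, and the actual model call is left out.

```python
# Hypothetical hint text; the exact wording is an assumption, not from the paper.
NOOP_HINT = ("Note: the question may contain details that are irrelevant "
             "to the answer. Ignore them.")

def build_prompt(question, warn_about_noise):
    """Return the question, optionally prefixed with the irrelevant-info hint."""
    if warn_about_noise:
        return f"{NOOP_HINT}\n\n{question}"
    return question

# Made-up question with a no-op clause (last year's price) to show both conditions.
question = ("Liam buys 3 notebooks at $4 each. Last year notebooks were 10% cheaper. "
            "How much does Liam pay?")
baseline = build_prompt(question, warn_about_noise=False)
hinted = build_prompt(question, warn_about_noise=True)
```

Running the same no-op questions through both prompts and comparing accuracy would show how much of the drop a simple warning recovers.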

teleforce1y ago

Animats1y ago

It's an expected result.

Whatever happened with that result which found some representation of the state of a game inside an LLM? That indicated some degree of model-building. Haven't heard about that again.

bubble123451y ago

Can LLMs even do addition with, say, 20+ digit numbers? Multiplication?
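This is easy to test yourself, since Python integers are arbitrary precision and give you the ground truth for free. A minimal sketch, where the function names are mine and the step that sends the prompt to a model is left out:

```python
import random

def make_addition_case(digits, rng):
    """Return (prompt, expected) for adding two numbers with `digits` digits each."""
    low, high = 10 ** (digits - 1), 10 ** digits - 1
    x, y = rng.randint(low, high), rng.randint(low, high)
    return f"What is {x} + {y}?", str(x + y)

def is_correct(reply, expected):
    """Lenient check: drop commas and spaces, then compare digit strings."""
    cleaned = reply.replace(",", "").replace(" ", "").strip()
    return cleaned == expected

rng = random.Random(1)  # fixed seed so the cases are reproducible
cases = [make_addition_case(20, rng) for _ in range(3)]
for prompt, expected in cases:
    print(prompt, "expected:", expected)
```

Swapping `x + y` for `x * y` gives the multiplication version.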

throwaway9182991y ago

limitations of mathematical reasoning?

They have none. Literally zero. That’s the limit. Thank you for reading my paper.
