I then looked it up and they had each copy/pasted the same Stack Overflow answer.
Furthermore, the answer was extremely wrong: the language I used was superficially similar to the source material, but the programming concepts were entirely different.
What this tells me is that there is clearly no “reasoning” happening whatsoever with either model, despite what the marketing claims.
Not true. You yourself have failed at reasoning here.
The problem with your logic is that you failed to identify the instances where LLMs have succeeded at reasoning. If LLMs both fail and succeed, it just means that LLMs are capable of reasoning and also capable of being utterly wrong.
It's almost cliche at this point. Tons of people see the LLM fail, ignore the successes, and then openly claim from a couple of anecdotal examples that LLMs can't reason, period.
Like, how is that even logical? You have contradictory evidence, therefore the LLM must be capable of BOTH failing and succeeding at reasoning. That's the most logical answer.
Apple’s recent research summarized here [0] is worth a read. In short, they argue that what LLMs are doing is more akin to advanced pattern recognition than reasoning in the way we typically understand reasoning.
By way of analogy, memorizing mathematical facts and then correctly recalling them does not imply that the person actually understands how to arrive at the answer. This is why “show your work” is a critical aspect of proving competence in an educational environment.
An LLM providing useful/correct results only proves that it’s good at surfacing relevant information based on a given prompt. The fact that it’s trivial to cause bad results by making minor but irrelevant changes to a prompt points to something other than a truly reasoned response, i.e. a reasoning machine would not get tripped up so easily.
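To make the "minor but irrelevant changes" point concrete, here is a minimal sketch of how one might check for it. Everything in it is hypothetical: `brittle_model` is a toy stand-in I made up (it just adds up every number it sees in the prompt), not a real LLM and not the method from the Apple paper, and `make_variants` / `consistency` are illustrative names only.

```python
import re
from typing import Callable, List


def make_variants(question: str, distractors: List[str]) -> List[str]:
    """Return the original question plus one copy per irrelevant sentence,
    with that sentence placed in front of the question."""
    return [question] + [f"{d} {question}" for d in distractors]


def consistency(model: Callable[[str], str], variants: List[str]) -> float:
    """Fraction of variants whose answer matches the answer given
    to the first (unperturbed) variant."""
    baseline = model(variants[0])
    matches = sum(1 for v in variants if model(v) == baseline)
    return matches / len(variants)


def brittle_model(prompt: str) -> str:
    """Hypothetical toy 'pattern matcher': sums every integer in the prompt.
    It stands in for a system that keys on surface features."""
    return str(sum(int(n) for n in re.findall(r"\d+", prompt)))


# Illustrative data, invented for this sketch.
QUESTION = "Ann picks 3 apples and then 4 more. How many apples does Ann have?"
DISTRACTORS = [
    "It is a sunny day.",
    "Her brother is 9 years old.",  # irrelevant, but contains a number
    "Five of the apples are slightly smaller than average.",
]

VARIANTS = make_variants(QUESTION, DISTRACTORS)
SCORE = consistency(brittle_model, VARIANTS)
```

With this toy model the score comes out at 0.75, because the sentence about the brother's age changes the answer even though it has nothing to do with the apples. Something that actually reasons about the problem should score 1.0 on a check like this; swapping in a call to a real model in place of `brittle_model` is left as an assumption.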
It’s bloody obvious that when I classify success I mean that the LLM is delivering a correct and unique answer for a novel prompt that doesn’t exist in the original training set. No need to go over the same tired analogies, regurgitated over and over again, about LLMs reusing memorized answers. It’s a stale point of view. The overall argument has progressed further than that, and we now need more complicated analysis of what’s going on with LLMs.
Sources: https://typeset.io/papers/llmsense-harnessing-llms-for-high-...
https://typeset.io/papers/call-me-when-necessary-llms-can-ef...
And these two are just from a random Google search.
I can find dozens and dozens of papers illustrating failures and successes of LLMs, which further supports my original point: LLMs both succeed and fail at reasoning.
The main problem right now is that we don’t really understand how LLMs work internally. Everyone who claims to know that LLMs can’t reason is making a huge, irrational leap: not only does that conclusion contradict actual evidence, but they don’t even know how LLMs work, because nobody does.
We only know how LLMs work at a high level, and we only understand these things via the analogy of a best-fit curve through a series of data points. Below this abstraction we don’t understand what’s going on.
The evidence offered is one instance of the LLM parroting training data, while the contradicting evidence, where the LLM created novel answers to novel prompts out of thin air, is completely ignored.
>Observations trump claims.
No. The same irrational hallucinations that plague LLMs are plaguing human reasoning and trumping rational thinking.
The condition of “some people are bad at thing” does not equal “computer better at thing than people”, but I see this argument all the time in LLM/AI discourse.
It could be said not as well as the ones that don't need SO.
The most interesting thing about LLMs is probably how much relational information turns out to be encoded in large bodies of our writing, in ways that fancy statistical methods can access. LLMs aren’t thinking, or even in the same ballpark as thinking.