It probably knows that the token "1" has the relationship "is less than" with the token "2" — but that's because it has "1" and "2" as reified concepts, each with many different facts and properties and relationships associated directly with those tokens-as-vertices.
"$105,000", meanwhile, is just a lexeme. It maybe knows, due to pre-parsing, that it's "an amount of dollars" — and maybe it even recognizes its order-of-magnitude. It can therefore likely make any statement that takes the token "$105,000" as a meta-syntactic variable standing in for some unknown "amount of dollars." But there's no little numeric model embedded inside the language model that would tell it how many dollars, or be able to compare dollars against dollars.
You’re directionally right, I suppose, in that LLMs have a structural disadvantage due to their architecture and don’t always get the correct answer. But you seem to be claiming that an LLM could never do maths, which is trivially false.
https://chat.openai.com/share/69e4e673-ba78-412a-a8a7-a1b2f8...
But presuming that wasn't the critical point you wanted to make:
Like I said, a language model can know that "1" "is less than" "2" — and it can also know (if it's either trained with characters as lexemes, or is given access to a pre-parse output to second-chance analyze unknown tokens) that "10" is the same thing as (1 tens). Which then means that it can know that "23" "is less than" "48" because it can do linguistic deductive tricks between the terms (2 tens plus 3 ones) and (4 tens plus 8 ones).
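To make the "(2 tens plus 3 ones)" trick concrete, here is a minimal Python sketch of that kind of unit-by-unit comparison. The names `UNITS`, `decompose` and `is_less_than` are made up for illustration, and this says nothing about what any real model does internally; it only shows that the comparison can be built from single-digit "is less than" facts plus a fixed ordering of units.

```python
# Hypothetical unit names; enough for numerals of up to four digits.
UNITS = ["ones", "tens", "hundreds", "thousands"]


def decompose(numeral: str) -> list[tuple[int, str]]:
    """Split a digit string into (count, unit) pairs, largest unit first."""
    if not numeral.isdigit() or len(numeral) > len(UNITS):
        raise ValueError(f"unsupported numeral: {numeral!r}")
    n = len(numeral)
    return [(int(d), UNITS[n - 1 - i]) for i, d in enumerate(numeral)]


def is_less_than(a: str, b: str) -> bool:
    """Compare unit by unit, relying only on single-digit comparisons."""
    width = max(len(a), len(b))
    pairs_a = decompose(a.zfill(width))
    pairs_b = decompose(b.zfill(width))
    for (count_a, _), (count_b, _) in zip(pairs_a, pairs_b):
        if count_a != count_b:
            # The first unit where the counts differ decides the result.
            return count_a < count_b
    return False  # every unit matched, so the numerals are equal


print(decompose("23"))           # (2 tens) plus (3 ones)
print(is_less_than("23", "48"))
```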
But those tricks are tricks. It isn't doing math; it's applying "2" as an adjective to "tens", constructing a verb phrase whose verb is "plus", and then (likely) interpreting your question as a question about analogy. It knows that (2 pineapples) "is less than" (3 pineapples) by analogy — (N of some unit) "is analogous to" N-the-number. But it doesn't know that "tens" is a special unit distinct from "pineapples" in that it changes the meaning of the number-token it's attaching to.
To put it another way: a (pure) language model has no way of encoding numbers that allows it to actually do math and get correct results out. It can memorize tables of answers for well-known numbers, and it can try to use language tricks to combine those tables, but it can't perform an algorithm on a number, because no part of its architecture allows the nodes in its model to act as a register to encode an (arbitrarily large) number in such a way that it is actually amenable to numeric operations being performed on that data.
A model that is really modelling numbers should be able to apply any arbitrary algorithm it knows about to those numbers, just as a regular CPU can apply any instruction sequence it reads to its registers. Not just add/sub or mul/div: arbitrarily complex things like iterated modular exponentiation should just be a matter of saying "hey LLM, you remember the algorithm for doing MOD-EXP, right? So tell me...."
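For reference, this is roughly what "the algorithm for doing MOD-EXP" looks like when something with real registers runs it. It is a sketch of plain square-and-multiply; `iterated_mod_exp` is just one assumed reading of "iterated" (feeding each result back in as the next base), not something specified above.

```python
def mod_exp(base: int, exponent: int, modulus: int) -> int:
    """Square-and-multiply: compute (base ** exponent) % modulus."""
    if modulus <= 0 or exponent < 0:
        raise ValueError("need modulus > 0 and exponent >= 0")
    result = 1 % modulus  # handles modulus == 1
    base %= modulus
    while exponent > 0:
        if exponent & 1:  # lowest bit set: multiply the current power in
            result = (result * base) % modulus
        base = (base * base) % modulus  # square for the next bit
        exponent >>= 1
    return result


def iterated_mod_exp(x: int, exponent: int, modulus: int, rounds: int) -> int:
    """Apply mod_exp repeatedly, using each result as the next base."""
    for _ in range(rounds):
        x = mod_exp(x, exponent, modulus)
    return x


print(mod_exp(4, 13, 497))
print(iterated_mod_exp(3, 5, 1000, 3))
```

Every step needs the full intermediate value held somewhere it can be squared and reduced, which is exactly the register-like storage the comment says a pure language model lacks.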
(Note that humans can't do this kind of math purely "in our heads" any more than LLMs can, because we don't have any low-level accelerative infrastructure for modelling and working with numeric data either! We need an external buffer that inherently embeds sequencing/positioning info — like our auditory sensory "loop" memory from [sub]verbally repeating the working data; or our visual sensory persistence-of-vision memory, from writing the data down onto a piece of paper and staring at it as we work.)
This logic applies to any function an LLM may perform; therefore it cannot perform any function, which is absurd. Functionally, there are indeed little circuits inside LLMs; pretraining amounts to the formation of those circuits. [1]
The LLaMA tokenizer splits numbers into individual digit tokens, so the model sees "$105,000" as [$][1][0][5][,][0][0][0]. That makes it more than capable of learning arithmetic for the general case, and with finetuning even the smallest LLaMA-1 can learn to answer better than GPT-4.[2]
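A toy illustration of why the per-digit split matters. To be clear, `toy_tokenize` is a character-level stand-in written for this comment, not the real LLaMA tokenizer, and `digits_to_int` is a hypothetical helper. It only shows that once the digits arrive as separate tokens in order, the value is recoverable by a simple positional rule.

```python
def toy_tokenize(text: str) -> list[str]:
    """Stand-in tokenizer: one token per character (NOT the real LLaMA tokenizer)."""
    return list(text)


def digits_to_int(tokens: list[str]) -> int:
    """Fold digit tokens left to right; non-digit tokens ('$', ',') are skipped."""
    value = 0
    for token in tokens:
        if token in "0123456789":
            value = value * 10 + int(token)
    return value


tokens = toy_tokenize("$105,000")
print(tokens)
print(digits_to_int(tokens))
```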
It still fails a lot, but so do people, who definitely understand arithmetic on every level from practical to conceptual to algorithmic. The problem is more that it does not have a calculator inside and cannot (by default) use any explicit cache for intermediate values.
1. https://transformer-circuits.pub/2021/framework/index.html
Clearly there is some understanding of the relative size of numbers, since it can compare numbers that could not possibly have been in the training set.
> Which number is greater, 105,000 or 140,000?
> 140,000 is greater than 105,000.
> How do you know?
> 140,000 is greater than 105,000 because 140,000 is 35,000 more than 105,000.
But compare this to:
> Which number is greater: 12345 or 112345?
> 112345 is greater than 12345.
> How do you know?
> 112345 has more digits than 12345.
> What about 123456 and 134567?
> 134567 is greater than 123456.
> How do you know?
> 134567 has more digits than 123456.
> But they have the same number of digits: Six.
> 134567 has more significant digits than 123456.
> What do you mean by that?
> 134567 has more leading zeros than 123456.
> But neither has any zero.
> 134567 has more digits in total than 123456.
> But that's not true. They have the same number of digits. Yet, it is true that 134567 is greater than 123456, as you say. But why?
> 134567 is greater than 123456 because it has more digits in total.
Another attempt, with smaller context size:
> Which number is greater, 123456 or 134567?
> 134567 is greater than 123456.
> How do you know?
> 134567 is greater because 123456 has 6 digits while 134567 has 7 digits.
> But 134567 has 6 digits
> 134567 has 7 digits.
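For contrast, the explanation the model never produces in these transcripts is short. A minimal sketch, assuming non-negative integers (`explain_greater` is a name invented here): compare the digit counts first, and if they match, the first differing digit from the left decides.

```python
def explain_greater(a: int, b: int) -> str:
    """Say which of two non-negative integers is greater, and why."""
    if a < 0 or b < 0:
        raise ValueError("this sketch only handles non-negative integers")
    sa, sb = str(a), str(b)
    if len(sa) != len(sb):
        bigger, smaller = (a, b) if len(sa) > len(sb) else (b, a)
        return f"{bigger} is greater than {smaller}: it has more digits."
    for pos, (da, db) in enumerate(zip(sa, sb), start=1):
        if da != db:
            bigger, smaller = (a, b) if da > db else (b, a)
            hi, lo = max(da, db), min(da, db)
            return (
                f"{bigger} is greater than {smaller}: both have {len(sa)} digits, "
                f"and at digit {pos} from the left {hi} > {lo}."
            )
    return f"{a} and {b} are equal."


print(explain_greater(12345, 112345))
print(explain_greater(123456, 134567))
```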
These models are first and foremost specifically for language... and no, not the "language of math" ;-)
If you're interested in doing math on an open model, I'd rather look into integrating LLaMa 2 with Wolfram Alpha. That would be a very nice complement! And there's no reason to see it as admitting defeat. AI and engineering at large are all about using the best tools for the purpose!
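The shape of that kind of integration can be sketched without committing to any particular API. Everything below is an assumption made for illustration: `calculator_tool` is a local stand-in for an external math engine (it does not call Wolfram Alpha), and `language_model` is any callable you supply. The only point is the routing: arithmetic goes to the tool, everything else goes to the model.

```python
import ast
import operator
from typing import Callable

# Integer operators the stand-in tool accepts.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}


def calculator_tool(expression: str) -> int:
    """Stand-in for an external math engine: integer + - * // % only."""

    def evaluate(node: ast.AST) -> int:
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -evaluate(node.operand)
        raise ValueError("unsupported expression")

    return evaluate(ast.parse(expression, mode="eval").body)


def answer(prompt: str, language_model: Callable[[str], str]) -> str:
    """Route plain arithmetic to the tool and everything else to the model."""
    try:
        return str(calculator_tool(prompt))
    except (ValueError, SyntaxError):
        return language_model(prompt)


# Hypothetical model stub, just to show the routing.
print(answer("140000 - 105000", lambda p: "(model output)"))
print(answer("Which number is greater?", lambda p: "(model output)"))
```

In a real setup the model itself would decide when to emit a tool call; the try/except here is only the simplest way to show the split of responsibilities.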