Even if strawberry is decomposed as "straw-berry", the required logic to calculate 1+2 seems perfectly within reach.
Also, the LLM could associate a sequence of separate characters with each token. Most LLMs can spell out words perfectly fine.
Am I missing something?
For example, according to https://platform.openai.com/tokenizer, "strawberry" would be tokenized by the GPT-4o tokenizer as "st" "raw" "berry" (tokens don't have to make sense because they are based on byte-pair encoding, which boils down to n-gram frequency statistics, i.e. it doesn't use morphology, syllables, semantics or anything like that).
Those tokens are then converted to integer IDs using a dictionary, say maybe "st" is token ID 4463, "raw" is 2168 and "berry" is 487 (made up numbers).
Then when you give the model the word "strawberry", it is tokenized and the input the LLM receives is [4463, 2168, 487]. Nothing else. That's the kind of input it always gets (also during training). So the model has no way to know how those IDs map to characters.
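To make that concrete, here is a rough sketch of the pipeline. The vocabulary and IDs are the made-up numbers from above, not real GPT-4o values, and `toy_encode` is a hypothetical greedy longest-match helper, not actual byte-pair encoding:

```python
# Made-up vocabulary: illustrative IDs only, not real GPT-4o token IDs.
TOY_VOCAB = {"st": 4463, "raw": 2168, "berry": 487}


def toy_encode(text, vocab):
    """Greedy longest-prefix match against a toy vocabulary (NOT real BPE)."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining candidate first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i:]!r}")
    return ids


ids = toy_encode("strawberry", TOY_VOCAB)
print(ids)  # [4463, 2168, 487]
```

The list of integers is all that gets passed on. The strings "st", "raw" and "berry" only exist on the tokenizer side of the boundary.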
As some other comments in the thread are saying, it's actually somewhat impressive that LLMs can get character counts right at least sometimes, but this is probably just because they get the answer from the training set. If the training set contains a website where some human wrote "the word strawberry has 3 r's", the model could use that to get the question right. Just like if you ask it what is the capital of France, it will know the answer because many websites say that it's Paris. Maybe, just maybe, if the model has both "the word straw has 1 r" and "the word berry has 2 r's" in the training set, it might be able to add them up and give the right answer for "strawberry" because it notices that it's being asked about [4463, 2168, 487] and it knows about [4463, 2168] and [487]. I'm not sure, but it's at least plausible that a good LLM could do that. But there is no way it can count characters in tokens, it just doesn't see them.
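As a toy illustration of that "add up what it already knows" idea (purely a sketch of the speculation, not a claim about what happens inside a real model), assume the only facts available are counts attached to ID sequences. `KNOWN_R_COUNTS` and `count_from_known_parts` are hypothetical names, and the IDs are still made up:

```python
# Hypothetical "memorized facts": r-counts keyed by token-ID sequences.
KNOWN_R_COUNTS = {
    (4463, 2168): 1,  # "the word straw has 1 r"
    (487,): 2,        # "the word berry has 2 r's"
}


def count_from_known_parts(ids, known):
    """Split ids into known sub-sequences and sum their counts.

    Returns None when no such split exists. Characters are never inspected.
    """
    ids = tuple(ids)
    if not ids:
        return 0
    # Prefer the longest known prefix, then recurse on the remainder.
    for k in range(len(ids), 0, -1):
        head = ids[:k]
        if head in known:
            rest = count_from_known_parts(ids[k:], known)
            if rest is not None:
                return known[head] + rest
    return None


print(count_from_known_parts([4463, 2168, 487], KNOWN_R_COUNTS))  # 3
```

Note that the function never looks at a single letter. It only works when every piece of the ID sequence is already covered by a stored fact, and it returns None otherwise.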
If I'm missing something and you have a source for the claim that character information is present in the input after tokenization, please provide it. I have never implemented an LLM or fiddled with them at a low level so I might be missing some detail, but from everything I have read, I'm pretty sure it doesn't work that way.
> But there is no way it can count characters in tokens, it just doesn't see them.
If that is the case, then how can most LLMs (tested with ChatGPT and Llama 3) spell out words correctly?
I don't understand how most LLMs can spell out words though, nor do I understand what is causing the failure to count characters in words. I was not convinced by the comment I was responding to.
You should ask why it is that any of those tasks work, rather than ask why counting letters doesn't work.
Also, LLMs screw up many of those tasks more than you'd expect. I don't trust LLMs with any kind of numeracy whatsoever.