0 points · smokel · 1y ago · 0 comments

This reasoning is interesting, but what is stopping an LLM from simply knowing the number of r's _inside_ one token?

Even if strawberry is decomposed as "straw-berry", the required logic to calculate 1+2 seems perfectly within reach.

Also, the LLM could associate a sequence of separate characters to each token. Most LLMs can spell out words perfectly fine.

Am I missing something?

10 comments · 4 top-level

Al-Khwarizmi · 1y ago · 4 in thread

The problem is not the addition; it is that the LLM has no way to know how many r's a token might have, because the LLM receives each token as an atomic entity.

For example, according to https://platform.openai.com/tokenizer, "strawberry" would be tokenized by the GPT-4o tokenizer as "st" "raw" "berry" (tokens don't have to make sense because they are based on byte-pair encoding, which boils down to n-gram frequency statistics, i.e. it doesn't use morphology, syllables, semantics or anything like that).

Those tokens are then converted to integer IDs using a dictionary, say maybe "st" is token ID 4463, "raw" is 2168 and "berry" is 487 (made up numbers).

Then when you give the model the word "strawberry", it is tokenized and the input the LLM receives is [4463, 2168, 487]. Nothing else. That's the kind of input it always gets (also during training). So the model has no way to know how those IDs map to characters.
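To make the pipeline described above concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the three-entry vocabulary, the made-up IDs from this comment, and the hypothetical `tokenize` helper, which uses greedy longest-match lookup as a stand-in for real byte-pair encoding (a real BPE tokenizer applies learned merge rules instead).

```python
# Toy vocabulary using the made-up IDs from the comment (not real GPT-4o IDs).
VOCAB = {"st": 4463, "raw": 2168, "berry": 487}


def tokenize(text: str) -> list[int]:
    """Greedy longest-match split into known tokens (a stand-in for real BPE)."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate substring first, then shrink it.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos = end
                break
        else:
            # No vocabulary entry starts at this position.
            raise ValueError(f"no token covers {text[pos:]!r}")
    return ids


# The model would receive only this list of integers, not the characters.
print(tokenize("strawberry"))  # [4463, 2168, 487]
```

The point of the sketch is the last line: after tokenization, the characters themselves are no longer part of what gets passed on.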

As some other comments in the thread are saying, it's actually somewhat impressive that LLMs can get character counts right at least sometimes, but this is probably just because they get the answer from the training set. If the training set contains a website where some human wrote "the word strawberry has 3 r's", the model could use that to get the question right, just as it knows that the capital of France is Paris because many websites say so. Maybe, just maybe, if the model has both "the word straw has 1 r" and "the word berry has 2 r's" in the training set, it might be able to add them up and give the right answer for "strawberry", because it notices that it's being asked about [4463, 2168, 487] and it knows about [4463, 2168] and [487]. I'm not sure, but it's at least plausible that a good LLM could do that. But there is no way it can count characters in tokens; it just doesn't see them.

psb217 · 1y ago

Tokenization does not remove information from the input[1]. All the information required for character counting is still present in the input following tokenization. The reasons you give for why counting characters is hard could be applied to essentially all other forms of question answering. I.e., to answer questions of type X in general, the LLM will have to generalize from questions of type X in the training corpus to questions of type X with novel surface forms which it sees at test time.

[1] Tokenizers can remove information if designed to do so, but they don't in these simple scenarios.
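One way to read the "no information is removed" claim is that the token-to-text mapping is invertible. A minimal Python sketch, assuming a hypothetical three-entry vocabulary with made-up IDs (`VOCAB`, `ID_TO_TEXT`, and `decode` are illustrative names, not any real tokenizer's API):

```python
# Hypothetical toy vocabulary; the IDs are made up for illustration.
VOCAB = {"st": 4463, "raw": 2168, "berry": 487}

# Inverting the dictionary recovers the text of each token from its ID.
ID_TO_TEXT = {token_id: text for text, token_id in VOCAB.items()}


def decode(ids: list[int]) -> str:
    """Concatenate the text of each token ID to rebuild the original string."""
    return "".join(ID_TO_TEXT[i] for i in ids)


word = decode([4463, 2168, 487])
print(word)             # strawberry
print(word.count("r"))  # 3
```

The sketch only shows that the IDs determine the characters for anyone holding the vocabulary table. Whether a model, which is never handed that table explicitly, learns an equivalent mapping from its training data is the question being debated in this thread.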

Al-Khwarizmi · 1y ago

As far as I know, that's not the case. The tokenizer takes a bunch of characters, like "berry", identifies it as a token, and what the LLM gets is the token ID. It doesn't have access to the information about which letters that token is composed of. Here is an explanation by OpenAI themselves: https://help.openai.com/en/articles/4936856-what-are-tokens-... - as you can see, "Models take the prompt, convert the input into a list of tokens, processes the prompt, and convert the predicted tokens back to the words we see in the response". And the tokens are basically IDs, without any internal structure - there are examples there.

If I'm missing something and you have a source for the claim that character information is present in the input after tokenization, please provide it. I have never implemented an LLM or fiddled with them at low level so I might be missing some detail, but from everything I have read, I'm pretty sure it doesn't work that way.

smokel (OP) · 1y ago

Thank you for taking the time to write this response. Unfortunately, even though I agree that tokenization makes it pretty hard for the LLM to count characters, I'm still not convinced that it is a fundamental problem for doing so. I think the lack of (or limited amount of) symbolic processing is an even more important factor.

> But there is no way it can count characters in tokens, it just doesn't see them.

If that is the case, then how can most LLMs (tested with ChatGPT and Llama 3) spell out words correctly?

saalweachter · 1y ago

Might that also be the answer to why it says "2"? There are probably sources of people saying there are two R's in "berry", but no one bothers to say there is 1 R in "raw"?

ClassyJacket · 1y ago · 1 in thread

It doesn't see "straw" or "berry". It sees a vector which happens to represent the word strawberry and is translated from and to English on the way in and out. It never sees the letters, 'strawberry' is represented by a number, or group of numbers. Try to count the Rs in "21009873628" - you can't.

smokel (OP) · 1y ago

I'm aware of this. The network could, and apparently does, associate single characters with words. It can associate "red" with "rose", and might associate "r" with "straw", and it might even associate some kind of embedding of "two r's" with "berry".
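To illustrate the kind of association I mean, here is a rough Python sketch. It is only a picture of the hypothesis, not a claim about how any real network stores this: the token IDs are made up, and `R_COUNT_BY_TOKEN` and `count_r` are hypothetical names for facts a model might have absorbed from training text.

```python
# Hypothetical per-token facts a model might have picked up from training text.
# Keys are made-up token IDs; values are how many times "r" occurs in the token.
R_COUNT_BY_TOKEN = {
    4463: 0,  # "st"
    2168: 1,  # "raw"
    487: 2,   # "berry"
}


def count_r(ids: list[int]) -> int:
    """Sum the memorized per-token counts; no characters are ever inspected."""
    return sum(R_COUNT_BY_TOKEN[i] for i in ids)


print(count_r([4463, 2168, 487]))  # 3
```

If associations like these exist, the remaining work is a small addition over token-level facts, which is why I suspect the failure has more to do with limited symbolic processing than with tokenization alone.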

azulster · 1y ago · 1 in thread

yes, you are missing that the tokens aren't words, they are 2-3 letter groups, or any number of arbitrary sizes depending on the model

smokel (OP) · 1y ago

Nope, I'm not missing that particular fact. I'm aware that sentences (and words) are split into tokens, which are vectors.

I don't understand how most LLMs can spell out words though, nor do I understand what is causing the failure to count characters in words. I was not convinced by the comment I was responding to.

Der_Einzige · 1y ago

The fact that any of those tasks at all work so well despite tokenization is quite remarkable indeed.

You should ask why it is that any of those tasks work, rather than ask why counting letters doesn't work.

Also, LLMs screw up many of those tasks more than you'd expect. I don't trust LLMs with any kind of numeracy whatsoever.
