Consider the input string " grabbed".
If we wanted to map this string to tokens by greedily going from left to right and choosing tokens from the vocabulary with the strategy of minimizing the number of tokens, our algorithm would be very simple. We would end up with the following tokenization: [17229, 2580] == [" grab", "bed"]
Surprisingly, the LLaMA tokenizer does not work this way. It actually finds a "worse" tokenization for this input string: [2646, 1327, 287] == [" gra", "bb", "ed"]
The tokenizer arrives at this 3 token output by applying "merges" in a priority order. For example, this is a merge: [" g", "r"] -> " gr". The trained data contains tens of thousands of these merges. When we apply the merges in the priority order, we end up with 3 tokens.
Now you might be thinking, that's easy, we'll just iterate the list of merges and see if any of them apply. Only problem with that approach is that applying a merge can open up a new opportunity to merge something else that wasn't possible before. This right here is the key thing that makes this problem complicated. We can solve this problem by iterating all possible merges from the beginning after every time we apply a merge. This would produce the correct solution. Only problem is: our algorithm is now very slow and takes minutes to run...