I guess this extends to numbers split across multiple tokens too (especially in the somewhat odd way the OpenAI tokenizer does it). The model has to work really hard to learn what a given sequence of number chunks means (e.g. chunks '123' '45' vs '123' '4'). It somehow needs to realize that the embedding for '4' represents a single-digit number, while the embedding for '45' represents a two-digit number, and this then correspondingly changes the meaning of the preceding '123' token!
It would have made it easier for the model to grok numbers if, similar to the proposed alternative, 1234 were tokenized as '1000' '200' '30' '4' for powers of 10 up to some reasonable limit (then maybe '1^' '2^' after this reasonable limit). This would let the model easily grok human-sized numbers while having to work harder to grok, say, 20-digit ones, just the same as we do. Some early curriculum training, while not necessary, could then help it quickly learn which embeddings represent numbers that are d * 10^1 vs d * 10^2, etc.
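For concreteness, here's a minimal sketch of what that place-value tokenization could look like (the function name and the zero-digit handling are my assumptions, not part of any real tokenizer):

```python
def place_value_tokenize(n: int) -> list[str]:
    """Hypothetical tokenizer: split a number into tokens that each
    carry their own magnitude, e.g. 1234 -> ['1000', '200', '30', '4'].
    Zero digits contribute no token (1204 -> ['1000', '200', '4']),
    since their magnitude is already implied by the other tokens."""
    digits = str(n)
    tokens = []
    for i, d in enumerate(digits):
        if d != '0':
            power = len(digits) - 1 - i  # place value of this digit
            tokens.append(d + '0' * power)
    return tokens if tokens else ['0']

# place_value_tokenize(1234) -> ['1000', '200', '30', '4']
# place_value_tokenize(1204) -> ['1000', '200', '4']
```

Each token's embedding then unambiguously encodes both a digit and its magnitude, so the meaning of '1000' never depends on what tokens follow it, unlike '123' in '123' '45' vs '123' '4'.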