Oh interesting, does that mean languages other than English won't be paying such a large penalty in terms of token counts?
With previous tokenizers, there was a notable increase in the number of tokens needed to represent non-English sentences: https://simonwillison.net/2023/Jun/8/gpt-tokenizers/
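If you want to check this yourself, here's a rough sketch using OpenAI's tiktoken library (assuming it's installed, and assuming o200k_base is the newer encoding in question versus the older cl100k_base); the sample sentences are arbitrary and only meant to illustrate the comparison.

```python
# Compare token counts per language across two tiktoken encodings.
# pip install tiktoken
import tiktoken

# Illustrative sentences; swap in your own text to measure real prompts.
sentences = {
    "English": "The quick brown fox jumps over the lazy dog.",
    "Spanish": "El rápido zorro marrón salta sobre el perro perezoso.",
    "Japanese": "素早い茶色の狐がのろまな犬を飛び越える。",
}

for name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    print(name)
    for lang, text in sentences.items():
        # len(enc.encode(text)) is the number of tokens the model would see
        print(f"  {lang}: {len(enc.encode(text))} tokens")
```

Running something like that against the sentences you care about should show how much (if at all) the non-English penalty shrinks with the newer encoding.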