undefined | Better HN

0 pointskouteiheika2mo ago0 comments

> This is almost certainly wrong.

So how would you explain the increase in token usage, considering the fact that conventionally tokenizers are trained to minimize the token usage within a given vocabulary budget?

> Putting an inductive bias in your tokenizer seems just a terrible idea.

You're already effectively doing this by the sheer fact of using a BPE tokenizer, and especially with modern BPE-based LLM tokenizers[1]. I agree trying to bake this manually in a tokenizer is most likely not a good idea, but I could see a world where you could build a better tokenizer training algorithm which would be able to better take the natural morphology of the underlying text into account.

[1] Example from Qwen3.6 tokenizer:

    "pretokenizers": [
      {
        "type": "Split",
        "pattern": {
          "Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?[\\p{L}\\p{M}]+|\\p{N}| ?[^\\s\\p{L}\\p{M}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
        },
        "behavior": "Isolated",
        "invert": false
      }
    ]
  },

"pretokenizers": [ { "type": "Split", "pattern": { "Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?[\\p{L}\\p{M}]+|\\p{N}| ?[^\\s\\p{L}\\p{M}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" }, "behavior": "Isolated", "invert": false } ] },

0 comments

2 comments · 1 top-level

nl2mo ago· 1 in thread

> So how would you explain the increase in token usage, considering the fact that conventionally tokenizers are trained to minimize the token usage within a given vocabulary budget?

Just modeling whitespace as its own token would seem to explain the increase.

> Qwen3.6 tokenizer: "pretokenizer"

That's the pre-tokenizer, not the tokenizer. That is mostly a performance optimization that lets the memory requirements for the BPE tokenizer be a lot less.

> I could see a world where you could build a better tokenizer training algorithm which would be able to better take the natural morphology of the underlying text into account.

The reason everyone went to BPE was because it was so dramatically better than morphology based tokenizers. See the BPE paper: https://arxiv.org/abs/1508.07909

BPE already learns morphology because it sees the raw bytes.

kouteiheikaOP2mo ago

> That's the pre-tokenizer, not the tokenizer.

Yes, it's an extra tokenizer which runs before the learned tokenizer and injects an inductive bias into it.

> That is mostly a performance optimization that lets the memory requirements for the BPE tokenizer be a lot less.

While it does indeed speed up training of the tokenizer, no, it isn't mostly just a performance optimization? It injects a clear cut inductive bias into the tokenizer (split by words, split by punctuation, don't merge words and numbers, etc. -- is that not an inductive bias?), and for some languages (e.g. Asian languages which don't use spaces) the "it's just for performance" argument doesn't make as much sense because there it has no spaces to split on, so the chunks of text are much longer (although it does still split on punctuation, etc.).

Can we not agree that the absolutist position of "Putting an inductive bias in your tokenizer seems just a terrible idea." (as in - any inductive bias) is not actually true, especially since people are actually doing it?

Note, I'm not actually arguing that hand-crafted morphological tokenizers are better. (Which is the straw man many people seem to be replying to.) I'm just arguing that it should be feasible to train your tokenizer in a more morphologically aware way, because BPE doesn't do that.

> The reason everyone went to BPE was because it was so dramatically better than morphology based tokenizers. [..] BPE already learns morphology because it sees the raw bytes.

The reason everyone went to BPE is because of the bitter lesson (and because you don't have to hardcode your whole vocabulary, i.e. no UNK tokens), and not because it's particularly good at learning the morphology of the actual text. It's trivial to show countless examples where it fails to do so.

j / k navigate · click thread line to collapse