I’m not sure there is such a thing as “equitable” tokenization. If you applied the same Byte Pair Encoding to a mostly Japanese corpus you would see many whole Japanese characters as tokens (犬, pronounced inu, as one token instead of three bytes) and probably multi-character words as well (say 日本語, pronounced nihongo, “Japanese language”). Note that in both cases the UTF-8 byte count is roughly comparable to the number of Roman letters needed to spell the word out: 犬 is 3 bytes versus three letters for “inu”, and 日本語 is 9 bytes versus seven letters for “nihongo”.
You’d get very different results with a different language mix, and my guess is that if you applied BPE to a huge corpus that was balanced (the same amount of Japanese, Korean, French, …) you’d get a vocabulary that is mediocre for all of them.
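To make that concrete, here is a toy byte-level BPE trainer (a sketch, not any production tokenizer): merges always go to the most frequent byte pairs, so whichever language dominates the corpus soaks up the merge budget, and a balanced corpus has to split that budget across all of its languages.

```python
from collections import Counter

def bpe_merges(data: bytes, n_merges: int):
    """Toy byte-level BPE: repeatedly merge the most frequent adjacent pair."""
    seq = list(data)          # start from raw bytes (ids 0-255)
    merges = []
    next_id = 256             # new token ids for merged pairs
    for _ in range(n_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:         # nothing worth merging
            break
        merges.append((a, b, next_id))
        out, i = [], 0
        while i < len(seq):   # rewrite the sequence with the merged token
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(next_id)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_id += 1
    return merges, seq

# A corpus dominated by one Japanese word: the first merges rebuild its bytes.
corpus = ("日本語" * 50).encode("utf-8")
merges, seq = bpe_merges(corpus, 8)
print(len(corpus), "bytes compressed to", len(seq), "tokens after", len(merges), "merges")
```

Run the same trainer on a mixed-language corpus and the merge list interleaves pairs from each language instead, which is the “mediocre for all of them” outcome.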
On the other hand, this article seems to show that GPT-4 does a good job even with a terrible tokenization, which leaves me thinking: could we just give up on word parts entirely and fall back on character-level (or raw UTF-8 byte) modeling?