I’m not sure there is such a thing as “equitable” tokenization. If you applied the same Byte Pair Encoding to a mostly Japanese corpus you would see many whole Japanese characters as tokens (犬, pronounced inu, as one token instead of three bytes) and probably multi-character words as well (say 日本語, pronounced nihongo, “Japanese language”). Note that in both cases the UTF-8 byte count is roughly comparable to the number of Roman letters needed to spell the word out: 犬 is 3 bytes versus three letters for “inu”, and 日本語 is 9 bytes versus seven letters for “nihongo”.
You’d get very different results with a different language mix, and my guess is that if you applied BPE to a huge corpus that was balanced (the same amount of Japanese, Korean, French, …) you’d get a vocabulary that is mediocre for all of them.
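To make that concrete, here is a toy byte-level BPE trainer (a sketch, not any production tokenizer): merges always go to the most frequent byte pairs, so whichever language dominates the corpus soaks up the merge budget, and a balanced corpus has to split that budget across all of its languages.

```python
from collections import Counter

def bpe_merges(data: bytes, n_merges: int):
    """Toy byte-level BPE: repeatedly merge the most frequent adjacent pair."""
    seq = list(data)          # start from raw bytes (ids 0-255)
    merges = []
    next_id = 256             # new token ids for merged pairs
    for _ in range(n_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:         # nothing worth merging
            break
        merges.append((a, b, next_id))
        out, i = [], 0
        while i < len(seq):   # rewrite the sequence with the merged token
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(next_id)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_id += 1
    return merges, seq

# A corpus dominated by one Japanese word: the first merges rebuild its bytes.
corpus = ("日本語" * 50).encode("utf-8")
merges, seq = bpe_merges(corpus, 8)
print(len(corpus), "bytes compressed to", len(seq), "tokens after", len(merges), "merges")
```

Run the same trainer on a mixed-language corpus and the merge list interleaves pairs from each language instead, which is the “mediocre for all of them” outcome.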
On the other hand, this article seems to show that GPT-4 does a good job even with a terrible tokenization, which leaves me thinking: could we just give up on word parts entirely and fall back on character-level (or raw UTF-8 byte) modeling?