The algorithm is recursive in the sense that merged tokens can participate in further merges, just as in byte-pair encoding, but it involves a great deal of pointer-chasing, so the core is inherently iterative. The pointers let each merge skip over every token it does not affect, so updating the pair counts is very cheap. You can imagine each initial token being equipped with a fixed compute budget: merging two tokens consumes the budget of one of them, while the other's budget carries over to the merged result. The overall time is therefore bounded by (compute budget per token) × (number of tokens).
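A minimal sketch of the pointer-chasing idea, assuming the tokens live in a doubly-linked list (the `Node` class and helpers here are illustrative, not the actual implementation): merging a token with its successor splices one node out in O(1), and only the immediate neighbors need their adjacent-pair counts refreshed, so unaffected tokens are skipped entirely.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    token: str
    prev: "Optional[Node]" = None
    next: "Optional[Node]" = None

def build_list(tokens):
    """Link a sequence of tokens into a doubly-linked list; return the head."""
    head, prev = None, None
    for t in tokens:
        node = Node(t)
        if prev is None:
            head = node
        else:
            prev.next, node.prev = node, prev
        prev = node
    return head

def merge_pair(left: Node) -> Node:
    """Merge `left` with its successor in place.

    One node (the successor) is consumed -- its "compute budget" pays
    for this merge -- while `left` survives, carrying its own budget
    forward so the merged token can take part in later merges.
    Only O(1) pointers change; all other tokens are untouched.
    """
    right = left.next
    left.token = left.token + right.token
    left.next = right.next
    if right.next is not None:
        right.next.prev = left
    return left

head = build_list(["a", "b", "c", "d"])
merge_pair(head.next)          # merge "b" and "c" into "bc"

out = []
node = head
while node is not None:        # walk the list to see the result
    out.append(node.token)
    node = node.next
# out == ["a", "bc", "d"]
```

Because each merge permanently removes one node, at most (number of tokens) merges can ever happen, which is the amortized bound described above.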