Show HN: LLaMA tokenizer that runs in browser

Hi HN! I was looking for a tokenizer that would accurately(!) count tokens in browser, and I couldn't find one. So I thought "how hard can it be", and here we are 2 weeks later...

YetAnotherNick 3y ago

Great repo, but there was this for OpenAI, which was a bit hard to find: https://github.com/cogentapps/chat-with-gpt/blob/main/app/sr....

Not completely sure, but I think it will likely work as is for LLaMA, as both are BPE following the same rules.

belladoreai (OP) 3y ago

I think it would take quite a bit of work for someone to grab that BPE implementation and make it work for LLaMA. Less work than rewriting the whole tokenizer from scratch, for sure, but a non-trivial amount of work anyway.

zerojames 3y ago

I love this sentiment! Amazing work!

Solvency 3y ago

For those who would also think the same thing, what are some of the tl;dr bullet points on why this is more complicated than it'd seem?

belladoreai (OP) 3y ago

I'll answer with an example.

Consider the input string " grabbed".

If we wanted to map this string to tokens by greedily going from left to right and choosing tokens from the vocabulary with the strategy of minimizing the number of tokens, our algorithm would be very simple. We would end up with the following tokenization: [17229, 2580] == [" grab", "bed"]

Surprisingly, the LLaMA tokenizer does not work this way. It actually finds a "worse" tokenization for this input string: [2646, 1327, 287] == [" gra", "bb", "ed"]

The tokenizer arrives at this 3 token output by applying "merges" in a priority order. For example, this is a merge: [" g", "r"] -> " gr". The trained data contains tens of thousands of these merges. When we apply the merges in the priority order, we end up with 3 tokens.

Now you might be thinking, that's easy, we'll just iterate the list of merges and see if any of them apply. Only problem with that approach is that applying a merge can open up a new opportunity to merge something else that wasn't possible before. This right here is the key thing that makes this problem complicated. We can solve this problem by iterating all possible merges from the beginning after every time we apply a merge. This would produce the correct solution. Only problem is: our algorithm is now very slow and takes minutes to run...
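To make the difference concrete, here is a minimal sketch of both strategies. The merge table is a made-up toy, chosen only so that " grabbed" behaves as in the example above; it is not the real LLaMA merge data, and `greedy_longest_match` and `bpe_naive` are illustrative names rather than functions from the project. The merge loop is the slow "rescan after every merge" version described above.

```python
# Toy merge table in priority order (hypothetical, NOT the real LLaMA merges).
MERGES = [
    ("e", "d"),
    (" ", "g"),
    (" g", "r"),
    ("b", "b"),
    (" gr", "a"),
    (" gra", "b"),
    ("b", "ed"),
]
RANK = {pair: i for i, pair in enumerate(MERGES)}
# Vocabulary: the single characters plus every merge result.
VOCAB = set(" grabed") | {a + b for a, b in MERGES}


def greedy_longest_match(text):
    """Go left to right, always taking the longest vocabulary entry."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens


def bpe_naive(text):
    """Apply the best-ranked applicable merge, then rescan from scratch."""
    tokens = list(text)
    while True:
        ranked = [
            (RANK[pair], i)
            for i, pair in enumerate(zip(tokens, tokens[1:]))
            if pair in RANK
        ]
        if not ranked:
            return tokens
        _, i = min(ranked)  # highest priority first, leftmost on ties
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]


print(greedy_longest_match(" grabbed"))  # [' grab', 'bed']
print(bpe_naive(" grabbed"))             # [' gra', 'bb', 'ed']
```

In this toy table, `("b", "b")` outranks `(" gra", "b")`, so "bb" is formed before " grab" ever gets a chance. Also note that `(" gr", "a")` only becomes applicable after `(" g", "r")` has been applied, which is why a single pass over the merge list is not enough.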

szopa 3y ago

Tokenizers seem to be a massive pain in the neck if you are just calling into an API to use your model. The algorithm itself is non-trivial, and they need pretty sizable data to function: the vocabulary and the merges, which just sit there, using memory. I'm writing https://github.com/ryszard/agency in Go, and while there's a good library for the OpenAI tokenization, if you want a tokenizer for the HF models the best I found was a library calling HF's Rust implementation, which makes it horrible for distribution.

However, at some point I realized that what I really needed was not the tokens but the token count, as my most important use was implementing a Token Buffer Memory (trim messages from the beginning in such a way that you never exceed the context size in tokens). And in order to do that I don't need it to be exactly right, just mostly right, if I am ok with slightly suboptimal efficiency (keeping slightly less tokens than the model supports). So, I took files from Project Gutenberg, and compared the ratio of tokens I get using a proper tokenizer and just calling `strings.Split`, and it seems to be remarkably stable for a given model and language (multiply the length of the result of splitting on spaces by 1.55 for OpenAI and 1.7 for Claude, which leaves a tiny safety margin).

I'm not throwing shade at this project: just being able to call the tokenizer would've saved me a lot of time. But I hope that if I'm wrong about the estimates being good enough, some good person will point out the error of my ways :)
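A rough sketch of the estimate described above, written in Python rather than Go. The ratios are the ones quoted in the comment; the function names and the message format (plain strings) are assumptions made for illustration, not code from the `agency` library.

```python
import math

# Ratios quoted in the comment above (measured on Project Gutenberg text).
TOKENS_PER_WORD = {"openai": 1.55, "claude": 1.7}


def estimate_tokens(text, provider="openai"):
    """Estimate the token count from the number of space-separated pieces."""
    pieces = text.split(" ")  # same idea as Go's strings.Split(text, " ")
    return math.ceil(len(pieces) * TOKENS_PER_WORD[provider])


def trim_to_budget(messages, max_tokens, provider="openai"):
    """Drop the oldest messages until the estimated total fits the budget."""
    kept = list(messages)
    while kept and sum(estimate_tokens(m, provider) for m in kept) > max_tokens:
        kept.pop(0)
    return kept


history = ["a b c d", "e f", "g h i"]
print(trim_to_budget(history, max_tokens=10))  # ['e f', 'g h i']
```

Rounding each message up with `math.ceil` errs on the side of overestimating, which matches the stated goal of keeping slightly less than the model supports rather than risking an overflow.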

belladoreai (OP) 3y ago

> if I am ok with slightly suboptimal efficiency (keeping slightly less tokens than the model supports) ... multiply the length of the result of splitting on spaces by 1.55 for OpenAI and 1.7 for Claude

This sounds reasonable to me. You might also want to consider estimates based on the number of characters. And you also need a fallback for when the user enters some weird input that doesn't fall inside your safety margin and instead causes the OpenAI API to return an error (maybe in that case you aggressively trim the input and retry?)
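One way the "trim aggressively and retry" fallback might look, as a sketch only. `ContextTooLongError` and the `send` callable are hypothetical stand-ins for whatever the real API client raises and exposes, and halving a character budget is just one possible trimming policy.

```python
class ContextTooLongError(Exception):
    """Hypothetical stand-in for the error a real API client would raise."""


def keep_tail(text, max_chars):
    """Keep only the most recent max_chars characters of the input."""
    return text[-max_chars:] if max_chars > 0 else ""


def send_with_retry(text, send, max_chars, max_attempts=4):
    """Call send(); if the input is rejected as too long, halve it and retry."""
    for _ in range(max_attempts):
        try:
            return send(keep_tail(text, max_chars))
        except ContextTooLongError:
            max_chars //= 2  # aggressive trim before the next attempt
    raise ContextTooLongError("input still too long after trimming")
```

Trimming from the front keeps the most recent text, in line with the token buffer idea of dropping the oldest content first.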

hospitalJail 3y ago

> I get using a proper tokenizer and just calling `strings.Split`, and it seems to be remarkably stable for a given model and language (multiply the length of the result of splitting on spaces by 1.55 for OpenAI and 1.7 for Claude, which leaves a tiny safety margin).

One time I suggested this and got downvoted to hell.

To be fair to the downvoters, I quoted OpenAI's 7 tokens per word (on their tutorial page).

Seems incredibly unrealistic in hindsight, but at the time, things were fresh. Also, I think most people wanted something more robust than a linear calculation.

Hedepig 3y ago

Somewhat tangential: are there any open source attempts to compete with OpenAI's embeddings?

I know Word2Vec is a thing, but I believe that works on a word-by-word basis and doesn't capture the semantic meaning of whole sentences and paragraphs.

They charge so little for embeddings that I secretly hope they open source them. Because if the API is discontinued for some reason, any search functionality or the like that relies upon it would cease to function.

dhruv_anand 3y ago

https://news.ycombinator.com/item?id=36105660

Models under sentence-transformers are commonly used.

You can check this leaderboard, where OpenAI's embeddings are outperformed by open source ones: https://huggingface.co/spaces/mteb/leaderboard

superkuh 3y ago

I've been wondering how to use two spaces as a stop token with the llama models for months. Reading the source of this finally clued me in, "__". Nice. This is significantly easier to comprehend than sentencepiece.

adroitboss 3y ago

Does anyone know of a ChatGPT/GPT-4 tokenizer that can run client-side?

sp332 3y ago

https://platform.openai.com/tokenizer, or the official Python library tiktoken https://github.com/openai/tiktoken, or this JS port of tiktoken https://github.com/dqbd/tiktoken

oofsa 3y ago

https://platform.openai.com/tokenizer is not for GPT-4 but GPT-3. https://tiktokenizer.vercel.app/ supports GPT-4.

adroitboss 3y ago

Thanks for the recommendations. I found the one I needed because of this comment thread.

YetAnotherNick 3y ago

https://github.com/cogentapps/chat-with-gpt/blob/main/app/sr...

WilliamBerglund 3y ago

Yes, tiktoken. Here's a client-side visualizer for it:

https://github.com/functorism/gpt4-tokenizer-visualizer

adroitboss 3y ago

Thanks for the recommendations everyone.
