
rhdunn · 3y ago

The problem with GPT and other LLMs is that they don't tokenize at the word or morpheme level. Tokens are just subword blocks (around 4 characters on average for English text), so you get tokens like `!"` instead of two separate tokens. That makes it harder to write custom tools on top of them, unlike e.g. the output/model of things like the Universal Dependencies project.
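To make the `!"` complaint concrete, here is a minimal sketch comparing a word/punctuation-level split with a toy subword tokenizer. The `TOY_VOCAB` set and both function names are hypothetical, and the greedy longest-match loop is only a stand-in: real GPT tokenizers use learned byte-pair merges, not this procedure.

```python
import re

def word_level(text):
    # Each word and each individual punctuation mark becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def greedy_subword(text, vocab):
    # Greedy longest-match against a fixed vocabulary,
    # falling back to single characters when nothing matches.
    tokens, i = [], 0
    max_len = max(len(v) for v in vocab)
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in vocab:
                tokens.append(piece)
                i += n
                break
    return tokens

# Hypothetical vocabulary, chosen only to reproduce the merged `!"` token.
TOY_VOCAB = {'!"', "Stop", " he", " said"}
text = '"Stop!" he said'

print(word_level(text))                 # ['"', 'Stop', '!', '"', 'he', 'said']
print(greedy_subword(text, TOY_VOCAB))  # ['"', 'Stop', '!"', ' he', ' said']
```

In the second output the exclamation mark and the closing quote share one token, so a rule that wants to treat them separately has nothing to attach to.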

4 comments · 2 top-level

DougBTX · 3y ago

Do you strictly need that level of tokenisation precision to meet your high-level goals?

morkalork · 3y ago

This is my first reaction as well. Talking about tokenization and POS tagging is getting lost in the weeds when one has goals like this:

>I also want to be able to assess how much of the text is about a given topic, so that if I'm interested in reading a detective story from e.g. the Project Gutenberg collection, I don't want it to pick up a story where a detective is only mentioned in one paragraph.

This is more of an NLU problem than an NLP problem, isn't it? It's like tracking how much of a Harry Potter book contains Voldemort content without knowing ahead of time that he may be referred to as He Who Must Not Be Named, You-Know-Who, The Dark Lord and so on. One would have to first identify the thing you're interested in, then learn when the characters or the author invent new ways to refer to it, and carry all of those forward to find new instances. Fun!
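A rough sketch of the "how much of the text is about X" measurement, assuming the alias list is already known. Discovering new aliases is the hard part and is not attempted here. The function name `topic_coverage` and the sample paragraphs are made up for illustration.

```python
def topic_coverage(paragraphs, aliases):
    # Fraction of paragraphs that mention at least one alias (case-insensitive),
    # plus the indexes of the paragraphs that matched.
    lowered = [a.lower() for a in aliases]
    hits = [i for i, p in enumerate(paragraphs)
            if any(a in p.lower() for a in lowered)]
    share = len(hits) / len(paragraphs) if paragraphs else 0.0
    return share, hits

# Made-up sample paragraphs.
paragraphs = [
    "Voldemort returned.",
    "Harry ate breakfast.",
    "They feared You-Know-Who.",
    "The Dark Lord waited.",
]

print(topic_coverage(paragraphs, ["Voldemort"]))
print(topic_coverage(paragraphs, ["Voldemort", "You-Know-Who", "The Dark Lord"]))
```

With only the canonical name, one paragraph in four matches. With the aliases carried forward, three in four do, which is the gap the comment above is pointing at.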

rhdunn (OP) · 3y ago

I also want to tag and highlight those parts of the document. For that, I need to know where the label starts and ends, which you can't really do when you don't have control over the tokens.

It's also hard to write custom inference/tagging rules, like in the case you mentioned w.r.t. Voldemort, if you don't know what the tokens look like.
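The start/end bookkeeping can be sketched as follows, assuming the tokens concatenate back to the original text exactly. That holds for this toy token list. For a byte-level tokenizer the offsets would have to be tracked in bytes rather than characters. All names here are hypothetical.

```python
def token_offsets(tokens):
    # (start, end) character offsets of each token in the concatenated text.
    offsets, pos = [], 0
    for tok in tokens:
        offsets.append((pos, pos + len(tok)))
        pos += len(tok)
    return offsets

def label_span(offsets, first, last):
    # Character span covered by tokens first..last (inclusive).
    return offsets[first][0], offsets[last][1]

def highlight(text, start, end, mark="**"):
    # Wrap text[start:end] in marker strings.
    return text[:start] + mark + text[start:end] + mark + text[end:]

# Toy token list with a merged `!"` token.
tokens = ['"', "Stop", '!"', " he", " said"]
text = "".join(tokens)
offsets = token_offsets(tokens)

start, end = label_span(offsets, 1, 1)
print(highlight(text, start, end))   # "**Stop**!" he said

# The closing quote cannot be labeled on its own: its token also covers "!".
print(text[slice(*offsets[2])])      # !"
```

A label can only start or end on a token boundary, so with the merged `!"` token there is no way to highlight the closing quote without the exclamation mark unless you drop down to character offsets yourself.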

viksit · 3y ago

Perhaps a spaCy pipeline using GPT and Hugging Face?
