
0 points · leroman · 1y ago · 0 comments

Markdown is a very minimal markup language, so it has no need for much of the structural and presentational material (CSS, structural HTML). HTML carries many artifacts that are pure bloat and add no semantic value, IMO. The goal here is to capture any markup that has semantic value. If you have examples this library might miss, you are welcome to share them and I will look into it!


3 comments · 1 top-level

mistercow · 1y ago · 2 in thread

Well, markdown and HTML are encoding the same information, but markdown is effectively compressing the semantic information. This works well for humans, because the renderer (whether markdown or plaintext) decompresses it for us. Two line breaks, for example, “decompress” from two characters to an entire line of empty space. To an LLM, though, it’s just a string of tokens.

So consider this extreme case: suppose we take a large chunk of plaintext and compress it with something like DEFLATE (but in a tokenizer friendly way), so that it uses 500 tokens instead of 2000 tokens. For the sake of argument, say we’ve done our best to train an LLM on these compressed samples.

Is that going to work well? After all, we’ve got the same information in a quarter as many tokens. I think the answer is pretty obviously “no”. Not only are we giving the model a fraction of the time and space to process the information, but the LLM will also be forced to waste a lot of that computation on decompressing the data.
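To make the thought experiment concrete, here is a minimal sketch using Python's standard `zlib` module, which wraps DEFLATE. The sample text and the base64 step are assumptions for illustration only; base64 merely stands in for whatever "tokenizer-friendly" encoding one might pick, and nothing here models how an LLM would actually tokenize the result.

```python
import base64
import zlib

# Hypothetical sample text; any sufficiently repetitive prose behaves similarly.
text = ("Markdown and HTML encode the same information. " * 40).encode("utf-8")

compressed = zlib.compress(text, level=9)          # zlib wraps a DEFLATE stream
printable = base64.b64encode(compressed).decode()  # stand-in for a printable encoding

print(len(text), "bytes of plaintext")
print(len(compressed), "bytes after DEFLATE")
print(printable[:60], "...")

# Nothing is lost: the original comes back exactly.
assert zlib.decompress(compressed) == text
```

The compressed form carries exactly the same information in far fewer symbols, yet none of the original words are visible in it, which is the situation the hypothetical LLM would face.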

michaelmior · 1y ago

I think one big difference is that DEFLATE and most other standard compression algorithms are dictionary-based. So when you compress this way, you're really messing with the locality of tokens in a way that is likely unrelated to the semantics of what you're compressing.

For example, adding a repeated word somewhere in a completely different part of the document could change the dictionary and the entirety of the compressed text. That's not the case with the "compression" offered by converting HTML to Markdown. This compression more or less preserves locality and potentially removes information that is semantically meaningless (e.g. nested `div`s used for styling). Of course, this is really just conjecture on my part, but I think HTML-to-Markdown conversion is likely to work well. It would certainly be interesting to have a good benchmark for this.
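The locality point can be checked with a rough sketch. Everything below is an assumption for illustration: the sample document is made up, `strip_layout_tags` is a hypothetical, crude stand-in for a real HTML-to-Markdown converter (it only drops `div`/`span` wrappers), and "shared tail" is just one simple way to measure how far an edit propagates.

```python
import re
import zlib


def strip_layout_tags(html: str) -> str:
    """Crude stand-in for HTML-to-Markdown: drop div/span wrappers, keep the rest."""
    return re.sub(r"</?(?:div|span)\b[^>]*>", "", html)


def raw_deflate(data: bytes) -> bytes:
    """Raw DEFLATE stream (negative wbits: no zlib header or trailing checksum)."""
    compressor = zlib.compressobj(9, zlib.DEFLATED, -15)
    return compressor.compress(data) + compressor.flush()


def shared_suffix_len(a: bytes, b: bytes) -> int:
    """Number of identical bytes counted backwards from the end of both inputs."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


# Hypothetical document: a title followed by 30 styled paragraphs.
body = "".join(
    f'<div class="row"><p>Paragraph {i} talks about tokens and markup.</p></div>\n'
    for i in range(30)
)
original = "<div><h1>Title</h1></div>\n" + body
# The edit adds one word to the title; that word is repeated throughout the body.
edited = "<div><h1>Title tokens</h1></div>\n" + body

stripped_a = strip_layout_tags(original).encode()
stripped_b = strip_layout_tags(edited).encode()
deflated_a = raw_deflate(original.encode())
deflated_b = raw_deflate(edited.encode())

print("tag-stripped: shared tail",
      shared_suffix_len(stripped_a, stripped_b), "of", len(stripped_b), "bytes")
print("DEFLATE:      shared tail",
      shared_suffix_len(deflated_a, deflated_b), "of", len(deflated_b), "bytes")
```

In the tag-stripped output, everything after the edited title is byte-for-byte unchanged. The DEFLATE output would be expected to share little or none of its tail, since the edit can alter back-references and Huffman codes for the rest of the stream, though the exact numbers depend on the input and the zlib build.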

mistercow · 1y ago

Absolutely. I'm just making a more general point that "the same information in fewer tokens" does not mean "more comprehensible to an LLM". And we have more practical evidence that that's not the case, like the recent "Let's Think Dot by Dot" paper, which found that you can get many of the benefits of chain-of-thought simply by adding filler tokens to your context (if your model is trained to deal with filler tokens). For that matter, chain-of-thought itself is an example of increasing the tokens:information ratio, and generally improves LLM performance.

That's not to say that I think that converting to markdown is pointless or particularly harmful. Reducing tokens is useful for other reasons; it reduces cost, makes generation faster, and gives you more room in the context window to cram information into. And markdown is a nice choice because it's more comprehensible to humans, which is a win for debuggability.

I just don't think you can justifiably claim, without specific research to back it up, that markdown is more comprehensible to LLMs than HTML.

https://arxiv.org/abs/2404.15758

