Well, markdown and HTML are encoding the same information, but markdown is effectively compressing the semantic information. This works well for humans, because the renderer (whether markdown or plaintext) decompresses it for us. Two line breaks, for example, “decompress” from two characters to an entire line of empty space. To an LLM, though, it’s just a string of tokens.
So consider this extreme case: suppose we take a large chunk of plaintext and compress it with something like DEFLATE (but in a tokenizer friendly way), so that it uses 500 tokens instead of 2000 tokens. For the sake of argument, say we’ve done our best to train an LLM on these compressed samples.
Is that going to work well? After all, we’ve got the same information in a quarter as many tokens. I think the answer is pretty obviously “no”. Not only are we using a small fraction as much time and space to process the information, but the LLM will be forced to waste a lot of that computation on decompressing the data.