
0 points · Wowfunhappy · 11mo ago · 0 comments

A .zip is lossless compression. But we also have plenty of lossy compression algorithms. We've just never been able to use lossy compression on text.
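
For a concrete sense of what "lossless" means here, a minimal Python sketch using the standard-library `zlib` module (DEFLATE, the method most commonly used inside .zip files). The sample text is made up for illustration.

```python
import zlib

# Hypothetical sample text; the repetition makes it compress well.
text = ("The quick brown fox jumps over the lazy dog. " * 20).encode("utf-8")

compressed = zlib.compress(text, 9)   # 9 = highest compression level
restored = zlib.decompress(compressed)

# Lossless: the round trip returns exactly the original bytes.
assert restored == text
print(len(text), "bytes ->", len(compressed), "bytes")
```

A lossy scheme gives up that round-trip guarantee in exchange for a smaller output.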


6 comments · 2 top-level

Workaccount2 · 11mo ago · 4 in thread

> We've just never been able to use lossy compression on text.

...and we still can't. If your lawyer sent you your case files in the form of an LLM trained on those files, would you be comfortable with that? In what situation would you compress text with an LLM rather than a standard compression algorithm (other than to make an LLM)?

Other lossy compression targets information known to be superfluous. MP3 removes sounds we can't really hear, and JPEG discards fine detail and color variation that the eye barely notices.

LLMs kind of do their own thing, and the data you get back out of them is correct, incorrect, or dangerously incorrect (i.e. is plausible enough to be taken as correct), with no algorithmic way to discern which is which.

So while yes, they do compress data and you can measure it, the output of this "compression algorithm" puts it in the same family as a "randomly delete words and thesaurus long words into short words" compression algorithm, which I don't think anyone would consider using to compress their documents.
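
To make that comparison concrete, here is a toy Python sketch of such a "compressor". The function name `lossy_squeeze`, the synonym table, and the drop rate are all invented for illustration; none of it comes from a real tool.

```python
import random

# Hypothetical thesaurus: long words mapped to shorter stand-ins.
SHORTER = {"approximately": "about", "demonstrate": "show", "utilize": "use"}

def lossy_squeeze(text: str, drop_rate: float = 0.2, seed: int = 0) -> str:
    """Randomly drop words and swap long words for short ones (toy example)."""
    rng = random.Random(seed)
    kept = []
    for word in text.split():
        if rng.random() < drop_rate:
            continue  # this word is gone for good
        kept.append(SHORTER.get(word.lower(), word))
    return " ".join(kept)

original = "We utilize this clause to demonstrate that the fee is approximately fixed"
squeezed = lossy_squeeze(original)
print(squeezed)
# The output is shorter, but nothing in it records which words were dropped
# or replaced, so the original cannot be recovered from it.
```

Unlike MP3 or JPEG, nothing here decides which information is safe to lose; the dropped words are picked at random.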

tshaddox · 11mo ago

> If your lawyer sent you your case files in the form of an LLM trained on those files, would you be comfortable with that?

If the LLM-based compression method was well-understood and demonstrated to be reliable, I wouldn't oppose it on principle. If my lawyer didn't know what they were doing and threw together some ChatGPT document transfer system, of course I wouldn't trust it, but I also wouldn't trust my lawyer if they developed their own DCT-based lossy image compression algorithm.

antonvs · 11mo ago

> LLMs kind of do their own thing, and the data you get back out of them is correct, incorrect, or dangerously incorrect (i.e. is plausible enough to be taken as correct), with no algorithmic way to discern which is which.

Exactly like information from humans, then?

esafak · 11mo ago

People summarize (compress) documents with LLMs all day. With legalese the application would be to summarize it in layman's terms, while retaining the original for legal purposes.

Workaccount2 · 11mo ago

Yes, and we all know (ask teachers) how reliable those summaries are. They are randomly lossy, which makes them unsuitable for any serious work.

I'm not arguing that LLMs don't compress data. I'm arguing that they are compression tools in the technical sense but not in the colloquial sense, and that their overlap with colloquial compression tools is almost zero.


1718627440 · 11mo ago

SMS codes are a kind of lossy text compression.
