
0 points · colechristensen · 2y ago · 0 comments

I think NYT is going to win.

LLMs are arguably compressed data archives with weird algorithms. The fact that they will regularly regurgitate verbatim quotes of training data is evidence of this, as are the guardrails that try to prevent this.

The second piece of evidence is the paper explained here: https://www.hendrik-erz.de/post/why-gzip-just-beat-a-large-l... Instead of an LLM, the researchers used gzip-compressed data as the model, and it even beat trained LLMs.
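For anyone who hasn't seen the idea, here is a rough sketch of what "gzip as a model" can look like: a normalized compression distance (NCD) fed into a nearest-neighbour vote. This is my own minimal illustration, not the paper's code; the function names (`clen`, `ncd`, `classify`) and the toy training data are made up for the example.

```python
import gzip
from collections import Counter


def clen(text: str) -> int:
    """Length in bytes of the gzip-compressed text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance: small when a and b share structure."""
    ca, cb = clen(a), clen(b)
    cab = clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def classify(text: str, train: list[tuple[str, str]], k: int = 1) -> str:
    """Label `text` by majority vote among its k nearest training texts."""
    ranked = sorted(train, key=lambda pair: ncd(text, pair[0]))
    labels = [label for _, label in ranked[:k]]
    # Counter keeps first-seen order on ties, so the nearest neighbour wins.
    return Counter(labels).most_common(1)[0][0]


# Hypothetical toy data, only to show the call shape.
train = [
    ("stock markets fell sharply as investors sold shares " * 4, "business"),
    ("the home team scored twice in the final minutes of the match " * 4, "sports"),
]
print(classify("investors sold shares as markets fell " * 4, train))
```

There is no training step at all: the "model" is just the stored texts plus a compressor, which is why it gets compared to a data archive.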

AI is a bit of a black box, but that doesn’t protect the operators of black boxes from rights-violation suits. You can’t make a database of scraped copyrighted data and pretend that querying that data is fair use.

New law needs to be made here, and it just isn’t going to be “everybody can copy everything for free as long as it’s for model training”.

Licensing will have to be worked out, and actual laws, not just case law, need to be written. I have a lot of sympathy for giving plenty of leeway to the open source researchers and hackers doing things… but not so much for Microsoft and Microsoft-sponsored OpenAI.


1 comment · 1 top-level

z4y5f3 · 2y ago

Unfortunately, gzip won't beat LLMs for text classification. The research you cited is poorly done science that has been widely debunked. The original paper compared the top-2 accuracy of gzip against the top-1 accuracy of BERT. The dataset also contains a lot of train/test data leakage. See this article for the rebuttal: https://kenschutte.com/gzip-knn-paper/ and this thread for a previous discussion on Hacker News: https://news.ycombinator.com/item?id=36758433.
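To make the top-2 vs top-1 point concrete, here is a small sketch of why the two numbers are not comparable. The labels and predictions are invented for illustration and have nothing to do with the paper's actual data; `top_k_accuracy` is just a helper name I picked.

```python
def top_k_accuracy(ranked_preds: list[list[str]], truths: list[str], k: int) -> float:
    """Fraction of examples whose true label is among the first k predictions."""
    hits = sum(1 for preds, y in zip(ranked_preds, truths) if y in preds[:k])
    return hits / len(truths)


# Hypothetical ranked predictions (best guess first) and true labels.
ranked = [
    ["sports", "world"],
    ["world", "sports"],
    ["tech", "sports"],
    ["sports", "tech"],
]
truths = ["sports", "sports", "sports", "world"]

print(top_k_accuracy(ranked, truths, k=1))  # 0.25
print(top_k_accuracy(ranked, truths, k=2))  # 0.75
```

Same predictions, very different score. Top-2 can never be lower than top-1, so putting one method's top-2 next to another method's top-1 flatters the first one.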

Further, the evidence presented by NYT in the lawsuit could be hard to reproduce. I tried multiple prompts on multiple versions of the GPT-4 API but still could not get GPT-4 to reproduce NYT articles exactly. NYT may well have tried to get GPT-4 to reproduce 100,000 articles and found only a few cases where it actually recited the whole article. In that case, OpenAI could argue that this is only a rare bug and avoid losing the lawsuit in a massive way.
