Better HN

0 points · TeMPOraL · 2y ago · 0 comments

> But also whatever the number 3 or 4 most valuable company in the world doesn’t get to scrape your content daily to repackage and sell as intelligent systems.

Here's the thing, though: for 99%+ of that content, being turned into feedstock for ML model training is about the only valuable thing that came of its existence.

If it were not for the world-ending danger of too smart an AI being developed too quickly, I'd vote for exempting ML training from copyright altogether, today. It's hard to overstate just how much more useful any copyrighted content is for society as LLM training data than as whatever it was originally created for.

3 comments · 1 top-level

tsimionescu · 2y ago · 2 in thread

Except if you do that, you will see the number of content producers plummet quite quickly, and then you won't have any new training data to train new LLMs on.

aspenmayer · 2y ago

Would it not logically follow that nothing of value would be lost, even if that were the case? From the point of view of LLMs and content creators, I would treat potential loss of future content being created like I would treat a lost sale. LLMs have value now because of training performed on content that already exists. There must be diminishing returns for certain types of content relative to others. Certain content is only of value if it is timely, and going forward, content that derives its worth from timeliness would find its creation and associated costs of production and acquisition self-justifying. If content isn’t of value to humans now or in the future, nor even of value to LLMs now or in the foreseeable future, not even hypothetically, then why should we decry or mourn its loss or absence or failure to be created or produced or sold?

tsimionescu · 2y ago

That's like saying that if a competitor can take your products from your warehouse and sell them for pennies on the dollar, your business has no value. The point is that, to some extent, OpenAI is selling access to NYT content for much less than the NYT charges, while paying the NYT exactly 0 for this content. Obviously, the NYT content costs the NYT more than 0 to produce, so they just can't compete with OpenAI on price for their own content.

Note that I wouldn't see any major problem if only articles that were, say, more than 5 or 10 years old were being used. I don't think the current length of copyright makes any sense. But there is a big difference between last year's archive and today's news.
