
0 points · freejazz · 2y ago · 0 comments

>What you described is entirely fair use, actually.

Based on what? Do you think other publishers use NYTimes articles for free, without a license?

3 comments · 1 top-level

ummonk · 2y ago · 2 in thread

He's talking about citing and quoting NYTimes articles, not republishing them verbatim. That said, it's very different if you're a publication that sometimes cites reporting from other publications vs. a website exclusively dedicated to indexing and summarizing NYTimes articles.

fennecbutt · 2y ago

I couldn't get GPT to quote an actual NYT article no matter how hard I tried. It just hallucinated in the general style of a news article.

Presumably, if it could remember at least a paragraph or two of each article, then the same would be true of any text it ingested, and the model size would approach the dataset size (or probably exceed it). I don't believe this is the case at all. Even after searching around, I haven't found any good recent examples of it regurgitating copyrighted text verbatim.

It's cool to hate AI stuff if you're a creative atm. But gotta love those generative/algorithm-based PS brushes, that's still real art!

"Indeed, the opening paragraph of "A Game of Thrones" by George R.R. Martin, with the chapter titled "Bran," starts as follows:

"The morning had dawned clear and cold, with a crispness that hinted"

And then it cuts off. Whether that's because OAI now has an "oh shit" filter, or because the model only had access to the first page or to publicly available articles quoting the first line, I'm not sure.

I tried other chapters and random sections, and it could get a sentence or two right but then hallucinated. What's more likely, NYT and GRRM: that your works are being reproduced verbatim, or that Facebook posts, YouTube descriptions, fan Tumblrs and, hell, the multiple publicly available GoT-related wikis that include a variety of passages from the books were used as training data?

ummonk · 2y ago

I don't think it's necessarily true that model size would need to be larger than dataset size. It's theoretically possible that the model encoding achieves significantly better compression than DEFLATE or GZIP or whatever compression algorithm is used to store the dataset itself.
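
For a sense of the baseline being compared against, here is a minimal Python sketch that measures how far DEFLATE and gzip shrink a piece of text, using only the standard-library `zlib` and `gzip` modules. The `compression_report` helper and the `sample` string are made up for illustration; the sample is deliberately repetitive, so it compresses far better than real prose would. It says nothing about what a model's weights actually achieve, only what the general-purpose compressors do.

```python
import gzip
import zlib


def compression_report(text: str) -> dict:
    """Compare raw size with DEFLATE (zlib) and gzip output sizes.

    Hypothetical helper for illustration only.
    """
    raw = text.encode("utf-8")
    deflated = zlib.compress(raw, 9)               # DEFLATE stream in a zlib wrapper
    gzipped = gzip.compress(raw, compresslevel=9)  # DEFLATE stream in a gzip wrapper
    return {
        "raw_bytes": len(raw),
        "deflate_bytes": len(deflated),
        "gzip_bytes": len(gzipped),
        "deflate_ratio": len(raw) / len(deflated),
    }


# Assumed sample: highly repetitive text, chosen only so the example is self-contained.
sample = "The quick brown fox jumps over the lazy dog. " * 200
report = compression_report(sample)
print(report)
```

Running the same helper over a real corpus would give the ratio a model's encoding has to beat for the "smaller than the dataset" argument to hold.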
