The Pile: An 800GB dataset of diverse text for language modeling (2020)

(arxiv.org)

184 points · charlysl · 2y ago · 70 comments

56 comments · 8 top-level

sillysaurusx · 2y ago · 34 in thread

Author here. And by author I mean I created books3 (the books component of The Pile) while everyone else did the hard work of actually writing the paper, ha. Stella and Leo Gao in particular did so much wonderful work on the paper, though it couldn’t have happened without everyone’s contributions.

As far as I know, this was the first academic contribution from a discord collaboration to ML. Back then discord was barely used for ML at all, though nowadays of course the largest discord in the world is midjourney.

There were a bunch of interesting stories from those days. We almost didn’t release at all (or at least the books component) because of fear of copyright backlash. Turns out no one cared, and then suddenly today the world cares a great deal.

As a side note, I’ll be participating in a legal action against Meta for the purpose of making ML models uncopyrightable: https://twitter.com/theshawwn/status/1641804013791215619?s=6.... They DMCA’ed one of my repos distributing LLaMA, so we fought back and challenged the idea that weights can be copyrighted at all. This seems like the best outcome for hackers and individual researchers, for a few reasons. It’s also one of the most ethical outcomes; since ~no one trains on data that they own, they shouldn’t own the resulting model.

One last thing. The Pile would’ve been far less relevant without the wonderful assistance of The Eye, a group of people who archive all kinds of things. They’ve hosted the datasets for years now. And although it seems strange to say that dataset hosting could make or break The Pile, back then there was nobody else willing to host us. https://the-eye.eu/

andy99 · 2y ago

Hi Shawn, re your side note, I disagree with you that we'd be better off if weights couldn't be copyrighted - basically because copyright gives options like GPL that can keep models open, otherwise we're just going to see everything good disappear behind trade secret. That said I fully support your "civil disobedience" in sharing the weights. I don't expect you to agree, but take a look at something I just wrote about this yesterday: http://marble.onl/posts/model_weight_copyrights.html . I'm happy to chat about it if you're interested.

nullc · 2y ago

Even RMS himself prefers the abolition of copyright over the existence of the GPL.

Besides, it's already pretty unambiguous that weights are not copyrightable: they're a result of a mechanical process. The only original creative input that goes into the weights is the unfathomable amounts of content scraped from other sources that aren't the authorship of the models. The objective of the gradient descent is simply minimizing loss on the training data.

Facebook doesn't own the llama model weights any more than the Bridgeman Art Library practically owns the paintings of European masters because they made quality scans of them. ( https://en.wikipedia.org/wiki/Bridgeman_Art_Library_v._Corel.... ), or any more than Rural Telephone owns the phone directory ( https://en.wikipedia.org/wiki/Feist_Publications,_Inc.,_v._R.... ).

Trying to make model weights copyrightable is going uphill, and I don't see how you get there without first establishing that these LLMs are unlawful derivatives of a countless number of copyrighted works along the way. Doing so would probably create an immediate monopoly on legally created LLMs for the handful of corporations with quasi-monopoly content hosting services (Facebook, Google, etc.) that can stuff (and/or already have stuffed) licensing into their terms of use.

Do you want a cyberpunk dystopia? I think creating an AI monopoly is how you get a cyberpunk dystopia -- and the two ways we end up with one are outright restrictions on private development of AI, like some have been lobbying for, and the extension of copyright so that only a few entities can get access to enough of other people's data at a low enough cost to train them.

sillysaurusx · 2y ago

It might seem like I’m entrenched in my position, but it’s quite the opposite: the only reason I’m doing this is because I really believe it’s the best outcome for devs in the long run. I’m open to changing my mind and pulling the plug on everything.

I’ll read over your essay and give it some thought. There are a bunch of subtle aspects to consider; I’ve been thinking it over for about four months now and still haven’t covered all the territory yet.

It feels like this may be one of the most important decisions going forward — both from an intellectual property point of view, and an individual rights perspective. E.g. you mention that it’s civil disobedience to share the weights, but it feels like if someone is claiming to do open science (LLaMA), sharing the research materials is the minimum requirement. Plus look how it’s benefited them; they’ve captured most of the open source LLM mindshare. So it seems likely that this will lead to more open source work in the long run, not less.

Feel free to chat! You can DM me on Twitter or email me. I’ve been in the hospital with my wife for 7 weeks, with two to go, so I’ve been a bit less responsive than I usually am.

js8 · 2y ago

> otherwise we're just going to see everything good disappear behind trade secret

We will see that anyway. All the code I work on commercially is copyrighted and yet a trade secret. Existence of copyright (with the exception of copyleft, but that's subversion) didn't help software to be open sourced.

IMHO allowing models to be copyrighted is basically 18th century enclosures again.

YetAnotherNick · 2y ago

> All the code I work on commercially is copyrighted

What do you mean by that? Do you continuously copyright the changes?

amoss · 2y ago

> If you’re sceptical, go look at one of the forums where people are building derivatives of Stable Diffusion (possibly NSFW, I’m not providing any links).

Does anybody have a link to a relevant discussion here? I would like to read about the creative process that goes into defining model weights, and how it differs from the mechanical output of running the training algorithm.

zarzavat · 2y ago

There’s no use living a lie. Training a model is not an act of creative expression and cannot give you authorship of the weights. Enforcing the GPL if you don’t have IP makes you no better than any other copyright troll…

vr46 · 2y ago

Reading these discussions with interest as I am in the process of making and training my own personal model using my 31+ archive of photography, pictures taken and owned by me, which goes against this idea that models are not trained on data owned by the companies doing the training. While this is all for my own personal interest and use, how would the idea that the weights cannot be copyrighted affect my rights on the model if I were to release the whole thing for use?

londons_explore · 2y ago

I'd be interested to know how your model performs if it is trained only on your own work.

Assuming you haven't taken a photo a second for your entire life, then I suspect you'll struggle to make something even close to what's available publicly, due to lack of training data.

GaggiX · 2y ago

How big is the archive? These models are typically trained on at least 100M images.

hedgehog · 2y ago

I understand that LLMs to date have mostly been trained on a wide variety of copyright-encumbered data but in other domains (computer vision for example) the tradeoffs are different and in practice many models are trained on private / unencumbered data. If those weights are not protected by copyright then my concern is it will be hard to sufficiently protect them via license agreement and it will become yet another factor favoring the SaaSification of everything in tech.

sillysaurusx · 2y ago

This is true, and it's why I hesitated to file legal action. My goal was to benefit hackers. If the outcome causes problems for people who are just trying to share their work, I'd be upset.

Ultimately what convinced me to proceed is that there are immense forces pressuring ML models to become SaaS companies. It's very difficult to offer an ML model for extended periods without being a company. E.g. https://6b.eleuther.ai/ is down. Eleuther failing illustrates just how hard it is –– we were all working as hard as we could to design something that would last a long time, and a long time turned out to be two short years. Contrast that with other kinds of hacking (e.g. webdev, gamedev, hardware...) where the end result lasts basically forever.

So if ML models aren't copyrightable, I think it'll hurt companies a lot more than individuals. In fact the goal is the other way around: to protect individuals. All I did was publish Facebook's own GPL download script to github, and it got DMCA'd. If we don't push back on that kind of behavior now, companies will get used to the idea that they control "their" model –– even when their model is anything but theirs.

tensor · 2y ago

If an individual trains a model on their own data to embody their own skills and behaviour, so that they can then sell/rent that model out to work on their behalf, well in that scenario not being able to treat the weights as intellectual property (copyright or otherwise controllable by a license), would be a huge violation and detrimental to that individual.

I think it would be a shame to try to build legislation around the notion of the SaaS melting pot application of machine learning and in the process destroy all sorts of other use cases.

hedgehog · 2y ago

I think the DMCA being a massive overreach is a separate issue from whether weights should be eligible for copyright. This is a complicated legal area and I'm very much not a lawyer so let me just stick to some examples that guide my thinking:

- Grammarly. Clear value prop, if weights can't be adequately protected then that's a significant headwind against doing processing on the client.

- Adobe Firefly. Could run locally, they understand the technical challenges well, same headwind.

- GitHub Copilot. Same.

Copyright protection is probably not a single deciding issue in their product strategy but all of those are use cases that would for most users be better run locally as hardware can support that and are not going to because it's too much of a risk. Better to limit distribution and protect as trade secret.

The most powerful force for openness I see has nothing to do with copyright eligibility and everything to do with companies wanting to showcase their research arms to build brand and support recruiting. That leads me to believe it's probably better for models to be eligible for copyright and considered derivatives of all of the constituent training data. In some ways the better parallel is sampling in the music industry. It'll be interesting to see how this plays out.

idiotsecant · 2y ago

Is it useful to protect weights with copyright? What if I download your weights and retrain them for 5 seconds, changing each weight .0000001%? How much change is a new product? What if I change a single weight?

hedgehog · 2y ago

Like the parallel scenarios of taking a book and changing a few words, slapping a new logo on someone else's app, or stylizing a photo with a filter, those are questions that will be answered in court if people can't come to an agreement on their own.

archivist0 · 2y ago

> One last thing. The Pile would’ve been far less relevant without the wonderful assistance of The Eye, a group of people who archive all kinds of things. They’ve hosted the datasets for years now. And although it seems strange to say that dataset hosting could make or break The Pile, back then there was nobody else willing to host us. https://the-eye.eu/

I'm afraid to say... the-eye no longer hosts the pile as of today due to legal threats above the likes of DMCA.

Though I believe it's still available via its original torrent and on Academic Torrents.

> https://academictorrents.com/details/0d366035664fdf51cfbe9f7...

sfriedr · 2y ago

If this is true, it would be something close to an insane situation: one of the largest datasets, which the largest companies are using to train their models (probably; many of the best LLMs have technical reports that raise more questions than they answer), being forced to live an obscure existence on torrents.

From a scientific point of view this is very problematic, because few safeguards exist to guarantee that the dataset has not been tampered with (as there would be if you uploaded it to Zenodo, which provides some guarantee of immutability).

How about trying to upload the Pile to Zenodo? Only half-joking :D
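One partial safeguard for torrent-distributed data, sketched here as an illustration: whoever publishes the dataset can also publish a SHA-256 digest for each file, and downloaders can recompute it locally. This is a minimal sketch using only Python's standard `hashlib`; the function names and the existence of a published digest list are assumptions, not something the thread says is available for The Pile.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks
    so that multi-gigabyte shards never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: str, expected_hex: str) -> bool:
    """Compare a local file against a digest published by the dataset
    maintainers (hypothetical: assumes such a digest list exists)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A matching digest only shows that the file equals what the publisher hashed. It says nothing about who the publisher is, so the digest list itself has to come from a channel you trust.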

koheripbal · 2y ago

I'm more interested in The Pile V2 which seems to have gone underground...

sfriedr · 2y ago

Could you share more about copyright? For example, aren't you worried that now, with all kinds of lawsuits happening [1] and copyright issues that were found in existing datasets [2], you might get threatening letters from a lawyer some day?

I'm the author of [3], where we introduced one of the first natural-language datasets that tests graduate-level mathematics for LLMs, but we took some of the prompts from a copyrighted book and therefore thought about excluding them. Having them in the public dataset would be really nice though, hence I'm keen to hear about your experience.

I'd also be keen to hear how your challenge against the DMCA on sharing LLaMA's weights goes?

[1] https://www.theguardian.com/books/2023/jul/05/authors-file-a... [2] https://arxiv.org/abs/2105.05241 [3] https://arxiv.org/abs/2301.13867

sillysaurusx · 2y ago

I think a lot of hackers shy away from doing impactful work because of fear. Sometimes those fears are justified, but it's remarkable how often things that seem like a big deal turn out not to matter. My advice for ambitious devs would be to do what seems interesting, and don't worry too much about threatening letters. Usually the worst thing that happens is that you agree to stop doing whatever generated the threat.

Personally, I'm not worried. It would be a damn shame if academics come under fire merely for trying to operate on the cutting edge of science. None of us were trying to make money; we just wanted to make something interesting.

> I'd also be keen to hear how your challenge against the DMCA on sharing LLaMA's weights goes?

Thanks! I think we might be putting up a website for it soon, if only to explain ourselves. In the meantime – I hate this phrase, since I don't want followers – the only way to keep informed is to follow my Twitter, and perhaps keep an eye on my HN comments.

You'll probably hear about it either way though, since it's a groundbreaking case. No one has tested the copyrightability of ML models before.

jacquesm · 2y ago

What exactly is it that you claim copyright over? Are you sure that you have standing to bring that suit?

Der_Einzige · 2y ago

Getting sued is straight up a good thing for most people's careers in tech. Haven't you watched Silicon Valley?

jacquesm · 2y ago

> It’s also one of the most ethical outcomes; since ~no one trains on data that they own, they shouldn’t own the resulting model.

In my opinion the most ethical outcome would be that they are on the hook for the cumulative cost of the copyright they violated. That way authors would come out ahead instead of having their rights trashed 'because it's too late anyway'.

nl · 2y ago

Learning from something has never been copyright violation before, even when a computer was learning (eg, building a search index from copyrighted data is fair use; cite: Google cases).

cornel_io · 2y ago

Whether or not training on publicly available data counts as a copyright violation is still completely up in the air legally, and clearly a lot of lawyers at all of the top tech companies think they're going to end up in the clear under fair use.

At some point this stuff will have to get tested by making its way up the appeals stack in the US, and IMO there is only a minuscule chance that will result in Google, MS, and Meta getting slapped with anything more than a token fine (my bet is it won't even be that), let alone paying every person who ever wrote anything that was used in these datasets for copyright violations, which would basically be everyone.

jacquesm · 2y ago

There are more courts than just the US ones.

rpdillon · 2y ago

> on the hook for the cumulative cost of the copyright they violated.

I think there's a strong argument for a Fair Use defense, given the size of the models versus the size of the training sets, as well as the gulf in intended use: an AI model doesn't compete with e.g. a book. Obviously we'll have to see it play out in court to find out.

ben_w · 2y ago

Current AI models don't compete with a book, from what I've seen; I wouldn't want to bet how long it takes before they can compete with not just one but all books.

lukemerrick · 2y ago

Related to the idea of "no one trains on data they own, they shouldn't own the resulting model": since big public datasets like The Pile have CC-SA items in them, is anyone considering bringing the argument that model weights are derivative work that must be "shared alike"?

koheripbal · 2y ago

By that token, my brain is a derivative work of all the copyrighted works I've consumed

koheripbal · 2y ago

What ever happened with The Pile V2? I spent a couple hours searching for it, but the Eye is impossible to navigate and people on the discord generally invite noobs like myself.

moffkalast · 2y ago

> They DMCA’ed one of my repos distributing LLaMA

Boy they'll be mad once they learn about Hugging Face distributing thousands of LLaMA fine-tunes with full weights.

redox99 · 2y ago

Weights being copyrightable is already a questionable thing. Derivatives (like finetunes) are even more questionable.

robertheadley · 2y ago · 5 in thread

As long as LLMs and generative AI use copyrighted works for training, they are going to be the enemy of creative people.

koheripbal · 2y ago

This is like saying that my brain violates copyright when I write sci-fi because one time, years ago, I watched Star Wars.

mattkevan · 2y ago

Creative people will be using LLMs and other models as new and exciting creative tools.

Their real enemies will be the people who make money off the creative people’s work, e.g. the entire history of recorded music or the current writers’ strike.

splatzone · 2y ago

Unless the financial benefit could be shared with the original authors somehow, with some kind of royalties system?

fsckboy · 2y ago

I love how "creatives" enjoy the freedom of the free internet but never try to shame their peers as to whether they use GPL or MIT license for their art.

robertheadley · 2y ago

I think the more matter of fact the influence, the more the original artist deserves compensation. See Waits v. Frito-Lay, Inc.

I do not want something like this to happen to generative AI and make things more difficult for the technology to progress and flourish.

Roark66 · 2y ago · 4 in thread

Great stuff, I skimmed the article searching for some table showing a breakdown of content by language, but I haven't found one.

I hope there is a lot of text in languages other than English. For example, in my language (Polish) current SOTA models are very deficient. I have wondered why that is, considering that companies like (not at all) OpenAI claim to train on large datasets, including ones in my language of interest. It turns out (and I learned this just yesterday) that they used LLM-translated English content as training data for other languages. For example, they used Azure Translator, which is itself a transformer model, to generate content for GPT-3.5. Also, I bet there is a lot of poorly machine-translated content in their supposedly "original" data.

The result? You can use ChatGPT to write you an email of any kind in English and you can copy/paste/send it immediately. Try doing that in Polish... It will make sense, but the language will have the wrong tone (too familiar for a business setting), bad words (words that exist, but that no real person would use) and a sentence layout that just plainly feels weird. I suspect this is even worse in many other languages.

koheripbal · 2y ago

While having multiple languages makes a model more versatile and appeal to a wider audience, it actually significantly increases the memory required to run the model and thus limits other aspects of the model.

Optimally, a Polish audience should try to create a Polish trained model.

As it stands now, most advanced models, like GPT, are multilingual, but are noticeably less capable in non-English languages.

spi · 2y ago

Having every model re-trained in each language is a certain path towards having any non-English (or at most a couple of other languages from countries with big pockets, like Chinese) language model be always massively behind - the resources required to train a model are huge, you can't expect e.g. the Polish community (plus anyone else) to replicate every good English model that comes out. GPT4 is less capable in Polish than in English, but probably much more than any Polish-specific model ever trained - and I suspect the gap is bigger than that with the best non-GPT4 English model.

Furthermore, I think you are exaggerating the memory issue of multilingual models significantly. Especially for languages using the same (Latin) script, the additional characters to care about are very few. Also a significant part of the vocabulary and language fall into a few buckets, so training a joint model makes all the sense in the world - much like an Italian native speaker could likely study a scientific text in Spanish and understand its content, even without speaking the language.

The memory impact comes mostly from having bigger embedding layers that have to account for vocabulary in many languages (the most problematic case being Chinese and Japanese, with their huge set of tokens). But even there, the largest vocabularies in use are maybe of size 100k (vs. about 30k for English-only), with a hidden dimension of 4k that makes for a total of 400M parameters. It's a lot, but a drop in the ocean of 100B+ parameters (or 1T+ for GPT4) we're seeing today.
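The arithmetic above can be checked in a few lines. This is a minimal sketch using the round figures from the comment (100k vs. 30k vocabulary, hidden dimension 4k, a 100B-parameter model); the function name is hypothetical, and it assumes a single embedding matrix (an untied output projection would double the count).

```python
def embedding_params(vocab_size: int, hidden_dim: int) -> int:
    """Parameters in one token-embedding matrix (vocab_size x hidden_dim)."""
    return vocab_size * hidden_dim


multilingual = embedding_params(100_000, 4_000)   # 400,000,000
english_only = embedding_params(30_000, 4_000)    # 120,000,000
extra = multilingual - english_only                # 280,000,000

total_params = 100_000_000_000  # round 100B-parameter model, as in the comment

print(f"multilingual embeddings: {multilingual / 1e6:.0f}M")
print(f"extra vs. English-only:  {extra / 1e6:.0f}M")
print(f"share of a 100B model:   {multilingual / total_params:.2%}")
```

Under these assumptions the multilingual embedding table is well under one percent of the model, which is the point the comment is making.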

P.S. Answering to GP, I think the Pile is English only, though - or at least, models on HuggingFace trained on the Pile, like the various Pythia models, are tagged as English only.

Roark66 · 2y ago

The biggest difficulty people have IMO when trying to train language models in non-English languages is that there is not enough text written in these other languages to select a big good quality dataset.

Also, there are lots of (poorly) machine translated websites in Polish... So any dataset that contains web crawl will have precisely what I'd prefer not to have.

Ideally I'd like to see either the national government or the EU invest money in the creation of more high-quality datasets in all EU languages.

So when I point out the failures of for example chatgpt in my language I do so while being amazed it can generate and understand Polish at all.

Also regarding multilingual model size being larger. I never heard this before, but it seems logical. I have heard models gain extra performance on English tasks when they are trained on other languages too so there is a benefit to adding multilingual datasets.

cschmidt · 2y ago

The Pile and Red Pajama are primarily English language datasets. If you want something multilingual, I'd suggest having a look at the Bloom dataset https://arxiv.org/abs/2210.14712

Der_Einzige · 2y ago · 3 in thread

I came so close to getting my dataset DebateSum (https://huggingface.co/datasets/Hellisotherpeople/DebateSum) into the pile, but they decided at the last minute not to add it: https://github.com/EleutherAI/the-pile/issues/56

I'm still a tiny bit salty about that, but the pile is a wonderful dataset regardless.

orange_fritter · 2y ago

That dataset looks cool. Good work either way, I'm sure it'll go somewhere

Der_Einzige · 2y ago

Stay tuned! I've got a paper I'm writing about a new followup which is a 40x improvement in size (basically every open source debate card... Ever) and a 40x improvement in metadata and duplication detection. The work has all been done since late April and I've just been lazy/writer-blocked (ironic in a world of high end LLMs) and haven't gotten the paper finished.

Kind of sad to have missed the NeurIPS dataset track deadline and ACL, but I know that anything close to this in scope is a slam-dunk accept at the argument mining workshop.

robmsmt · 2y ago

Would love to see an early version of it!

cschmidt · 2y ago · 1 in thread

If you’re looking at The Pile, you also might consider the Red Pajama dataset. A new cleaned version was released recently https://www.cerebras.net/blog/slimpajama-a-627b-token-cleane...

CamperBob2 · 2y ago

Is there a straightforward way to download that dataset, the way there was for the original RedPajama data? SlimPajama appears to have been released as 60,000 small files, which is ridiculous.

charlysl (OP) · 2y ago · 1 in thread

OP here. I learned about this while reading Stanford's LLM course's "Data" lecture [1]. Very interesting how it assesses the datasets used for GPT 2 and 3, etc, and how The Pile addresses their issues. A very interesting course!

[1] https://stanford-cs324.github.io/winter2022/lectures/data/

pjot · 2y ago

The Pile was also referenced in a post today of some guy's tweets about “leaked” GPT-4 details

https://news.ycombinator.com/item?id=36675934

dang · 2y ago

The Pile: An 800GB Dataset of Diverse Text for Language Modeling - https://news.ycombinator.com/item?id=36272365 - June 2023 (5 comments)

The Pile: An 800GB Dataset of Diverse Text for Language Modeling - https://news.ycombinator.com/item?id=25607809 - Jan 2021 (60 comments)

ryoshiro · 2y ago

Side Topic: In the leaked OpenAI GPT-training details, there is speculation that OpenAI trained on the Libgen dataset. Is there a link to the Libgen dataset, and if so, how big is it?
