As far as I know, this was the first academic contribution to ML to come out of a Discord collaboration. Back then Discord was barely used for ML at all, though nowadays of course the largest Discord server in the world is Midjourney.
There were a bunch of interesting stories from those days. We almost didn’t release at all (or at least the books component) because of fear of copyright backlash. Turns out no one cared, and then suddenly today the world cares a great deal.
As a side note, I’ll be participating in a legal action against Meta for the purpose of making ML models uncopyrightable: https://twitter.com/theshawwn/status/1641804013791215619?s=6.... They DMCA’ed one of my repos distributing LLaMA, so we fought back and challenged the idea that weights can be copyrighted at all. This seems like the best outcome for hackers and individual researchers, for a few reasons. It’s also one of the most ethical outcomes; since ~no one trains on data that they own, they shouldn’t own the resulting model.
One last thing. The Pile would’ve been far less relevant without the wonderful assistance of The Eye, a group of people who archive all kinds of things. They’ve hosted the datasets for years now. And although it seems strange to say that dataset hosting could make or break The Pile, back then there was nobody else willing to host us. https://the-eye.eu/
Besides, it's already pretty unambiguous that weights are not copyrightable: they're the result of a mechanical process. The only original creative input that goes into the weights is the unfathomable amount of content scraped from other sources, none of which was authored by the people who built the models. The objective of gradient descent is simply to minimize loss on the training data.
Facebook doesn't own the LLaMA model weights any more than the Bridgeman Art Library owns the paintings of European masters because it made quality scans of them ( https://en.wikipedia.org/wiki/Bridgeman_Art_Library_v._Corel.... ), or any more than Rural Telephone owns the phone directory ( https://en.wikipedia.org/wiki/Feist_Publications,_Inc.,_v._R.... ).
Trying to make model weights copyrightable is going uphill, and I don't see how you get there without first establishing that these LLMs are unlawful derivatives of countless copyrighted works along the way. Doing so would probably create an immediate monopoly on legally created LLMs for the handful of corporations with quasi-monopoly content hosting services (Facebook, Google, etc.) that can stuff (and/or already have stuffed) licensing into their terms of use.
Do you want a cyberpunk dystopia? I think creating an AI monopoly is how you get a cyberpunk dystopia -- and there are two ways we end up with one: either outright restrictions on private development of AI, like some have been lobbying for, or the extension of copyright so that only a few entities can get access to enough of other people's data at a low enough cost to train models.
I’ll read over your essay and give it some thought. There are a bunch of subtle aspects to consider; I’ve been thinking it over for about four months now and still haven’t covered all the territory yet.
It feels like this may be one of the most important decisions going forward — both from an intellectual property point of view, and an individual rights perspective. E.g. you mention that it’s civil disobedience to share the weights, but it feels like if someone is claiming to do open science (LLaMA), sharing the research materials is the minimum requirement. Plus look how it’s benefited them; they’ve captured most of the open source LLM mindshare. So it seems likely that this will lead to more open source work in the long run, not less.
Feel free to chat! You can DM me on Twitter or email me. I’ve been in the hospital with my wife for 7 weeks, with two to go, so I’ve been a bit less responsive than I usually am.
We will see that anyway. All the code I work on commercially is copyrighted and yet a trade secret. Existence of copyright (with the exception of copyleft, but that's subversion) didn't help software to be open sourced.
IMHO allowing models to be copyrighted is basically 18th century enclosures again.
What do you mean by that? Do you continuously copyright the changes?
Does anybody have a link to a relevant discussion here? I would like to read about the creative process that goes into defining model weights, and how it differs from the mechanical output of running the training algorithm.
Assuming you haven't taken a photo a second for your entire life, I suspect you'll struggle to make something even close to what's available publicly, due to lack of training data.
Ultimately what convinced me to proceed is that there are immense forces pressuring ML models to become SaaS companies. It's very difficult to offer an ML model for extended periods without being a company. E.g. https://6b.eleuther.ai/ is down. Eleuther failing illustrates just how hard it is –– we were all working as hard as we could to design something that would last a long time, and a long time turned out to be two short years. Contrast that with other kinds of hacking (e.g. webdev, gamedev, hardware...) where the end result lasts basically forever.
So if ML models aren't copyrightable, I think it'll hurt companies a lot more than individuals. In fact the goal is the other way around: to protect individuals. All I did was publish Facebook's own GPL download script to github, and it got DMCA'd. If we don't push back on that kind of behavior now, companies will get used to the idea that they control "their" model –– even when their model is anything but theirs.
I think it would be a shame to try to build legislation around the notion of the SaaS melting-pot application of machine learning and in the process destroy all sorts of other use cases.
- Grammarly. Clear value prop, if weights can't be adequately protected then that's a significant headwind against doing processing on the client.
- Adobe Firefly. Could run locally, they understand the technical challenges well, same headwind.
- GitHub Copilot. Same.
Copyright protection is probably not the single deciding issue in their product strategy, but all of those are use cases that would, for most users, be better run locally, since the hardware can support that. They are not going to be run locally because it's too much of a risk. Better to limit distribution and protect the weights as a trade secret.
The most powerful force for openness I see has nothing to do with copyright eligibility and everything to do with companies wanting to showcase their research arms to build brand and support recruiting. That leads me to believe it's probably better for models to be eligible for copyright and considered derivatives of all of the constituent training data. In some ways the better parallel is sampling in the music industry. It'll be interesting to see how this plays out.
I'm afraid to say... the-eye no longer hosts the pile as of today due to legal threats above the likes of DMCA.
Though I believe it's still available via its original torrent and on Academic Torrents:
> https://academictorrents.com/details/0d366035664fdf51cfbe9f7...
From a scientific point of view this is very problematic, because few safeguards exist to guarantee that the dataset has not been tampered with (as there would be if you uploaded it to Zenodo, which provides some guarantee of immutability).
How about trying to upload the Pile to Zenodo? Only half-joking :D
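On the tampering point, one cheap safeguard for a torrent or mirror is to compare each downloaded file against a published checksum. A minimal sketch in Python, assuming the maintainers publish a SHA-256 digest per file; the function names here are made up for illustration and are not part of any Pile tooling:

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"" (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: Path, expected_hex: str) -> bool:
    """Hypothetical helper: True if the file hashes to the published digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

This only proves your copy matches whatever digest was published; it still depends on getting that digest from a source you trust, which is the part a repository like Zenodo takes care of for you.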
I'm the author of [3], where we introduced one of the first natural-language datasets that test LLMs on graduate-level mathematics. Some of the prompts were taken from a copyrighted book, so we thought about excluding them. Having them in the public dataset would be really nice though, hence I'm keen to hear about your experience.
I'd also be keen to hear how your challenge against the DMCA on sharing LLaMA's weights goes?
[1] https://www.theguardian.com/books/2023/jul/05/authors-file-a... [2] https://arxiv.org/abs/2105.05241 [3] https://arxiv.org/abs/2301.13867
Personally, I'm not worried. It would be a damn shame if academics come under fire merely for trying to operate on the cutting edge of science. None of us were trying to make money; we just wanted to make something interesting.
> I'd also be keen to hear how your challenge against the DMCA on sharing LLaMA's weights goes?
Thanks! I think we might be putting up a website for it soon, if only to explain ourselves. In the meantime – I hate this phrase, since I don't want followers – the only way to keep informed is to follow my Twitter, and perhaps keep an eye on my HN comments.
You'll probably hear about it either way though, since it's a groundbreaking case. No one has tested the copyrightability of ML models before.
In my opinion the most ethical outcome would be that they are on the hook for the cumulative cost of the copyright they violated. That way authors would come out ahead instead of having their rights trashed 'because it's too late anyway'.
At some point this stuff will have to get tested by making its way up the appeals stack in the US, and IMO there is only a minuscule chance that will result in Google, MS, and Meta getting slapped with anything more than a token fine (my bet is it won't even be that), let alone paying every person who ever wrote anything that was used in these datasets for copyright violations, which would basically be everyone.
I think there's a strong argument for a Fair Use defense, given the size of the models versus the size of the training sets, as well as the gulf in intended use: an AI model doesn't compete with e.g. a book. Obviously we'll have to see it play out in court to find out.
Boy, they'll be mad once they learn about Hugging Face distributing thousands of LLaMA fine-tunes with full weights.
Their real enemies will be the people who make money off the creative people’s work, e.g. the entire history of recorded music or the current writers strike.
I do not want something like this to happen to generative AI and make it more difficult for the technology to progress and flourish.
I hope there is a lot of text in languages other than English. In my language (Polish), for example, current SOTA models are very deficient. I have wondered why that is, considering companies like (not at all) OpenAI claim to train on large datasets that include my language of interest. It turns out (and I learned this just yesterday) that they used LLM-translated English content, which was then used as training data for other languages. For gpt-3.5, for example, they used Azure Translator, which is itself a transformer model, to generate content. Also, I bet there is a lot of poorly machine-translated content in their supposedly "original" data.
The result? You can use ChatGPT to write you an email of any kind in English and you can copy/paste/send immediately. Try doing that in Polish... It will make sense, but the language will have the wrong tone (too familiar for a business setting), bad words (words that exist, but that no real person would use) and sentence structure that just plainly feels weird. I suspect this is even worse in many other languages.
Optimally, a Polish audience should try to create a Polish trained model.
As it stands now, the most advanced models, like GPT, are multilingual but noticeably less capable in non-English languages.
Furthermore, I think you are exaggerating the memory issue of multilingual models significantly. Especially for languages using the same (Latin) script, the additional characters to care about are very few. Also a significant part of the vocabulary and language fall into a few buckets, so training a joint model makes all the sense in the world - much like an Italian native speaker could likely study a scientific text in Spanish and understand its content, even without speaking the language.
The memory impact comes mostly from having bigger embedding layers that have to account for vocabulary in many languages (the most problematic case being Chinese and Japanese, with their huge set of tokens). But even there, the largest vocabularies in use are maybe of size 100k (vs. about 30k for English-only), with a hidden dimension of 4k that makes for a total of 400M parameters. It's a lot, but a drop in the ocean of 100B+ parameters (or 1T+ for GPT4) we're seeing today.
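To make the arithmetic above concrete, here is a back-of-the-envelope sketch in Python. It assumes the round numbers from this comment (100k vs. 30k vocabulary, hidden dimension 4k, a 100B-parameter model) and counts only a single embedding matrix; a model with a separate, untied output projection would roughly double these figures.

```python
def embedding_params(vocab_size: int, hidden_dim: int) -> int:
    """Parameters in one embedding matrix: one hidden_dim-sized vector per token."""
    return vocab_size * hidden_dim


HIDDEN_DIM = 4_000                # "hidden dimension of 4k" (round number)
TOTAL_PARAMS = 100_000_000_000    # a 100B-parameter model, as in the comment

multilingual = embedding_params(100_000, HIDDEN_DIM)  # large multilingual vocab
english_only = embedding_params(30_000, HIDDEN_DIM)   # English-only vocab

extra = multilingual - english_only   # added cost of the bigger vocabulary
share = multilingual / TOTAL_PARAMS   # fraction of the whole model

print(f"multilingual embeddings: {multilingual:,}")
print(f"english-only embeddings: {english_only:,}")
print(f"extra parameters:        {extra:,}")
print(f"share of a 100B model:   {share:.2%}")
```

So the multilingual vocabulary costs about 280M more parameters than the English-only one, and the full 400M embedding matrix is well under 1% of a 100B model.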
P.S. Replying to GP: I think the Pile is English only, though - or at least, models on HuggingFace trained on the Pile, like the various Pythia models, are tagged as English only.
Also, there are lots of (poorly) machine translated websites in Polish... So any dataset that contains web crawl will have precisely what I'd prefer not to have.
Ideally I'd like to see either the national government or the EU invest money in creating more high-quality datasets in all EU languages.
So when I point out the failures of for example chatgpt in my language I do so while being amazed it can generate and understand Polish at all.
Also, regarding multilingual models being larger: I had never heard this before, but it seems logical. I have heard that models gain extra performance on English tasks when they are trained on other languages too, so there is a benefit to adding multilingual datasets.
I'm still a tiny bit salty about that, but the pile is a wonderful dataset regardless.
Kind of sad to have missed the NeurIPS dataset track deadline and ACL, but I know that anything close to this in scope is a slam-dunk accept at the argument mining workshop.
[1] https://stanford-cs324.github.io/winter2022/lectures/data/
The Pile: An 800GB Dataset of Diverse Text for Language Modeling - https://news.ycombinator.com/item?id=36272365 - June 2023 (5 comments)
The Pile: An 800GB Dataset of Diverse Text for Language Modeling - https://news.ycombinator.com/item?id=25607809 - Jan 2021 (60 comments)