The complaint lays out in steps why the plaintiffs believe the datasets have illicit origins — in a Meta paper detailing LLaMA, the company points to sources for its training datasets, one of which is called ThePile, which was assembled by a company called EleutherAI. ThePile, the complaint points out, was described in an EleutherAI paper as being put together from “a copy of the contents of the Bibliotik private tracker.” Bibliotik and the other “shadow libraries” listed, says the lawsuit, are “flagrantly illegal.”
IANAL, but this basically sounds like LLaMA was trained on illegally obtained books, by Meta's own admission. It's an exciting development that Meta is releasing a commercial-use version of the model, but I wonder if this is going to cause issues down the road. It's not like Meta can remove these books from the training set without retraining from scratch (or at least from the last checkpoint before they were used).
And they will have much better knowledge, answers, etc. than the Western, lawyer-approved models.
Sometimes knowledge needs to be set free I guess.
At this point, with the quality of current web content and the collapse of journalism as an industry, I think we can say online ads have utterly failed as a replacement income stream.
Unless you want every LLM to say “I’m sorry, the data I was trained on ends in 2023,” you still need a content funding model. Maybe not copyright, but sure as hell not ads either.
Since the company is obtaining + providing these models with 100% of their input data, it could be argued they have some responsibility to verify the legality of their procurement of the data.
It's in a weird place IMO. With Japan ruling that anything goes for AI training data, other countries are put under pressure to allow the same.
i.e.,
you're allowed to scrape the web
you're allowed to take what you scrape and put it in a database
you're allowed to use your database to inform on decisions you might make, or content you might create
but once you put an AI model in the mix, all of a sudden there's a problem. Despite the fact that making the model is 10000% harder than doing everything mentioned above, using someone else's work somehow becomes a problem when it never was before
and if truly free and open source LLMs come into the game, might the corporate ones become crippled by copyright? that's bad for business
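The scrape → database → decisions pipeline described above is trivial to sketch. A minimal stdlib-only Python illustration (the URL and HTML snippet are made up for the example):

```python
import sqlite3
from html.parser import HTMLParser

# Step 1: "scrape" -- extract text from a (hypothetical) fetched page.
class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

page = "<html><body><h1>LLaMA</h1><p>LLaMA is a language model.</p></body></html>"
parser = TextExtractor()
parser.feed(page)

# Step 2: put what you scraped in a database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE scraped (url TEXT, text TEXT)")
for chunk in parser.chunks:
    db.execute("INSERT INTO scraped VALUES (?, ?)", ("https://example.com", chunk))

# Step 3: use the database to inform a decision or to create content.
hits = db.execute(
    "SELECT text FROM scraped WHERE text LIKE ?", ("%language model%",)
).fetchall()
print(hits[0][0])
```

Each of those steps is routine and, as the comment notes, widely considered fine on its own; the legal question only appears once a trained model replaces the SQL query at the end.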
They probably can:
https://github.com/zjunlp/EasyEdit
> I wonder if this is going to cause issues down the road.
There are some popular Stable Diffusion models, being run in small businesses, that I am certain have CSAM in them because they have a particular 4chan model in their merging lineage.
... And yet, it hasn't blown up yet? I have no explanation, but running "illegal" weights seems more sustainable than I would expect.
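For context on what a "merging lineage" means mechanically: the common Stable Diffusion merge is just an elementwise weighted average of two checkpoints, so whatever either parent learned is baked into the child with no step that filters it out. A toy sketch with plain Python floats standing in for tensors (the parameter names and values are invented):

```python
# Each "checkpoint" maps parameter name -> value.
# Real checkpoints map names to large tensors; floats keep the sketch tiny.
parent_a = {"unet.w": 0.8, "unet.b": -0.2}
parent_b = {"unet.w": 0.4, "unet.b": 0.6}

def merge(a, b, alpha):
    """Standard checkpoint merge: elementwise weighted average."""
    return {k: alpha * a[k] + (1 - alpha) * b[k] for k in a}

child = merge(parent_a, parent_b, alpha=0.5)
# Every parameter of the child depends on both parents, so a merged
# model inherits the influence of everything its ancestors trained on.
```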
No, actually they probably can't. There is no verifiable way to remove the data from the model, apart from completely removing all instances of the information from the training data and retraining. The project you linked only describes a selective finetuning approach.
Virtually every discussion in the LLM space right now is almost immediately bifurcated by the "can I use this commercially?" question which has a somewhat chilling effect on innovation. The best performing open source LLMs we have today are llama-based, particularly the WizardLM variants, so giving them more actual industry exposure will hopefully be a force multiplier.
In your scenario, despite the unrealistic coding process, the machine code is the source code, because that's what everyone is working on.
In the development of LLMs, the weights are in no way the preferred form for making modifications. Programmers don't work on weights. They work on the data, the infrastructure, the model architecture, the training code, etc. The whole point of machine learning is not to work on the weights directly.
Unless you anthropomorphize optimizers, in which case the weights are indeed the preferred form of editing, but I have never seen anyone, even the most forward AGI supporters, argue that optimizers are intelligent agents.
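To make that division of labor concrete: the programmer authors the data, the model form, and the training loop; the weights are what the optimizer emits. A toy sketch in pure Python, fitting y = 2x by gradient descent (the data and learning rate are invented for illustration):

```python
# What the programmer actually writes: data, model form, training code.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # samples of y = 2x

w = 0.0   # the weight: nobody hand-edits this value
lr = 0.02

for _ in range(500):  # training loop
    # gradient of mean squared error for the model y_hat = w * x
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad    # the optimizer updates the weight

# w converges to ~2.0, but it was never the "preferred form" anyone
# worked on -- it's the build output of the code and data above.
print(round(w, 3))
```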
Content is a complement to a social network: the cheaper it is to create content, the more content is available, the easier it is to optimize a feed, the larger the time people spend in the platform, the higher the revenue. GenAI is just a method to drive the cost of content creation to zero.
https://www.joelonsoftware.com/2002/06/12/strategy-letter-v/
From the FT article: '“The goal is to diminish the current dominance of OpenAI,” said one person with knowledge of high-level strategy at Meta.'
This is not charity, this is a shrewd business move.
My guess is still the latter because that's what I've heard the rumors about, but this article is pretty unclear on this fact.
I don't think any business would run such a "licensed" model over MPT-30B or Falcon-40B, unless it's way better than LLaMA 65B.
How can I play with open source LLMs locally?
You can leverage those big CPUs while still loading both GPUs with a 65B model.
... If you are feeling extra nice, you should set that up as an AI Horde worker whenever you run koboldcpp to play with models. It will run API requests for others in the background whenever it's not crunching your own requests, in return giving you priority access to models other hosts are running: https://aihorde.net/
It works for 7B/13B/30B/65B LLaMA and Alpaca (fine-tuned LLaMA which definitely works better). The smaller models at least should run on pretty much any computer.
Also, it has no "1-click" .exe release like Kobold does.
I originally had two 2080 Tis, partly to experiment with VFIO/Proxmox GPU passthrough (you need one GPU for the host and one for any VM you run). I never got that running successfully at the time, but then Proton got really good, which made it moot (I mainly just wanted to run Windows games fast in a VM). Later on I upgraded one of them to a 3080 Ti.
It's a System76 machine; they make good stuff.
Well, now there is a commercial release. I guess it wasn't some corporate plot after all!
Some people just can't admit when a corporation does a good thing.
(In this case, the good thing is being done to obsolete their competitors, but it is good nonetheless that a commercial LLM is available for people to use for free.)
Still waiting for the 'Meta is dying' and 'Fire Mark Zuckerberg' calls from last year. A year later, where are they now?
Does it mean that any blog posts I wrote from my own insights will automatically be used to train the model… without my permission?
As an author, it feels like it’s stealing the knowledge and insight without appropriate attribution.
It seems like the existing large platforms of today—Microsoft’s enterprise moat, Google’s ads and internet services, Meta’s social networks, Apple’s consumer and mobile products—will remain the primary platforms of the future. So having models that can operate exclusively on those platforms, via integration with their key products and data, will only continue this trend. If you’re an outsider with an AI model, you’ll have a harder time getting access to critical data, and your standalone AI product (e.g., ChatGPT) won’t be as useful.
More broadly speaking, I believe the days when the top X largest companies in the stock market would be displaced by newer companies every decade or so are over. The FAANGs just control so many major platforms in so many aspects of our lives.
It also helps that they buy or otherwise cooperate to destroy their competition in questionable ways while heavily lobbying the gov to favor them over others in a quid-pro-quo that benefits politicians and not their constituents.
I disagree: I think big tech is hard to disrupt ATM because the companies are still young and nimble. In the last cycle, the companies being displaced were ancient (by tech standards). When Google and Facebook are 30 years old, their DNA will get in the way of adapting to a new paradigm that will change the world. A paradigm that may be to the Metaverse what the smartphone was to the Apple Newton.
Maybe that's Meta's play here? Maybe the idea is that the ecosystem around a model could be as valuable or more valuable than the model itself too, so an OSS model could benefit Meta a lot more by gaining more of the ecosystem mind share?
Or maybe Yann LeCun is just a hippie who dreams of free love, hard drugs, and open-source models?
They might have done well to make gg an offer he couldn't refuse and take on ggml and llama.cpp as an open source project.
Facebook benefits heavily from the open source development done on LLaMA. There was a report I saw that Facebook has started using llama.cpp internally for inference. Updates to the licensing will cement Facebook as the go-to choice for open source language models.
My hypothesis, based on the context of Mark discussing the release, is that it's going to be completely open source and licensed for commercial use, not that Meta is going to add a whole new revenue side of the business to compete with OpenAI. I.e., "Here is the model, with commercially permissive licensing," not "Here is the model that you can use commercially but must pay me."
https://www.youtube.com/watch?v=Ff4fRgnuFgQ&ab_channel=LexFr...
They can even write it down as 'goodwill' on their financial statements.
It kind of is working.
It seems they will release the weights under some license that allows commercial usage.
How they monetise it (which I assume they will try to do?) is an interesting question.
Maybe some variant of paying a licencing fee?
There doesn't necessarily have to be one. Facebook's goal may be to help commoditize its complements. https://gwern.net/complement
hardware is the only moat
If you want to live the good life before you are exquisitely extinguished, spend every other day figuring out how to buy more NVDA, the other days exercising outside, being human.
QLoRA is the most cost-effective method so far. Some people also do finetuning on Google TPUs.
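For a rough sense of why LoRA-style finetuning (QLoRA is LoRA on top of a 4-bit-quantized base model) is so cheap: instead of updating a full d×d weight matrix W, you train a low-rank pair B (d×r) and A (r×d) and use W + B·A. A back-of-the-envelope parameter count in Python (d = 4096 is a hypothetical hidden size, r = 8 a typical rank):

```python
d = 4096   # hidden size (hypothetical, roughly LLaMA-scale per layer)
r = 8      # LoRA rank

full_params = d * d            # trainable params for full finetuning of one matrix
lora_params = r * d + d * r    # A is (r x d), B is (d x r)

print(full_params)                       # 16777216
print(lora_params)                       # 65536
print(round(full_params / lora_params))  # 256x fewer trainable parameters
```

With far fewer trainable parameters (and the frozen base held in 4-bit), the optimizer state and gradients shrink accordingly, which is what lets people finetune 33B/65B models on a single consumer or prosumer GPU.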
Open-source commercial?
Free as in beer vs. free as in speech, and the whole thing.
If you go by the definition the Open Source Initiative would have applied to the term "open source," had they succeeded in acquiring rights to it, then "commercial" is redundant with open source, not the opposite of it.