That entity will scrape the internet, train its models, and claim that "it's just research" so that everything counts as fair use.
At this point it's not even funny anymore.
It does suit the modus operandi of a number of American companies that start out as literally illegal/criminal operations until they get big and rich enough to pay a fine for their youthful misdeeds.
By the time some of them get huge, they're in bed with the government to dominate the market.
It's completely unprecedented.
We allowed scraping images and text en masse when search engines used the data to let us find stuff.
We allow copying of style, and don't allow writing styles and aesthetics to be copyrighted or trademarked.
Then AI shows up, and people change lanes because they don't like the results.
One of the things that made me tilt towards the side of fair use was a breakdown of the Stable Diffusion model. The SD2.1 base model was trained on 5.85 billion images, all normalized to 512x512 BMP. That's roughly 1MB per image, for a total of 5.85PB of BMP files. The resulting model is only 5.2GB. That's more than 99.9999% data loss from the source data to the trained model.
For every 1MB BMP file in the training dataset, less than 1 byte makes it into the model.
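The back-of-envelope numbers above are easy to check. This is a minimal sketch using the commenter's own figures (5.85 billion images, ~1MB per normalized BMP, a 5.2GB model), all of which are assumptions from the comment rather than verified facts:

```python
# Commenter's assumed figures (not independently verified):
num_images = 5.85e9        # images in the training set
bytes_per_image = 1e6      # ~1 MB per 512x512 BMP
model_bytes = 5.2e9        # ~5.2 GB model file

dataset_bytes = num_images * bytes_per_image   # ~5.85 PB total
retained_per_image = model_bytes / num_images  # model bytes per source image
loss_pct = (1 - model_bytes / dataset_bytes) * 100

print(f"dataset size:       {dataset_bytes / 1e15:.2f} PB")
print(f"retained per image: {retained_per_image:.2f} bytes")
print(f"data loss:          {loss_pct:.5f}%")
```

Running this gives roughly 0.89 bytes retained per 1MB image and about 99.99991% loss, which supports the "less than 1 byte per image" claim.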
I find it extremely difficult to call this redistribution of copyrighted data. It falls cleanly into fair use.
Their argument against this amounts to "we're not using it the way they intend it to be used, so it's fine if we obtain it illegally", and that's a BS standard, totally divorced from any legal reality.
Fair Use covers certain transformative uses, certainly, but it doesn't cover obtaining the content illegally.
You can't pirate a book just because you want to use it transformatively (which is exactly what they've done), and that argument would never hold up for us as individuals, so we sure as hell shouldn't let tech companies get a special carve-out for it.
Burning the bridge so nobody else can legally scrape: that's the line.
The anti-AI stance is what is baffling to me. The path trodden is what got us here, and obviously nobody could have paid people upfront for the wild experimentation that was necessary. The only alternative is not having done it at all.
Given the path it has put us on, people are either insanely cruel or just completely detached from reality when it comes to what is necessary to do entirely new things.
Perhaps the biggest “needs citation” statement of our time.
OpenAI's case is especially egregious: the entire enterprise started as 'open' and reaped the benefits, then did its best in every way to shut the door behind itself by scaring people with AI apocalypses. If your argument is seriously that it is necessary to shamelessly steal and lie in order to do new things, I question your ethical standards, especially in the face of all the openly developed models out there.
The anti-AI stance is what is baffling to me.
I think it’s unfair to paint any legal controls over this incredibly important, high-stakes technology as being “anti”. They’re not trying to prevent innovation because they’re cruel; they’re just trying to slow innovation down somewhat so that we can ensure it’s done with minimal harm (e.g. making sure content creators are compensated in a time of intense automation). Like we do for all sorts of other fields of research, already! And isn’t this what basically every single scholar in the field says they want, anyway: safe, intentional, controlled deployment?
As you can tell from the above, I’m as far from being “anti-AI” or technically pessimistic as one can be — I plan to dedicate my life to its safe development. So there’s at least one counterexample for you to consider :)
> The anti-AI stance is what is baffling to me
I don't see a lot of anti-AI sentiment; instead I see concern over how it's being managed and controlled by larger companies with resources no startup could dream of. OpenAI was supposed to release its models and be, well, open, but fine, they're not. Still, the way things are proceeding is questionable and unnecessarily aggravating.
"Hugely beneficial" is a stretch at this point. It has the potential to be hugely beneficial, sure, but it also has the potential to be ruinous.
We're already seeing GenAI being used to create disinformation at scale. That alone makes the potential for this being a net-negative very high.
I don't think this is the "ends justify the means" argument you think it is.
I do not have confidence in the Supreme Court in general, and I think there's a real risk that in deciding on AI training they upend copyright of digital materials in a way that makes it worse for everyone.
The crazy thing is that there hasn't been an injunction to make them stop.
I've no idea if it could be valid when it comes to OpenAI, but it does seem to be a general concept designed to counter wrongdoers who take a little value from a lot of people?
Update: ML doesn't copy information. It can merely memorise some small portions of it.
I'm surprised people are surprised.
>> That entity will scrape the internet and train the models and claim that "it's just research" to be able to claim that all is fair-use.
A lot of people and entities do this though... OpenAI is in the spotlight, but scraping everything and selling it is the business model for a lot of companies...
In my eyes, all genAI companies/tools are the same. I dislike all equally, and I use none of them.
That's the business model of lots of companies. Take, collect and collate data, put it in a new format more useful for your field/customers, resell.