That entity will scrape the internet, train its models, and claim that "it's just research" so that everything counts as fair use.
At this point it's not even funny anymore.
It does suit the modus operandi of a number of American companies that start out as literally illegal/criminal operations until they get big and rich enough to pay a fine for their youthful misdeeds.
By the time some of them get huge, they're in bed with the government to dominate the market.
It's completely unprecedented.
We allowed scraping images and text en masse when search engines used the data to let us find stuff.
We allow copying of style, and don't allow writing styles and aesthetics to be copyrighted or trademarked.
Then AI shows up, and people change lanes because they don't like the results.
One of the things that made me tilt towards the side of fair use was a breakdown of the Stable Diffusion model. The SD2.1 base model was trained on 5.85 billion images, all normalized to 512x512 BMP. That's roughly 1MB per image, for a total of 5.85PB of BMP files. The resulting model is only 5.2GB. That's more than 99.9999% data loss from the source data to the trained model.
For every 1MB BMP file in the training dataset, less than 1 byte makes it into the model.
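The back-of-envelope numbers above are easy to check. This is a minimal sketch using the commenter's own figures (5.85 billion images, ~1MB per normalized BMP, a 5.2GB model), all of which are assumptions from the comment rather than verified facts:

```python
# Commenter's assumed figures (not independently verified):
num_images = 5.85e9        # images in the training set
bytes_per_image = 1e6      # ~1 MB per 512x512 BMP
model_bytes = 5.2e9        # ~5.2 GB model file

dataset_bytes = num_images * bytes_per_image   # ~5.85 PB total
retained_per_image = model_bytes / num_images  # model bytes per source image
loss_pct = (1 - model_bytes / dataset_bytes) * 100

print(f"dataset size:       {dataset_bytes / 1e15:.2f} PB")
print(f"retained per image: {retained_per_image:.2f} bytes")
print(f"data loss:          {loss_pct:.5f}%")
```

Running this gives roughly 0.89 bytes retained per 1MB image and about 99.99991% loss, which supports the "less than 1 byte per image" claim.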
I find it extremely difficult to call this redistribution of copyrighted data. It falls cleanly into fair use.
Their argument against this amounts to "we're not using it the way they intend it to be used, so it's fine if we obtain it illegally", and that's a BS standard, totally divorced from any legal reality.
Fair Use covers certain transformative uses, certainly, but it doesn't cover obtaining the content illegally.
You can't pirate a book just because you want to use it transformatively (which is exactly what they've done), and that argument would never hold up for us as individuals, so we sure as hell shouldn't let tech companies get a special carve-out for it.
Burning the bridge so nobody else can legally scrape: that's the line.
The anti-AI stance is what is baffling to me. The path trodden is what got us here, and obviously nobody could have paid people upfront for the wild experimentation that was necessary. The only alternative is not having done it at all.
Given the path it has put us on, people are either insanely cruel or just completely detached from reality when it comes to what is necessary to do entirely new things.
Perhaps the biggest “needs citation” statement of our time.
OpenAI's case is especially egregious: the entire enterprise started as 'open' and reaped the benefits, then did its best in every way to shut the door behind itself by scaring people with AI apocalypses. If your argument is seriously that it is necessary to shamelessly steal and lie in order to do new things, I question your ethical standards, especially in the face of all the openly developed models out there.
The anti-AI stance is what is baffling to me.
I think it’s unfair to paint any legal controls over this incredibly important, high-stakes technology as being “anti”. They’re not trying to prevent innovation because they’re cruel; they’re just trying to slow innovation down somewhat so that we can ensure it’s done with minimal harm (e.g. making sure content creators are compensated in a time of intense automation). Like we do for all sorts of other fields of research, already! And isn’t this what basically every single scholar in the field says they want, anyway: safe, intentional, controlled deployment?
As you can tell from the above, I’m as far from being “anti-AI” or technically pessimistic as one can be — I plan to dedicate my life to its safe development. So there’s at least one counterexample for you to consider :)
> The anti-AI stance is what is baffling to me
I don't see a lot of anti-AI sentiment; instead I see concern over how it's being managed and controlled by larger companies with resources no startup could dream of. OpenAI was supposed to release its models and be, well, open, but fine, they're not. Still, the way things are proceeding is questionable and unnecessarily aggravating.
"Hugely beneficial" is a stretch at this point. It has the potential to be hugely beneficial, sure, but it also has the potential to be ruinous.
We're already seeing GenAI being used to create disinformation at scale. That alone makes the potential for this being a net-negative very high.
I don't think this is the "ends justify the means" argument you think it is.
I do not have confidence in the Supreme Court in general, and I think there's a real risk that in deciding on AI training they upend copyright of digital materials in a way that makes it worse for everyone.
The crazy thing is that there hasn't been an injunction to make them stop.
I've no idea if it could be valid when it comes to OpenAI, but it does seem to be a general concept designed to counter wrongdoers who take a little value from a lot of people?
Update: ML doesn't copy information. It can merely memorise some small portions of it.
I'm surprised people are surprised.
>> That entity will scrape the internet and train the models and claim that "it's just research" to be able to claim that all is fair-use.
A lot of people and entities do this though... OpenAI is in the spotlight, but scraping everything and selling it is the business model for a lot of companies...
In my eyes, all genAI companies/tools are the same. I dislike all equally, and I use none of them.
That's the business model of lots of companies. Take, collect and collate data, put it in a new format more useful for your field/customers, resell.