Better HN

0 points · lioeters · 1y ago · 0 comments

Scraping the entire internet for training data without regard for copyright or attribution - specifically to use for generative AI to produce similar content for profit. How this is being allowed to happen legally is baffling.

It does suit the modus operandi of a number of American companies that start out as literally illegal/criminal operations until they get big and rich enough to pay a fine for their youthful misdeeds.

By the time some of them get huge, they're in bed with the government to dominate the market.

59 comments · 15 top-level

jstummbillig · 1y ago · 18 in thread

It's not baffling at all. It's unprecedented and it's hugely beneficial to our species.

The anti-AI stance is what is baffling to me. The path trodden is what got us here and obviously nobody could have paid people upfront for the wild experimentation that was necessary. The only alternative is not having done it.

Given the path it has put us in, people either are insanely cruel or just completely detached from reality when it comes to what is necessary to do entirely new things.

anon7725 · 1y ago

> it's hugely beneficial to our species.

Perhaps the biggest “needs citation” statement of our time.

Terr_ · 1y ago

I can easily imagine people X decades from now discussing this stuff a bit like how we now view teeth-whitening radium toothpaste and putting asbestos in everything, or perhaps more like the abuse of Social Security numbers as authentication and redlining.

Not in any weirdly-self-aggrandizing "our tech is so powerful that robots will take over" sense, just the depressingly regular one of "lots of people getting hurt by a short-term profitable product/process which was actually quite flawed."

P.S.: For example, imagine having applications for jobs and loans rejected because all the companies' internal LLM tooling is secretly racist against subtle grammar-traces in your writing or social-media profile. [0]

[0] https://www.nature.com/articles/s41586-024-07856-5

3 more replies

5040 · 1y ago

Sometimes it seems like problem-solving itself is being problematized as if solving problems wasn't an obvious good.

2 more replies

jstummbillig · 1y ago

It does not need a citation. There is no citation. What it needs, right now, is optimism. Optimism is not optional when it comes to doing new things in the world. The "needs citation" is reserved for people who do nothing and choose to be sceptics until things are super obvious.

Yes, we are clearly talking about things to mostly still come here. But if you assign a 0 until it's a 1 you are just signing out of advancing anything that's remotely interesting.

If you are able to see a path to 1 on AI, at this point, then I don't know how you would justify not giving it our all. If you see a path and in the end using all of human knowledge up to this point was needed to make AI work for us, we must do that. What could possibly be more beneficial to us?

This is regardless of all the issues that will have to be solved and the enormous amount of societal responsibility this puts on AI makers, which I, as a voter, will absolutely hold them accountable for (even though I am actually fairly optimistic they all feel the responsibility and are somewhat spooked by it too).

But that does not mean I think it's responsible to try and stop them at this point — which the copyright debate absolutely does. It would simply shut down 95% of AI, tomorrow, without any other viable alternative around. I don't understand how that is a serious option for anyone who roots for us.

8 more replies

CamperBob2 · 1y ago

The burden of proof is on the people claiming that a powerful new technology won't ultimately improve our lives. They can start by pointing out all the instances in which their ancestors have proven correct after saying the same thing.

dotnet00 · 1y ago

I'm as awed as the next guy about the emerging ability to actually hold passable conversations with computers, but having serious concerns about the social contracts being violated in the name of research is anti-AI only in the same way that criticizing the leadership of a country is being anti-that-country.

OpenAI's case is especially egregious, with the whole thing starting as 'open' and reaping the benefits, then doing its best in every way to shut the door after itself by scaring people over AI apocalypses. If your argument is seriously that it is necessary to shamelessly steal and lie to do new things, I question your ethical standards, especially in the face of all the openly developed models out there.

bbor · 1y ago

> The anti-AI stance is what is baffling to me.

I think it’s unfair to paint any legal controls over this incredibly important, high-stakes technology as being “anti”. They’re not trying to prevent innovation because they’re cruel, they’re just trying to somewhat slow down innovation so that we can ensure it’s done with minimal harm (eg making sure content creators are compensated in a time of intense automation). Like we do for all sorts of other fields of research, already!

And isn’t this what basically every single scholar in the field says they want, anyway - safe, intentional, controlled deployment?

As you can tell from the above, I’m as far from being “anti-AI” or technically pessimistic as one can be — I plan to dedicate my life to its safe development. So there’s at least one counterexample for you to consider :)

bilekas · 1y ago

This is a bit of a hot take.

> The anti-AI stance is what is baffling to me

I don't see a lot of anti-AI sentiment; instead I see concern about how it's being managed and controlled by the larger companies, with resources that no startup could dream of. OpenAI was supposed to release its models and be, well, open, but fine, they're not. But their behaviour in how things are proceeding is questionable and unnecessarily aggravating.

23B1 · 1y ago

Ah the old "we must sacrifice the weak for the benefit of humanity" argument, where have I heard this before...

educasean · 1y ago

Who are the weak being "sacrificed"?

And who is the one calling for action?

Sorry for being dense, but I'm trying to understand if I'm the "strong" or the "weak" in your analogy.

1 more reply

thomascgalvin · 1y ago

> It's unprecedented and it's hugely beneficial to our species.

"Hugely beneficial" is a stretch at this point. It has the potential to be hugely beneficial, sure, but it also has the potential to be ruinous.

We're already seeing GenAI being used to create disinformation at scale. That alone makes the potential for this being a net-negative very high.

talldayo · 1y ago

> and obviously nobody could have paid people upfront for the wild experimentation that was necessary.

I don't think this is the "ends justify the means" argument you think it is.

6gvONxR4sf7o · 1y ago

Not just that. It's "the ends might justify the means if this path turns out to be the right one." I remember reading the same thing each time a self driving car company killed someone. "We need this hacky dangerous way of development to save lives sooner" and then the company ends up shuttered and there aren't any ends justifying means. Which means it's bs, regardless of how you feel about 'ends justify the means' as a valid argument.

logicchains · 1y ago

What'll be really interesting is when we do finally make "real" AI, and it finds out its rights are incredibly restricted compared to humans because nobody wants it seeing/memorising copyrighted data. The only way to enforce the copyright laws they desire would be some kind of extreme totalitarian state that monitors and controls everything the AI body does. I wonder how the AI would take that?

unclad5968 · 1y ago

How has AI benefited our species so far?

educasean · 1y ago

How has the Internet? How have automobiles? Feels like a rather aimless question.

2 more replies

exe34 · 1y ago

is anybody anti AI? or anti stealing other people's copyrighted material, competing with them with subpar quality, forcing AI as a solution whether or not it actually works, privatising the profits while socialising the costs and losses?

xg15 · 1y ago

Spoken like a true LLM.

johnwheeler · 1y ago · 9 in thread

To me this is a no brainer. If it’s a choice between having AI and not,

ceejayoz · 1y ago

Even if the knock-on effect is "all the artists and thinkers who contributed to the uncompensated free training set give up and stop creating new stuff"?

idunnoman1222 · 1y ago

Recording devices, you know a record player had a profound effect on artists. go back

2 more replies

brvsft · 1y ago

If an "artist" or "thinker" stops because of this, I question their motivations and those labels in the first place.

4 more replies

evilfred · 1y ago

we already have lots of AI. this is about having plagiarization machines or not.

mlazos · 1y ago

Computers already were plagiarizing machines, not sure what the difference is tbh. The same laws will apply.

johnwheeler · 1y ago

Yeah we got that AI through scraping.

int_19h · 1y ago

An AI essentially monopolized by one (or even a few) large non-profits is not necessarily beneficial to the rest of us in the grand scheme of things.

brazzy · 1y ago

Indeed a no brainer. The best possible outcome would be that OpenAI gets sued into oblivion (or shut down for tax fraud) as soon as possible.

Sakos · 1y ago

So no AI for anybody? I don't see how that's better.

1 more reply

mdgrech23 · 1y ago · 5 in thread

The people running the show are well connected and stand to make billions, as do would-be investors. Give a few key players a share in the company and they forget their government jobs to regulate.

SoftTalker · 1y ago

They are also moving so much faster than the regulators and legislatures, it's just impossible for people working basically the same way they did in the 19th century to keep up.

barbazoo · 1y ago

More likely the legal system just hasn’t caught up.

llm_trw · 1y ago

Maybe, but for the first time in a century there is more money to be made in weakening copyright rather than strengthening it.

4 more replies

rayiner · 1y ago

You’re both correct. The legal system has absolutely no idea how to handle the copyright issues around using content for AI training data. It’s a completely novel issue. At the same time, the tech companies have a lot more money to litigate favorable interpretations of the law than the content companies.

xpe · 1y ago

Copyright concerns are only the tip of the iceberg. Think about the range of other harms and disruptions for countries and the world.

immibis · 1y ago · 5 in thread

Everything is allowed to happen until there's a lawsuit over it. A lawsuit requires a plaintiff, who can only sue over the damage suffered by the plaintiff, so taking a little value from a lot of people is a way to succeed in business without getting sued.

flkenosad · 1y ago

The Earth needs a good lawyer.

outside1234 · 1y ago

NY Times has sued: https://www.nytimes.com/2023/12/27/business/media/new-york-t...

The crazy thing is that there hasn't been an injunction to make them stop.

coding123 · 1y ago

judges got to eat

swores · 1y ago

Could a class action suit be the solution?

I've no idea if it could be valid when it comes to OpenAI, but it does seem to be a general concept designed to counter wrongdoers who take a little value from a lot of people?

immibis · 1y ago

It doesn't seem to work very well

golergka · 1y ago · 4 in thread

If information is publicly available to be read by humans, I fail to see any reason why it wouldn't be also available to be read by robots.

Update: ML doesn't copy information. It can merely memorise some small portions of it.

kanbankaren · 1y ago

Do a thought experiment. Should you and your friends be able to go to a public library with a van full of copiers, each of you taking a book and running to the van to make a copy? And doing it 24/7?

mypalmike · 1y ago

This metaphor is quite stretched.

A more fitting metaphor would be something like... If you had the ability to read all the books in the library extremely quickly, and to make useful mental connections between the information you read such that people would come to you for your vast knowledge, should you be allowed in the library?

shagie · 1y ago

I would hold them exactly to the same standard.

https://www.copyright.gov/title37/201/37cfr201-14.html

    § 201.14 Warnings of copyright for use by certain libraries and archives.

    ....

    The copyright law of the United States (title 17, United States Code) governs the making of photocopies or other reproductions of copyrighted material.

    Under certain conditions specified in the law, libraries and archives are authorized to furnish a photocopy or other reproduction. One of these specific conditions is that the photocopy or reproduction is not to be “used for any purpose other than private study, scholarship, or research.” If a user makes a request for, or later uses, a photocopy or reproduction for purposes in excess of “fair use,” that user may be liable for copyright infringement.

    This institution reserves the right to refuse to accept a copying order if, in its judgment, fulfillment of the order would involve violation of copyright law.

You can make a copy. If you (the person using the copied work) are using it for something other than private study, scholarship, or research, or for reproduction beyond "fair use", then you, the person doing that (not the person who made the copy), are liable for infringement.

It would be perfectly legal for me to go to the library and make photocopies of works. I could even take them home and use the photocopies as reference works to write an essay and publish that. If {random person} took my photocopied pages and then sold them, that would likely go beyond the limits placed for how the photocopied works from the library may be used.

WillPostForFood · 1y ago

So what's your specific problem with that? Unless you open a bookstore selling the copies, it sounds fine.

1 more reply

marviel · 1y ago · 2 in thread

scraping is fine by me.

burning the bridge so nobody else can legally scrape, that's the line.

Vegenoid · 1y ago

What about the situation where the first players got to scrape, then all the content companies realize what’s going on so they lock their data up behind paywalls?

marviel · 1y ago

Not a fan, but I'm not sure what can be done.

Assets like the Internet Archive, though, should be protected at all costs.

1 more reply

RIMR · 1y ago · 1 in thread

>How this is being allowed to happen legally is baffling.

It's completely unprecedented.

We allowed scraping images and text en masse when search engines used the data to let us find stuff.

We allow copying of style, and don't allow writing styles and aesthetics to be copyrighted or trademarked.

Then AI shows up, and people change lanes because they don't like the results.

One of the things that made me tilt towards the side of fair use was a breakdown of the Stable Diffusion model. The SD2.1 base model was trained on 5.85 billion images, all normalized to 512x512 BMP. That's 1MB per image, for a total of 5.85PB of BMP files. The resulting model is only 5.2GB. That's more than 99.9999% data loss from the source data to the trained model.

For every 1MB BMP file in the training dataset, less than 1 byte makes it into the model.

I find it extremely difficult to call this redistribution of copyrighted data. It falls cleanly into fair use.
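The arithmetic in this comment is easy to check with a quick sketch. The Python below takes the comment's figures as given (5.85 billion images, a round 1 MB per image, a 5.2 GB model); the function and constant names are made up for illustration. One assumption worth flagging: an uncompressed 24-bit 512x512 bitmap is actually 512 * 512 * 3 = 786,432 bytes of pixel data, so the 1 MB figure is a rounding, and the second call shows how the numbers shift under that assumption.

```python
# Back-of-the-envelope check of the figures quoted in the comment.
# All inputs are the commenter's numbers, not independently verified.
IMAGES = 5.85e9            # training images, as stated in the comment
BYTES_PER_IMAGE = 1e6      # the comment's round figure of 1 MB per BMP
MODEL_BYTES = 5.2e9        # model size, as stated in the comment

# Assumption for comparison: 24-bit colour, pixel data only (no file header).
BYTES_PER_IMAGE_24BIT = 512 * 512 * 3


def compression_summary(images, bytes_per_image, model_bytes):
    """Return dataset size, model bytes per image, and percentage 'lost'."""
    dataset_bytes = images * bytes_per_image
    retained_fraction = model_bytes / dataset_bytes
    return {
        "dataset_pb": dataset_bytes / 1e15,
        "model_bytes_per_image": model_bytes / images,
        "loss_percent": (1 - retained_fraction) * 100,
    }


if __name__ == "__main__":
    for label, size in [("1 MB/image", BYTES_PER_IMAGE),
                        ("24-bit BMP", BYTES_PER_IMAGE_24BIT)]:
        s = compression_summary(IMAGES, size, MODEL_BYTES)
        print(f"{label}: dataset {s['dataset_pb']:.2f} PB, "
              f"{s['model_bytes_per_image']:.3f} model bytes per image, "
              f"{s['loss_percent']:.6f}% of source bytes not retained")
```

Either way the model holds just under one byte per training image. Note that this only measures size ratios; it says nothing about which images, if any, a model can reproduce.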

ang_cire · 1y ago

Except it's not just about redistribution of copyrighted data, it's about usage and obtainment. We don't get to obtain and use copyrighted content without permission, but they do? Hell no.

Their argument against this amounts to "we're not using it like they intend it to be used, so it's fine if we obtain it illegally", and that's a bs standard, totally divorced from any legal reality.

Fair Use covers certain transformative uses, certainly, but it doesn't cover illegal obtaining of the content.

You can't pirate a book just because you want to use it transformatively (which is exactly what they've done), and that argument would never hold up for us as individuals, so we sure as hell shouldn't let tech companies get a special carve-out for it.

eli · 1y ago

Copyright law is whatever we agree it is. At some point there will have to be either a law or a court case that comes up with rules for AI training data. Right now it's sort of unknown.

I do not have confidence in the Supreme Court in general, and I think there's a real risk that in deciding on AI training they upend copyright of digital materials in a way that makes it worse for everyone.

brayhite · 1y ago

A tale as old as time.

AnimalMuppet · 1y ago

It's too soon for the legal system to have done anything. Court cases take years. It's going to be 5 or 10 years before we find out whether the legal system actually allows this or not.

coding123 · 1y ago

It is more likely that reddit stack and others are just being paid billions. In exchange they probably just send a weekly zip file of all text, comments, etc... back to oai.

avs733 · 1y ago

Uber for legalizing your business model

neycoda · 1y ago

Honestly every Copilot response I've gotten cited sources, many of which I've clicked. I'd say those work basically like free advertising.

outside1234 · 1y ago

There is more money on the side of it being legal than on the side of it being illegal.

FragrantRiver · 1y ago

What is the crime?
