Brazil data regulator bans Meta from mining data to train AI models (opens in new tab)

(apnews.com)

139 pointsemersonrsantos1y ago48 comments

48 comments

This proposal gets made pretty frequently in one form or another, and (at least on HN) seems to usually get struck down on this or that procedural ground.

But as the various regulatory and judicial and legislative processes grind through different parts of the modern intellectual property issue made so abundantly legible by the modern AI training data gold rush it seems ever more clear that one way or another, we’re going to get a new social contract on IP.

Leaving aside for a moment the thicket of laws, precedents, jurisdictions, and regulatory inertia: we can vote with our feet as both customers and contributors for common sense now.

So how about the following compromise: promote innovation by liberalizing the posture around training on roughly “the commons”, but insist that the resulting weights are likewise available to the public. Why do I have to take someone’s word for it that they’ve got a result around superposition or whatever on mech interp? I’d like to see it work given it’s everyone’s data pushing those weights.

I speak only for myself but plenty of people seem to agree: I don’t mind big companies training on generally available data, I mind the IP-laundering. Compete on cost, compete on value-added software stacks, compete on vertical integration. There is lots of money to be made building a better mousetrap in terms of code and infrastructure and product innovation.

Conduct the research in the open. None of this would be possible without an ocean of research and data subsidized in whole or in part by the public. Asserting any form of ownership over the result might end up being legal, but it will never be ethical.

Meta isn’t perfect on this stuff, but they’re by far the actor pulling the conversation in that direction. Let’s encourage them to continue pushing the pace on stuff like LLaMA 3.

luqtas1y ago

> But as the various regulatory and judicial and legislative processes grind through different parts of the modern intellectual property issue made so abundantly legible by the modern AI training data gold rush it seems ever more clear that one way or another, we’re going to get a new social contract on IP.

sitharus1y ago

Yes, that's correct. Under the Berne Convention all copyright for a work and any derivatives is held with the author, unless the explicitly disclaim it or another legal provision applies (eg fair use for teaching or parody).

However, does an LLM count as a derivative work or a transformative one? That's something for the lawyers to answer.

catlikesshrimp1y ago

>> derivative work or a transformative one?

This isn't solved even for humans. There are trials that clear misunderstandings about fair use. (Every developer here has heard of this one: https://en.m.wikipedia.org/wiki/Google_LLC_v._Oracle_America....)

Artificial Intelligence currently has no concept of responsibility (not legal, not ethical), and it will never have existential threats derived from law. The only way that I can think of, as of right now, is that every single product touched by AI must have a human who is legally responsible for it.

rectang1y ago

This has an easy answer — it’s just not the one that people who desperately want to use LLMs for copyright washing want to hear.

The test for what constitutes a derivative work has not changed; it’s the same whether a single human author produced something, or a team of humans, or an LLM. It will be up to a court to decide whether a work is similar enough to be considered derivative.

If an LLM spits out a verbatim copy, that’s obviously infringement. But if the LLM spits out something similar? Well if the LLM spits out something like George Harrison’s My Sweet Lord [1], a court may well decide that it’s derivative of He’s So Fine. Especially if the LLM “subconsciously” “knew” about He’s So Fine because it was part of the training corpus.

[1] https://en.wikipedia.org/wiki/My_Sweet_Lord#Copyright_infrin...

luqtas1y ago

what are the opinions on [0]? what's the scene for language rather than image?

also what are the opinions on turning generative AI (that doesn't ask permission to creators) public domain? donation money that surpasses the cost of hosting the work to people/groups "creating" with AI, should be a violation of the license? are you allowed to play with the models in hardware made by for-profit entities, like Nvidia?

[0] https://arxiv.org/abs/2212.03860

stoniejohnson1y ago

obligatory IANAL, but seeing LLMs:

- regurgitate entire passages word for word, until that behavior is publicized and quickly RLHF'd away

- rip github repos almost entirely (some new Sonnet 3.5 demos Anthropic employees were bragging about on Twitter were basically 1:1 to a person's public repo)

It seems clear to me that not only can copyrighted work be retained and returned in near entirety by the architectures that undergird current frontier models, but the engineers working on these models will readily confuse a model regurgitating work to be "creating novel work".

Osmose1y ago

LLMs are not a general public benefit. Artists whose work is trained upon by text-to-image models aren't made any more whole just because Meta has to share its weights—it just means it's even cheaper for the folks impersonating them or effortlessly ripping off their style to keep doing so.

Meta really does not need to be subsidized when they have so many resources at hand—if LLMs are really hard to train without that much data, then perhaps that's a flaw with the approach instead of something the world has to accommodate.

AlienRobot1y ago

> I don’t mind big companies training on generally available data, I mind the IP-laundering

large platforms saw AI and instantly closed their platforms making it hard or impossible for external actors to mine that "generally available data," hurting their own users and the open web in the process, and then they mined the data themselves.

Art96811y ago

As long as a "user" can access those platforms then that data can and will be mined. The people working on such solutions just dont publish them publicly until the debate is settled. If I can view information on any website, authenticated or not, then I can build a bot that will do the same. This does not mean its being done for nefarious reasons. Simply automating the process of bringing what I consider valuable information to me is enough motivation to do it. In my case, the only profit I make is saving time not manually clicking around to access the data I read over morning coffee.

The internet routes around censorship. Its impossible to hide information as long as its meant to be accessed by a human. If companies want to spend engineering hours putting locks then thats their waste.

Many businesses will fail by wasting time and money creating locks that can and will be circumvented.

I agree that a new social contract is inevitable because the only way to prevent data from being mined is to not produce it to begin with. Period. This I know.

AlienRobot1y ago

When I was young I used to upload fan art of Naruto to DeviantArt. My badly scanned drawings sucked. Everyone else's sucked. It was cool.

Today DeviantArt has its own AI which it promotes over their own users' work. I've read some threads by artists discussing where to go next, between DA, Instagram, ArtStation, and several other new and likely not much better platforms, and one comment that struck me was someone saying it was just not worth it, and their time was better spent networking offline at a gallery.

AI art might actually kill online art communities.

AI-generated articles might kill online publishing.

AI-generated spam bots might kill social media.

We've taken the Internet for granted as grandma and grandpa joined it. Tomorrow people may just get sick of all these algorithms, let go of their smartphones, and go touch some grass. Then every website is just going to be AI bots regurgitating each others' content ad nauseum.

Humans are on the web because of the reach. If AI-generated content steals all the reach, why would anyone post anything on publicly accessible venues instead of just using private ones?

"A new social contract is inevitable because the only way to prevent data from being mined is to not produce it to begin with." But is it though? You are assuming that "not produce it to begin with" is impossible. I'm afraid it's not impossible and the web experiment is at real danger. Maybe not immediately, but will it survive another 20 years in this environment?

tensor1y ago

If I post something publically I'm fine with everyone being able to use it for AI and other data mining. But I'm not ok for a single company only to benefit. And definitely not to sell my public data. I'm looking at you reddit.

1 more reply

diego_sandoval1y ago

> training on roughly “the commons”,

The proposal in the article, however, is not about "the commons", it's about content that the users themselves produced, and then they voluntarily gave permission to Meta to use.

Or are you saying that if I produce some type of material, I shouldn't be able to license it for someone else to use it freely?

jononor1y ago

Voluntarily gave permission? Meta never asked anyone for permission to use the AI as training data! They just opted in every single user, via an update of their privacy policy. In order to opt out you have to discover that this is happening, find the help page with the protest form, write some prose about why this negatively affects you, and hope they acknowledge this. I have done this, and my request has not even been answered.

hansvm1y ago

> and then they voluntarily gave permission to Meta to use

That's a funny way to describe DNT headers, disallowed Meta cookies, DNS blocking all their domains, and maintaining copyright over my content.

ronsor1y ago

Reforming the social contract around IP became infinitely difficult the moment normal people started calling it "intellectual property", forming a tangled mess of legal, moral, and ethical ideas in most people's minds.

Your compromise is exactly the situation I desire but seems untenable to most people.

benatkin1y ago

> I speak only for myself but plenty of people seem to agree: I don’t mind big companies training on generally available data, I mind the IP-laundering.

This removes the big scary emotional part of the debate. Without this, it's weakened quite a bit.

__loam1y ago

A lot of that "commons" was the work of people who gave absolutely no permission for their data to be used like this, that is, creating a machine intended to compete with them. My hope is we collectively tell these freeloading ass holes to pound sand. Public does not mean limitless license.

golergka1y ago

> So how about the following compromise: promote innovation by liberalizing the posture around training on roughly “the commons”, but insist that the resulting weights are likewise available to the public.

How much would you personally invest in a startup which would spend billions of dollars on a compute cluster only to release the weights publicly after the training is complete?

jampekka1y ago

I'm gladly investing portion of my wages to such efforts, alongside e.g. education, healthcare, childcare and infrastructure. And I don't even want any monetary ROI from my investment!

golergka1y ago

So your plan is to regulate it into complete commercial unviability, where the only source of funding were government bureaucrats? How often does this strategy pay off?

1 more reply

mistrial91y ago

think of the Investors !

golergka1y ago

I really appreciate that retired teachers and miners have something to live on, yes.

__loam1y ago

He's making a good point. 100m was invested in stability. They are out of money with no clear path to product market fit after giving their model away for free. I'm personally ecstatic that they failed but it is worth asking about the value this technology is producing compared to the cost.

Cheer21711y ago

> Compliance must be demonstrated by the company within five working days from the notification of the decision, and the agency established a daily fine of 50,000 reais ($8,820) for failure to do so.

$8,820 * 365 = $3.2 million a year is pretty cheap for Meta to be able to do whatever they want with all the data from all 200 million Brazilians. Their annual net income is $39.10 billion, so 0.008%.

alganet1y ago

This is the heads up fine _just_ to update the privacy policy. It's a mere warning.

The fine for each privacy infraction is 2% of the company's last year earnings, limited to 50 million BRL (~9 million USD). If 500 brazillians had their privacy violated by a platform, that platform needs to pay 500 of these fines once per day until it is fixed. There's also all sorts of extra punishments for not fixing it in time (like mandatory suspension of services).

Facebook is not forbidden to use your data for AI. It can do so, as long as it provide means for you to delete it. A button to clean your data, for example. That would be legal. We know for LLMs is not that easy though.

Waterluvian1y ago

It’s not a Blockbuster Video. They’ll eventually increase the penalty for noncompliance or escalate the kind of punishment.

LoganDark1y ago

Are there any cases of this actually happening?

diegoholiveira1y ago

with half the fine, Meta buys the entire Brazilian supreme court and suddenly: no fines, no jail and everyone will be happy.

Nition1y ago

I hereby also ban Facebook from accessing my data to train AI. I'm sure it would be a big hassle to carefully exclude me from all their models, but I'm generously offering an alternative non-compliance fine of only $50/day, paid to my bank account.

vitorgrs1y ago

Pretty sure that's per user (usually the other fines they already did, were like that)

tiahura1y ago

Information wants to be free. The ethos of the open web - the levy hacker ethos, has always been about unrestricted access and fair use. When content is published openly online, it inherently invites broad consumption, reproduction, and creative reuse by the public. This principle is deeply rooted in the fair use doctrine as applied to the digital realm.

Fair use is evaluated based on the purpose of use, the nature of the copyrighted work, the amount used, and the effect on the market. These factors generally favor the free use of openly published web content. The transformative nature of many reuses, the public availability of original works, the necessity of using entire works in some cases, and the absence of a traditional market for such content all support this interpretation.

This longstanding practice has driven unprecedented innovation and information dissemination, establishing a social contract between content creators and users that treats open web content as "freeware." Any move to impose strict copyright limitations now would stifle innovation and contradict decades of established legal precedent and digital norms.

delichon1y ago

So we're in for an eternal cat and mouse game where AIs attempt to learn all the facts available and obscure their provenance as needed to evade restrictions, and IP owners attempt to prove that AIs know too much and therefore owe them money.

1 more reply

nostromo1y ago

What problem does this solve?

The article only mentions that data could be used to train AI to make CSAM... which seems needlessly alarmist and inflammatory.

cassianoleal1y ago

I don't see any mention of CSAM in the article. OTOH, this is how the third paragraph reads verbatim:

> The decision stems from “the imminent risk of serious and irreparable or difficult-to-repair damage to the fundamental rights of the affected data subjects,” the agency said in the nation’s official gazette.

throwaway9571y ago

(moderator: please, don’t delete this comment again, everybody is commenting without knowing what Meta did)

Meta spared no expenses to hide the opt-out page. The agency says that: “there were excessive and unjustified obstacles to accessing information and exercising this right”. This was one of the main reasons that obligated the agency to act.

The steps to get to the hidden opt-out page are bellow, obligating users to read the privacy policy to find a link buried deep down in the text, and requiring 2FA by email to opt out even for already logged in users - they should require 2FA to log in, not to opt out of AI training. There is no justification to require all this:

* Access your profile and go to the settings section, signaled by three bars in the top right corner

* Click on "about" at the bottom of the page

* Select the privacy policy. On this new page, the three bars in the top right corner lead to the privacy center

* Click on the arrow next to other policies and articles and select the option "How Meta uses information for generative AI features and models"

* In the nineteenth paragraph, not counting topics, is the "right to object" option. Click on it.

* Fill in and send the form. Meta confirms your identity with a numerical code sent to the email address registered on your account. Then just wait for the opt-out to be confirmed. This can take a few minutes.

yallpendantools1y ago

Honestly, I'm rather frustrated by the HN discourse on this topic.

TFA (with emphasis added):

> Brazil’s national *data protection* authority determined on Tuesday that Meta, the parent company of Instagram and Facebook, cannot use data originating in the country to train its artificial intelligence.

https://www.theregister.com/2024/06/14/meta_eu_privacy/ (with emphasis added):

> The decision to halt AI training using EU content follows complaints to *data protection* agencies in 11 European countries – and those agencies, led by Ireland, telling the Facebook giant to scrap the slurp.

While there is no shortage of IP, licensing, and copyright moral quandaries in training LLMs and their ilk, Meta/FB is not getting regulated on those grounds! They are getting regulated on privacy issues. It's even there on The Register path.

I'm seeing a lot of comments in these threads about IP, copyright, and licensing---which, please do take note, are well-defined legal terms and are not to be used interchangeably---but all that is irrelevant because that is not the question Meta is being made to answer for.

Even more frustrating are threads/arguments to what "irrevocable (copy)rights" you give FB per their TOS without even bothering to cite the relevant bits of the TOS to prove their point. Exercise to the reader: prove/disprove that [a] FB users retain copyright of their content even when posted to FB and [b] you are merely licensing FB to specific (not universal!) uses of your content posted in their platform and [c] said license is revocable any time. The astute reader is referred to the Berne Convention but Facebook's TOS will also do just fine. Standard question, one point per answer.

Bonus point question: if you have proven the points above, what action allows you to revoke the license you have granted FB?

(Of course, end of the day, I'm again playing lawyer in an online forum. I'm no better than anyone else here what do I know.)

fcanesin1y ago

Meta updated its Privacy Policy on June 26, to include in its rights the use of data collected in "Meta Products" for the training of GenAI models. This goes against the interpretation of the local data protection law and as such this note was emitted.

The policy update seems to be global: https://www.facebook.com/privacy/policy

tensor1y ago

Personally I'd rather they force these companies to provide a reasonably priced alternative to ads. I want to have the option to allow companies to use my data for AI. I think AI can be a net boon to society if managed right.

But ads are net negative and I'd argue that the influence of ads and paid actors on social media has been the single most destabilizing force in the world recently.

cmpyl01y ago

Link to the decision: https://www.gov.br/anpd/pt-br/assuntos/noticias/anpd-determi... in brazilian portuguese.

It is only for Meta, but I think it is because it caught the regulator's attention. The ban is due to lack of legal basis for the change in Meta's privacy policy regarding LGPD (brazilian privacy law - https://www.planalto.gov.br/ccivil_03/_ato2015-2018/2018/lei...)

CSMastermind1y ago

Meta specifically or all social networks? If it's just Meta I'm curious of why.

29athrowaway1y ago

Well, joke's on you Brazil. I doubt Meta can mine any more data than they already have.

They already have all the names, pictures, face biometrics, social graph, location information, political affiliation, relationships and everything that goes into an advertising profile. What else is needed?

They can just use the data already mined, which is probably 99% of everything they will ever need for many years to come. They probably have so much data they can use AI to predict what's missing with a fairly degree of accuracy, like what your face is going to look like in 20 years.

And due to the highly corrupt nature of politics, a few dollars here and there will undo this regulation fairly quickly. Or they could buy it from another company, because the law must be so poorly constructed that a clever lawyer will surely find workaround and they will be OK.

dylan6041y ago

How do you prove that an AI=>LLM has been trained on specific data? How is this enforceable? How is this anything more than some local politicians looking for some headlines?

29athrowaway1y ago

That too. They never get audited and if they do, it's in another jurisdiction.

elzbardico1y ago

Who would predict that Butlerian Jihad would start from all places, in Brazil?

mastercheph1y ago

Somebody forgot to pay their bribe!

j / k navigate · click thread line to collapse

48 comments

benreesman1y ago

This proposal gets made pretty frequently in one form or another, and (at least on HN) seems to usually get struck down on this or that procedural ground.

Leaving aside for a moment the thicket of laws, precedents, jurisdictions, and regulatory inertia: we can vote with our feet as both customers and contributors for common sense now.

Meta isn’t perfect on this stuff, but they’re by far the actor pulling the conversation in that direction. Let’s encourage them to continue pushing the pace on stuff like LLaMA 3.

luqtas1y ago

sitharus1y ago

However, does an LLM count as a derivative work or a transformative one? That's something for the lawyers to answer.

catlikesshrimp1y ago

>> derivative work or a transformative one?

rectang1y ago

This has an easy answer — it’s just not the one that people who desperately want to use LLMs for copyright washing want to hear.

[1] https://en.wikipedia.org/wiki/My_Sweet_Lord#Copyright_infrin...

luqtas1y ago

what are the opinions on [0]? what's the scene for language rather than image?

[0] https://arxiv.org/abs/2212.03860

stoniejohnson1y ago

obligatory IANAL, but seeing LLMs:

- regurgitate entire passages word for word, until that behavior is publicized and quickly RLHF'd away

- rip github repos almost entirely (some new Sonnet 3.5 demos Anthropic employees were bragging about on Twitter were basically 1:1 to a person's public repo)

Osmose1y ago

AlienRobot1y ago

> I don’t mind big companies training on generally available data, I mind the IP-laundering

Art96811y ago

Many businesses will fail by wasting time and money creating locks that can and will be circumvented.

I agree that a new social contract is inevitable because the only way to prevent data from being mined is to not produce it to begin with. Period. This I know.

AlienRobot1y ago

When I was young I used to upload fan art of Naruto to DeviantArt. My badly scanned drawings sucked. Everyone else's sucked. It was cool.

AI art might actually kill online art communities.

AI-generated articles might kill online publishing.

AI-generated spam bots might kill social media.

Humans are on the web because of the reach. If AI-generated content steals all the reach, why would anyone post anything on publicly accessible venues instead of just using private ones?

tensor1y ago

1 more reply

diego_sandoval1y ago

> training on roughly “the commons”,

The proposal in the article, however, is not about "the commons", it's about content that the users themselves produced, and then they voluntarily gave permission to Meta to use.

Or are you saying that if I produce some type of material, I shouldn't be able to license it for someone else to use it freely?

jononor1y ago

hansvm1y ago

> and then they voluntarily gave permission to Meta to use

That's a funny way to describe DNT headers, disallowed Meta cookies, DNS blocking all their domains, and maintaining copyright over my content.

ronsor1y ago

Your compromise is exactly the situation I desire but seems untenable to most people.

benatkin1y ago

> I speak only for myself but plenty of people seem to agree: I don’t mind big companies training on generally available data, I mind the IP-laundering.

This removes the big scary emotional part of the debate. Without this, it's weakened quite a bit.

__loam1y ago

golergka1y ago

How much would you personally invest in a startup which would spend billions of dollars on a compute cluster only to release the weights publicly after the training is complete?

jampekka1y ago

I'm gladly investing portion of my wages to such efforts, alongside e.g. education, healthcare, childcare and infrastructure. And I don't even want any monetary ROI from my investment!

golergka1y ago

So your plan is to regulate it into complete commercial unviability, where the only source of funding were government bureaucrats? How often does this strategy pay off?

1 more reply

mistrial91y ago

think of the Investors !

golergka1y ago

I really appreciate that retired teachers and miners have something to live on, yes.

__loam1y ago

Cheer21711y ago

> Compliance must be demonstrated by the company within five working days from the notification of the decision, and the agency established a daily fine of 50,000 reais ($8,820) for failure to do so.

alganet1y ago

This is the heads up fine _just_ to update the privacy policy. It's a mere warning.

Waterluvian1y ago

It’s not a Blockbuster Video. They’ll eventually increase the penalty for noncompliance or escalate the kind of punishment.

LoganDark1y ago

Are there any cases of this actually happening?

diegoholiveira1y ago

with half the fine, Meta buys the entire Brazilian supreme court and suddenly: no fines, no jail and everyone will be happy.

Nition1y ago

vitorgrs1y ago

Pretty sure that's per user (usually the other fines they already did, were like that)

tiahura1y ago

delichon1y ago

1 more reply

nostromo1y ago

What problem does this solve?

The article only mentions that data could be used to train AI to make CSAM... which seems needlessly alarmist and inflammatory.

cassianoleal1y ago

I don't see any mention of CSAM in the article. OTOH, this is how the third paragraph reads verbatim:

throwaway9571y ago

(moderator: please, don’t delete this comment again, everybody is commenting without knowing what Meta did)

* Access your profile and go to the settings section, signaled by three bars in the top right corner

* Click on "about" at the bottom of the page

* Select the privacy policy. On this new page, the three bars in the top right corner lead to the privacy center

* Click on the arrow next to other policies and articles and select the option "How Meta uses information for generative AI features and models"

* In the nineteenth paragraph, not counting topics, is the "right to object" option. Click on it.

yallpendantools1y ago

Honestly, I'm rather frustrated by the HN discourse on this topic.

TFA (with emphasis added):

https://www.theregister.com/2024/06/14/meta_eu_privacy/ (with emphasis added):

Bonus point question: if you have proven the points above, what action allows you to revoke the license you have granted FB?

(Of course, end of the day, I'm again playing lawyer in an online forum. I'm no better than anyone else here what do I know.)

fcanesin1y ago

The policy update seems to be global: https://www.facebook.com/privacy/policy

tensor1y ago

But ads are net negative and I'd argue that the influence of ads and paid actors on social media has been the single most destabilizing force in the world recently.

cmpyl01y ago

Link to the decision: https://www.gov.br/anpd/pt-br/assuntos/noticias/anpd-determi... in brazilian portuguese.

CSMastermind1y ago

Meta specifically or all social networks? If it's just Meta I'm curious of why.

29athrowaway1y ago

Well, joke's on you Brazil. I doubt Meta can mine any more data than they already have.

dylan6041y ago

How do you prove that an AI=>LLM has been trained on specific data? How is this enforceable? How is this anything more than some local politicians looking for some headlines?

29athrowaway1y ago

That too. They never get audited and if they do, it's in another jurisdiction.

elzbardico1y ago

Who would predict that Butlerian Jihad would start from all places, in Brazil?

mastercheph1y ago

Somebody forgot to pay their bribe!

j / k navigate · click thread line to collapse