0 points · suncemoje · 8d ago · 0 comments

Then I'm interested if there are any facts as to what ZDR actually means?

9 comments · 1 top-level

jaapz · 8d ago · 8 in thread

It can still mean Zero Data Retention - it just comes down to whether you trust the company to actually do what they promise.

The fact that they've trained models on data that wasn't theirs does not make me trust them a lot when they make this claim.

munksbeer · 8d ago

When discussing this, may I ask (I know you are probably bored of the actual arguments), what does "trained models on data that wasn't theirs" actually mean in practice?

Again, I know these arguments have been done to death, but every human who reads source code that wasn't written by them, or views art that wasn't created by them, and practices against this art, is training their brain on data "that wasn't theirs".

They are frequently making a living doing so.

Is this distinction the scale, or is there actually a different, stricter definition that we should be using as a common language to talk about this? As in, I should not even be reading certain source code if it is not licensed appropriately, or I will be in breach because I'm training myself illegally? And the same question for art, etc.?

VBprogrammer · 8d ago

In general humans don't have perfect recall. Even people with what we might call a photographic memory don't have the ability to memorise millions of lines of code and output them with little effort.

It hinges somewhat on the concept of how much you believe things are being learned and how much is just pattern matching and borrowing a solution from memory. Certainly in the early days of Copilot it was possible to get it to output chunks of open source code near verbatim.

I think, generally, people are probably closer to believing that there is some kind of reasoning being carried out by these models than in those early days, but it would also be easy to strip all of the immediately identifiable comments etc. from the training materials to make it harder to detect.

antonvs · 8d ago

> how much is just pattern matching and borrowing a solution from memory.

It's easy to show that this is not the case. This is a well-known phenomenon in ML, known as generalization - specifically, compositional generalization. See e.g. https://research.google/blog/measuring-compositional-general... for a description - although note that that post is from 2020, and models have become much better at this since then.

People can "believe" what they want, but there's plenty of work that definitively falsifies beliefs about "borrowing a solution from memory".

iwontberude · 8d ago

If it outputs copyrighted material, which it does handily, then it doesn’t really matter.

dpoloncsak · 8d ago

A product is not a human. They are selling a product based off copyrighted material without the rights to it. It's a pretty easy line to draw, honestly.

munksbeer · 7d ago

I don't think it is easy, otherwise this wouldn't be such a contentious and frequently discussed issue.

A human who trains their brain on material they don't own, then creates art or writes code based on this training, sells the product too.

XMPPwocky · 8d ago

Why is something being human or not relevant here?

glitchc · 8d ago

Once they feed your data into the training dataset, they can delete the individualized copy. The training dataset is, of course, a trade secret that can never be exposed without causing serious harm to the company's model, or equivalent legalese that will prevent its disclosure to all, governments included.
