We have numerous artifacts to reason about:
- The model code
- The training code
- The fine-tuning code
- The inference code
- The raw training data
- The processed training data (which may differ across pre-training stages and fine-tuning!)
- The resultant weights
- The inference outputs (which also need a license)
- The research papers (hopefully the work is described in the literature!)
- The patents (or lack thereof)
The term "open source" is wholly inadequate here. We need a 10-star grading system for this.
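The ten artifacts above map naturally onto such a rubric: one star per openly released artifact. Here's a minimal sketch of how that tally could work; the artifact names and example scores are my own illustration, not an official assessment of any model:

```python
# One star per openly released artifact, mirroring the list above.
ARTIFACTS = [
    "model code",
    "training code",
    "fine-tuning code",
    "inference code",
    "raw training data",
    "processed training data",
    "weights",
    "inference outputs (usable license)",
    "research papers",
    "patents (or none asserted)",
]

def openness_score(open_artifacts: set) -> str:
    """Count how many of the ten artifacts are openly available."""
    stars = sum(1 for a in ARTIFACTS if a in open_artifacts)
    return f"{stars}/10"

# Hypothetical release that opens code, weights, and papers:
example = {"model code", "fine-tuning code", "inference code",
           "weights", "research papers"}
print(openness_score(example))  # → 5/10
```

A real rubric would likely want partial credit (e.g. weights released under a restrictive license), but even this coarse tally says more than the binary "open source" label.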
This is not your mamma's C library.
AFAICT, DeepSeek scores 7/10, which is better than OpenAI's 0/10 (they don't even let you train on the outputs).
This is more than enough to distill new models from.
Everybody is laundering training data, and it's rife with copyrighted material, PII, and pilfered outputs from other commercial AI systems. Because of that, I don't expect we'll see much legally open training data for some time to come. In fact, the first fully open training corpus of adequate size (not something like LJSpeech) is likely to be 100% synthetic or robotically captured.