Better HN

Exactly. Training data is key for any trained AI system. And I strongly suspect that every single company active in these modern AI systems is still struggling with how to tune their training data. It's easy to get something out of it, but it's hard to control what you get out of it.

cainxinth 2y ago

> We don't know since no one is releasing their data.

Is anyone else just assuming at this point that virtually everyone is using the pirated materials in The Pile like Books3?

JohnFen 2y ago

I think it's really, really clear that the majority of the data used to train all of these things was used without permission.

Zambyte 2y ago

The model is open weight, which is less useful than open source but more useful than fully proprietary (akin to the executable binaries you compare it to).

boulos 2y ago

How about "weights available" as similar to the "source available" moniker?

fragmede 2y ago

Weights available or model available, but yes.

drexlspivey 2y ago

Their data is the Twitter corpus, which is public. Or do you want a dump of their database for free too?

minimaxir 2y ago

Tweet data is both highly idiosyncratic and short by design, so on its own it is not conducive to training an LLM.

llm_trw 2y ago

Saying "It's just the twitter public corpus." is like saying "Here's the Linux Kernel, makefiles not included."

fragmede 2y ago

That's a subtle dig at the fact that they have all of Twitter as a training corpus to use, but we don't know how they weight tweets. And we know they're not gonna be weighted evenly.

rezonant 2y ago

I'm sure just like in X's algorithms, @elon tweets are weighted heavily.

convery 2y ago

The X algorithm is also open source, so you can verify before commenting.

nonethewiser 2y ago

> I'm sure just like in X's algorithms, @elon tweets are weighted heavily.

Are you sure or is it the literal opposite and you’re just speculating?

jakderrida 2y ago

Aren't they usually built on most of the same training data?

GaggiX 2y ago

Or even how much it was trained on this dataset, the amount of FLOPs.

llm_trw 2y ago

We don't know since no one is releasing their data.

Calling these models open source is like calling a binary open source because you can download it.

Which in this day and age isn't far from where we're at.

DreamGen 2y ago

A big distinction is that you can build on top of (fine-tune) the released models just as well as if they had released the pre-training data.

llm_trw 2y ago

You can also build on top of binaries if you use gotos and machine code.

tarruda 2y ago

You can fine-tune without the pre-training data too.

Mistral models are one example: they never released pre-training data, and there are many fine-tunes.

swalsh 2y ago

We should just call them open-weight models at this point.

cl3misch 2y ago

FWIW the Grok repo uses the term "open weights".

mcv 2y ago
