Using data from another model won't save you any training time.
It's... not, and it's repeatedly been proven in practice that this is an invalid generalization because it's missing necessary qualifications. It's funny that this myth keeps persisting.
It's probably a bad idea to use uncurated output from another AI to train a model if you are trying to make a better model rather than a distillation of the first model, and it's definitely a bad idea (and, ISTR, the actual research result from which the false generalization developed) to iteratively fine-tune a model on its own unfiltered output. But there has been lots of success using AI models to generate data which is then curated and used to train other models. That can be much more efficient than trying to create new material without AI, once you've already hoovered up all the readily accessible, low-hanging fruit of premade content relevant to your training goal.
This is immediately obvious if you look at it through a statistical learning lens rather than the mysticism crystal ball that many people view NNs through.
"Play and reflection" is something else, which isn't distillation.
Given this, there's no reason it couldn't even be trivial to produce a child model from (filtered) parent output that exceeds the parent model on a different, more meaningful objective, like being a useful chatbot. There's no reason this would have to be limited to domains with verifiable answers, either.
It is not distillation. It's like how you can arrive at new knowledge by reflecting on existing knowledge.
Unfiltered? Sure. With human curation of the generated data it certainly can. (Even automated curation can do this, though it's more obvious that human curation can.)
I mean, I can randomly generate fact claims about addition, and if I curate which ones go into a training set, I can train a model that reflects addition of integers much more accurately than the random process that generated the pre-curation data.
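A minimal sketch of that point, with a made-up noisy generator and an exact arithmetic check standing in for the curator (in practice that role is a human or a verifier, not a one-line check):

    import random

    def noisy_generator():
        """Emit addition claims, only some of which are correct."""
        a, b = random.randint(0, 99), random.randint(0, 99)
        claimed = a + b + random.choice([0, 0, -1, 1, 10])  # often wrong
        return a, b, claimed

    def curate(claim):
        """Keep a claim only if it checks out."""
        a, b, claimed = claim
        return a + b == claimed

    raw = [noisy_generator() for _ in range(10_000)]
    curated = [c for c in raw if curate(c)]

    raw_acc = sum(a + b == s for a, b, s in raw) / len(raw)
    print(f"generator accuracy: {raw_acc:.0%}")  # roughly 40%
    print(f"curated accuracy: 100%, {len(curated)} examples kept")

The training set that comes out the other end is strictly more accurate than the process that generated it; the curation step is where the new information enters.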
Without curation, as I already said, the best you get is a distillation of the source model, which is highly improbable to be more accurate.
That is the existential, $1T question.
Also, can I have some money to build more data centres pls?
Re: "generally a bad idea", I'd just highlight "generally" ;) Clearly it worked in this case!
I said generally because there are things like adversarial training that use a ruleset to help generate correct datasets, and those work well. Outside of techniques like that, it's not just a rule of thumb; it's always true that training on the output of another model will result in a worse model.
https://www.scientificamerican.com/article/ai-generated-data...
Not convincing.
You can imagine a model doing some primitive thinking and coming to a conclusion. Then you can train another model on summaries of that thinking. If everything goes well, it will come to the same conclusions quicker, at the least. Or it may be able to solve more complex problems with the same amount of 'thinking'. That would be self-propelled evolution.
Another option is to use one model to produce the 'thinking' part from known outputs, then train another model to reproduce that thinking and so reach the right output, which is unknown to it initially. Using humans to create such a dataset would be slow and very expensive.
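Roughly this, where teacher_generate is a hypothetical stand-in for a call to the teacher model, and the known answers act as the filter so wrong reasoning chains never enter the training set:

    from typing import Callable

    def build_rationale_dataset(
        qa_pairs: list[tuple[str, str]],
        teacher_generate: Callable[[str], str],
        samples_per_question: int = 4,
    ) -> list[dict]:
        dataset = []
        for question, known_answer in qa_pairs:
            for _ in range(samples_per_question):
                trace = teacher_generate(
                    f"Think step by step, then answer:\n{question}"
                )
                # Curation step: keep a trace only if it lands on the
                # answer we already know is correct.
                if trace.strip().endswith(known_answer):
                    dataset.append({"prompt": question, "completion": trace})
                    break
        return dataset

The student is then fine-tuned on these (question, trace) pairs, learning the thinking that leads to answers it couldn't reach on its own.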
PS: if this were impossible, humans would still be living in the trees.
These models don't evolve like that; there is no random process of architectural evolution. Nor is there a fitness function anything like "get better at math."
A system like AlphaZero works because it has rules to use as an oracle: the game rules. The rules provide the new training information needed to drive the process. Each game played produces new, correct training data.
These LLMs have no such oracle. Their fitness function is and remains: predict the next word, followed by: produce text that makes a human happy. Note that it's not "produce text that makes ChatGPT happy."
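To make the contrast concrete, here is a sketch of a self-play loop against a hypothetical game interface. The point is the last labeling step: the rules decide the outcome, so every finished game mints fresh, guaranteed-correct training data.

    def self_play_batch(game, policy, n_games):
        examples = []
        for _ in range(n_games):
            state, history = game.initial_state(), []
            while not game.is_terminal(state):
                move = policy(state)
                history.append((state, move))
                state = game.apply(state, move)
            outcome = game.winner(state)  # the oracle: rules decide, not a model
            examples.extend((s, m, outcome) for s, m in history)
        return examples  # labels are exact; no human or model judgment involved

An LLM self-training loop has no game.winner(); the only 'oracle' available is another model's opinion or a human rater.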
Ah. So if I understand this... once the internet becomes completely overrun with AI-generated articles of no particular substance or importance, we should not bulk-scrape that internet again to train the subsequent generation of models.
I look forward to that day.
It seems like the difference between someone doing a better writeup of (say) Wiles's proof vs. proving Fermat's Last Theorem independently.
It proves we _can_ optimize our training data.
It's just like humans: we've been genetically stable for a long time, yet the quality and structure of the information available to a child today, versus 2000 years ago, makes them more skilled at certain tasks. Math is a good example.
That is not true at all.
We have known how to solve this for at least 2 years now.
All the latest state of the art models depend heavily on training on synthetic data.
> We find that indiscriminate use of model-generated content in training causes irreversible defects in the resulting models
No one is training on indiscriminate synthetic data. It's very much discriminated, but still synthetic.