I confess I'm not sure what I'd do with this in the random grab bag of Deep Learning knowledge I have, but I think it's pretty fascinating. I might like to see a trained latent encoder that works well on a bunch of different neural networks; maybe that thing would be a good tool for interpreting / inspecting.
Or maybe some metaparameter that mucks with the sizes during training produces better results. Start large to get a baseline, then reduce size to increase coherence and learning speed, then scale up again once that is maxed out.
I.e., self-supervised training is done to produce semantically sensible results, and the RL-trained conditioning input steers toward contextually useful results.
(Btw., if anyone has tips on how not to wreck the RL training's effort when updating the base model with the recently encountered semantically valid training samples that can be used self-supervised, please tell. I'd hate to throw away the RL effort expended to acquire that much training data for good self-supervised operation. It's already looking fairly expensive...)
If we can get that, then maybe we don't even need to train anymore; it'd be possible to start generating NNs algorithmically.
[0] https://www.lesswrong.com/tag/recursive-self-improvement
It's not "recursive self-improvement", which is just a belief that magic is real and you can wish an AI into existence. In particular, this one needs too much training data, and you can't define "improvement" without knowing what to improve toward.
Recursive self-improvement isn't "maybe magic is real", it's "maybe the magic we already know about stays magical as we cast our spells with more mana."
Is there a law of thermodynamics which prevents AI from writing code which would train a better AI? Never learned that one in school.
And FYI, here's OpenAI's plan to align superintelligence: "Our goal is to build a roughly human-level automated alignment researcher. We can then use vast amounts of compute to scale our efforts, and iteratively align superintelligence."
I guess people working there believe in magic.
> and you can wish an AI into existence.
Eh? People believe that self-improvement might happen when AI is around human-level.
GANs are another example of self-improvement. They were famous for creating "deep fakes". A GAN works by pitting a fake generator and a fake detector (the discriminator) against each other, resulting in a cycle of improvement. It didn't get much further than that; in fact, the field is all about attention and transformers now.
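That adversarial cycle fits in a few lines. Below is a toy sketch (my own illustration, not any real GAN codebase): the "data" is a 1-D Gaussian, the generator is an affine map of noise, the discriminator is logistic regression, and the gradients are derived by hand. Each step the discriminator gets better at telling real from fake, and the generator gets better at fooling it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Real data: samples from N(3, 1). The GAN's job is to mimic this.
def real_batch(n):
    return rng.normal(3.0, 1.0, n)

# Generator: an affine map of noise, g(z) = a*z + b.
a, b = 1.0, 0.0
# Discriminator: logistic regression, D(x) = sigmoid(w*x + c).
w, c = 0.1, 0.0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr = 0.05
for step in range(2000):
    x_real = real_batch(64)
    z = rng.normal(0.0, 1.0, 64)
    x_fake = a * z + b

    # --- Discriminator update: push D(real) -> 1 and D(fake) -> 0 ---
    d_real = sigmoid(w * x_real + c)
    d_fake = sigmoid(w * x_fake + c)
    w += lr * (np.mean((1 - d_real) * x_real) - np.mean(d_fake * x_fake))
    c += lr * (np.mean(1 - d_real) - np.mean(d_fake))

    # --- Generator update: push D(fake) -> 1 (non-saturating loss) ---
    d_fake = sigmoid(w * x_fake + c)
    g_grad = (1 - d_fake) * w   # gradient of log D(x_fake) w.r.t. x_fake
    a += lr * np.mean(g_grad * z)
    b += lr * np.mean(g_grad)

fakes = a * rng.normal(0.0, 1.0, 1000) + b
print(round(float(np.mean(fakes)), 2))  # drifts toward the real mean of 3
```

The "cycle of improvement" is exactly those two alternating updates: neither player has a fixed target, each is trained against the other's current best effort.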
This is just a way of optimizing parameters; it will not invent new techniques. It can say "put 1000 neurons here, 2000 there, etc.", but it still has to pick from what designers tell it to pick from. It may adjust these parameters better than a human can, leading to more efficient systems, so I expect some improvement to existing systems, but not a breakthrough.
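To make the "picking from a menu" point concrete, here is a toy sketch (my own, not from the paper): a search over hidden-layer widths for a random-features regressor. The search can only ever choose among the widths the designer enumerated; it cannot propose a new architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy task: regress y = sin(x) on [-3, 3].
x = rng.uniform(-3, 3, (200, 1))
y = np.sin(x).ravel()
x_val = rng.uniform(-3, 3, (100, 1))
y_val = np.sin(x_val).ravel()

def val_loss(width):
    """Random-features 'network': a fixed random hidden layer of the given
    width, with a linear readout fit by least squares."""
    W = rng.normal(0, 1, (1, width))
    h = np.tanh(x @ W)                          # hidden activations (train)
    coef, *_ = np.linalg.lstsq(h, y, rcond=None)
    h_val = np.tanh(x_val @ W)
    return float(np.mean((h_val @ coef - y_val) ** 2))

# The search can only pick from the menu the designer wrote down:
menu = [1, 2, 4, 8, 16, 32, 64]
scores = {wd: val_loss(wd) for wd in menu}
best = min(scores, key=scores.get)
print(best, round(scores[best], 4))
```

However clever the selection rule gets, its output is still an index into `menu`; inventing a genuinely new component is outside the search space by construction.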
(Though I suppose this skips Neuralink / step 3 and jumps right to step 4.)
Furthermore, I posit that residual connections, especially in transformers, allow the model a more exploratory behavior that is really powerful, and are a necessary component of the transformer's power. The transformer is just such a great architecture, the more I think about it. It's doing so many things so right. Although this is not really related to the topic.
Transformers are just networks that learn to program the weights of other networks [1]. In the successful cases the programmed network has been quite primitive -- merely a key-value store -- in order to ensure that you can backpropagate errors from the programmed network's outputs all the way to the programmer network's inputs.
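The "programmed network is a key-value store" view can be shown in a few lines of numpy. In this toy sketch (my own illustration; in a real transformer the K and V matrices would be produced from the input tokens, not written by hand), attention is a soft lookup: a query is matched against every stored key, and the values are mixed according to the match scores.

```python
import numpy as np

# The "programmed" key-value store: in a transformer, the programmer
# network would emit these from the input; here we write 3 entries by hand.
K = np.array([[1.0, 0.0],    # key 0
              [0.0, 1.0],    # key 1
              [-1.0, 0.0]])  # key 2
V = np.array([[10.0],        # value stored under key 0
              [20.0],        # value stored under key 1
              [30.0]])       # value stored under key 2

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(q, beta=8.0):
    """Soft lookup: score q against every key, then mix the values.
    beta sharpens the match, like a low softmax temperature."""
    return softmax(beta * (K @ q)) @ V

print(attend(np.array([0.0, 1.0])))  # ~[20.]: retrieves the value under key 1
```

Crucially, every operation here (matrix product, softmax, weighted sum) is differentiable, which is the point made above: errors at the store's output can be backpropagated all the way into whatever network wrote K and V.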
The present work extends this idea to a different kind of programmed network: a convolutional image-processing network.
There are many more breakthroughs to be achieved along this line of research -- it is a rich vein to mine. I believe our best shot at getting neural networks to do discrete math and symbolic logic, and to write nontrivial computer programs, will result from this line of research.