EDIT: I found the talk: https://invidious.baczek.me/watch?v=ijyFC36kVis
There's one reference in the paper, from 2000, to GAs for fast generation of program trees, but that's missing the main show
Hope they're reading this and dig into those guys' work
https://seminars.math.binghamton.edu/ComboSem/worm-chiu.pge_...
https://arxiv.org/pdf/2209.09675
I authored PGE and have thought that RL, and more recently diffusion techniques, might help these algos. All of the algos need better ways to guide the search, or help it get unstuck from local optima, which happens surprisingly fast. Most work in GP / EVO is about avoiding premature convergence.
Whoops!
The big problem they seemed to get stuck at was partly doing it fast enough, and partly ending up with a result that was comprehensible. The latter in particular seems to be far better with LLMs. You tended to spend a lot of time trying to reorganise and prune trees to get something you could decipher, so the primary (and too limited) value became algorithms where you could justify investing a lot of resources into finding more optimal versions of very small/compact algorithms. But the challenge there is that there are often so many far lower-hanging fruits in most code bases that few people get to the point where it's worth trying.
I still love the idea at a conceptual level...
- https://web.archive.org/web/20021224053225/http://smi-web.st...
I went digging in Wikipedia: the Backpropagation article was created in 2005, and yet the derivation from the chain rule wasn't mentioned until 2014, through a borrow from the German article
https://en.wikipedia.org/w/index.php?title=Backpropagation&o...
There are also a lot of demos in WebPPL (a web probabilistic programming language) [2], like [3] for the synthesis of 3D spaceships. I highly recommend the associated books, The Design and Implementation of Probabilistic Programming Languages [4] and Probabilistic Models of Cognition [5].
I also highly recommend taking a look at the publications of the MIT Probabilistic Computing Project [6].
[1] Human-level concept learning through probabilistic program induction. https://www.cs.cmu.edu/~rsalakhu/papers/LakeEtAl2015Science....
In a traditional approach, you would generate random images, calculate some distance metric, then use some optimization method like simulated annealing to minimize the distance.
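That traditional loop can be sketched as follows; the "image" representation, mutation, and cooling schedule here are toy placeholders, not anything from the paper:

```python
import math
import random

def simulated_annealing(initial, mutate, distance, steps=10_000, t0=1.0):
    """Minimize distance(candidate), accepting worse moves with
    probability exp(-delta / T), where T cools toward zero."""
    current, cur_d = initial, distance(initial)
    best, best_d = current, cur_d
    for step in range(steps):
        t = t0 * (1 - step / steps) + 1e-9          # linear cooling schedule
        candidate = mutate(current)
        cand_d = distance(candidate)
        delta = cand_d - cur_d
        if delta < 0 or random.random() < math.exp(-delta / t):
            current, cur_d = candidate, cand_d
            if cur_d < best_d:
                best, best_d = current, cur_d
    return best, best_d

# Toy example: an "image" is a short list of pixel intensities.
target = [3, 1, 4, 1, 5]

def distance(img):
    return sum((a - b) ** 2 for a, b in zip(img, target))

def mutate(img):
    img = list(img)
    i = random.randrange(len(img))
    img[i] += random.choice([-1, 1])                # tweak one pixel by +-1
    return img

random.seed(0)
best, d = simulated_annealing([0, 0, 0, 0, 0], mutate, distance)
```

For a real image the distance metric would be something like a pixel-wise or perceptual loss, and the mutation would edit the generating program rather than pixels directly.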
I get that the difference between the image representations is being optimized here, but how is it possible that changing tokens in a program is differentiable?
Is it possible that it can "dissect" some parts of the execution, perhaps at assembly level, and come up with optimizations specific to the compiled code without changing the output (I mean expected program output, not emitted binary), that modern compilers have not deterministically come up with?
After decades of compiler research and super compilers chugging away, we're sort of at a point where discovering novel optimizations with results that are more than a smidge of improvement is almost impossibly unlikely. Compilers today are really good.
That said, I think the value that something like this might have is being able to optimize the intent of the code. If it can determine that I'm sorting some numbers, it can rewrite my code to use a faster sorting algorithm that has the same functional properties. If I'm storing data that never gets used, it can stop storing it. It has a view of the code at a level above what the compiler sees, with an understanding not just of what is being done, but why.
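As a toy before/after of what such an intent-level rewrite might look like (purely hypothetical; no current compiler does this):

```python
# Before: the intent is "sort numbers", but it's expressed as an
# O(n^2) insertion sort that a compiler must preserve as written.
def sort_slow(xs):
    out = []
    for x in xs:
        i = 0
        while i < len(out) and out[i] < x:
            i += 1
        out.insert(i, x)
    return out

# After: a tool that recognizes the *intent* could rewrite the whole
# function to the library sort (O(n log n)) with identical observable
# behavior, something no peephole optimizer would ever attempt.
def sort_fast(xs):
    return sorted(xs)
```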
I agree when it comes to peephole optimizations, but there's still a lot of juice left in exploiting language guarantees (immutability, non-aliasing, data-parallelism), however most compiler developer energy is spent propping up C/C++ and consequently optimizations are developed with those languages in mind.
The application I had in mind during the research was anti-malware static analysis, but optimization is really just the flipside of obfuscation. Something I'd like to try in the future is a diffusion model that treats obfuscation as "noise" to be removed.
One thing I learned is that optimizing compilers produce very regular output. After normalizing addresses, the "vocabulary" size of basic blocks ends up pretty small, like ~2000 tokens. Certain "phrases" correlate with the original source code's semantics regardless of how much obfuscation you add on top.
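A minimal sketch of that normalization, assuming a simple textual disassembly format (the regex and placeholder token are illustrative):

```python
import re

# Replace concrete addresses/immediates in disassembled basic blocks
# with a placeholder token, so blocks that differ only in layout
# collapse to the same "vocabulary" entry.
ADDR = re.compile(r'0x[0-9a-fA-F]+')

def normalize_block(lines):
    return tuple(ADDR.sub('<ADDR>', line) for line in lines)

block_a = ["mov rax, [0x404050]", "call 0x401000", "ret"]
block_b = ["mov rax, [0x404058]", "call 0x401200", "ret"]

# After normalization the two blocks are one vocabulary entry.
vocab = {normalize_block(block_a), normalize_block(block_b)}
```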
There are people applying synthesis techniques to superoptimization. So something like this would possibly apply.
These can be treated as parameterised nodes in a tree, similar to what's happening here. It follows that there may be a possible adaptation of this to SDF composition, such that you can give it a shape and have it produce the SDF nodes + composition required to produce that shape.
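A minimal 2D sketch of SDFs as parameterised tree nodes (the primitive and combinator names here are illustrative, not from the paper):

```python
import math

# Leaf node: a circle with parameters (cx, cy, r).
def circle(cx, cy, r):
    return lambda x, y: math.hypot(x - cx, y - cy) - r

# Interior nodes: standard SDF combinators.
def union(a, b):
    return lambda x, y: min(a(x, y), b(x, y))

def subtract(a, b):  # a minus b
    return lambda x, y: max(a(x, y), -b(x, y))

# Compose a shape tree: two overlapping circles with a hole punched out.
shape = subtract(
    union(circle(0, 0, 1), circle(1, 0, 1)),
    circle(0.5, 0, 0.3),
)

inside = shape(0.0, 0.8) < 0    # inside the union, outside the hole
in_hole = shape(0.5, 0.0) > 0   # inside the punched hole, so "outside"
```

Searching over such trees (node types plus their parameters) is exactly the kind of discrete+continuous structure the paper's approach targets for 2D primitives.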
Most existing approaches to SDFs with NNs have the NN itself take on the role of the SDF (i.e. given a point, it predicts the distance), so there's a compelling opportunity here to build a system that can produce spatial representations from existing imagery without NN inference at render-time.
I imagine adding the third dimension to the problem makes it much harder, though! I'll have to give the paper a read to determine how coupled to 2D their current approach is.
I would argue that at least on a philosophical level, this is, definitionally, the process of converting raster graphics to vector graphics, as long as you buy the premise that the difference between the two is simply that vector gfx is a programmatic/imperative representation of image generation, while raster is a data structure/declarative representation of images.
In other words, simply put, raster images are just arrays, vector images are sequences of instructions. Raster images are naturally much closer to the “raw” data needed by the output mechanism, while vector images require a much more complex interpreter.
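A toy contrast makes this concrete (the mini vector "format" here is invented for illustration):

```python
# A raster image is just data: a 2D array of pixels.
raster = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
]

# A vector image is a sequence of instructions: here, two rectangles
# given as (op, x, y, width, height).
vector = [("rect", 1, 0, 1, 3), ("rect", 0, 1, 3, 1)]

# The interpreter that executes the instructions into pixels.
def rasterize(ops, w=3, h=3):
    img = [[0] * w for _ in range(h)]
    for op, x, y, ow, oh in ops:
        if op == "rect":
            for j in range(y, y + oh):
                for i in range(x, x + ow):
                    img[j][i] = 1
    return img
```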
This seems a bit off to me.
Aside from oddities such as PostScript, most vector formats are canonical examples of declarative code. The distinction is more about what is being represented than how it is represented.
The only thing I couldn't figure out to make it happen was how to connect it well to a description of the desired program or edit.
BTW my idea was to train it by corrupting programs available on GitHub (adding noise via some random valid ops, then removing it to retrieve the original program). Probably best done where commit N-1 is treated as the noised version and you move back toward commit 0.
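A minimal sketch of that noising step, assuming Python source and a single noise op (flipping arithmetic operators); a real pipeline would need a much richer set of valid edits:

```python
import ast
import random

# Each noise op must keep the program syntactically valid while
# (usually) changing its semantics; here we only flip +/- and */"/".
FLIP = {ast.Add: ast.Sub, ast.Sub: ast.Add, ast.Mult: ast.Div, ast.Div: ast.Mult}

class Noiser(ast.NodeTransformer):
    def __init__(self, p):
        self.p = p

    def visit_BinOp(self, node):
        self.generic_visit(node)                    # noise children first
        if random.random() < self.p and type(node.op) in FLIP:
            node.op = FLIP[type(node.op)]()
        return node

def noise(source, p=0.5):
    tree = Noiser(p).visit(ast.parse(source))
    return ast.unparse(tree)                        # Python 3.9+

clean = "def f(x):\n    return x * 2 + 1"
noisy = noise(clean, p=1.0)   # p=1.0 flips every eligible operator
```

Training pairs would then be (noisy, clean), with the model learning to invert the edits step by step.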
Could anyone clarify how they integrate beam search with the reverse diffusion: do they sample m > k nodes from a reverse-diffusion step and expand only the top k nodes?
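For what it's worth, here is a sketch of the scheme the question describes, i.e. sample m > k continuations per step and keep the top k (my guess at the integration, not necessarily the paper's actual algorithm):

```python
import heapq
import random

def beam_diffusion(init, denoise_sample, score, steps, k=4, m=16):
    """At each reverse-diffusion step, draw m samples per beam entry,
    then keep only the k highest-scoring candidates."""
    beam = [init]
    for _ in range(steps):
        candidates = [denoise_sample(node) for node in beam for _ in range(m)]
        beam = heapq.nlargest(k, candidates, key=score)
    return max(beam, key=score)

# Toy stand-in: "denoising" nudges an integer state toward a target of 42.
random.seed(0)
result = beam_diffusion(
    init=0,
    denoise_sample=lambda x: x + random.choice([-1, 0, 1, 2]),
    score=lambda x: -abs(x - 42),
    steps=60, k=4, m=8,
)
```

In the paper's setting the state would be a partial program tree, the sampler a learned denoiser, and the score something like reconstruction error against the target image.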
I've read several times that the author listed last is usually the most significant contributor, and the first author the least significant, due to some kind of tradition around modesty plus favourably introducing new names. (Which then of course doesn't work, if everyone knows it's happening...)
Here, you've interpreted it as the reverse (and by that I mean in the sensible way): did you not know about the tradition, or are you wrong? And how can we know either way for sure?
If you're reading papers, most of the time the last author is more meaningful to you because they're the senior researcher: you know their research interests and what kind of papers they produce. The first authors are PhD students and postdocs; they change much more often.
Where did you read that?
That's definitely not the case in machine learning.