EDIT: I found the talk: https://invidious.baczek.me/watch?v=ijyFC36kVis
There's one reference in the paper, from 2000, to GAs for fast generation of program trees, but that's missing the main show
Hope they're reading this and dig into those guys' work
https://seminars.math.binghamton.edu/ComboSem/worm-chiu.pge_...
https://arxiv.org/pdf/2209.09675
I authored PGE and have thought that RL, and more recently diffusion techniques, might help these algos. All of the algos need better ways to guide the search, or help it get unstuck from local optima, which happens surprisingly fast. Most work in GP / EVO is about avoiding premature convergence.
Whoops!
The big problem they seemed to get stuck at was partly doing it fast enough, and partly ending up with a result that was comprehensible. The latter in particular seems to be far better with LLMs. You tended to spend a lot of time trying to reorganise and prune trees to get something you could decipher, so the primary (and too limited) value became algorithms where you could justify investing a lot of resources into finding more optimal versions of very small/compact algorithms. But the challenge there is that there are often so many far lower-hanging fruits in most code bases that few people get to the point where it's worth trying.
I still love the idea at a conceptual level...
- https://web.archive.org/web/20021224053225/http://smi-web.st...
I went digging in Wikipedia: the Backpropagation article was created in 2005, and yet the derivation from the chain rule wasn't mentioned until 2014, through a borrow from the German article
https://en.wikipedia.org/w/index.php?title=Backpropagation&o...
There are also a lot of demos in WebPPL (a web probabilistic programming language) [2], like [3] for the synthesis of 3D spaceships. I highly recommend the associated books, The Design and Implementation of Probabilistic Programming Languages [4] and Probabilistic Models of Cognition [5].
I also highly recommend taking a look at the publications of the MIT Probabilistic Computing Project [6].
[1] Human-level concept learning through probabilistic program induction. https://www.cs.cmu.edu/~rsalakhu/papers/LakeEtAl2015Science....
In a traditional approach, you would generate random images, calculate some distance metric, then use some optimization method like simulated annealing to minimize the distance.
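That traditional loop can be sketched as follows; the "image" representation, mutation, and cooling schedule here are toy placeholders, not anything from the paper:

```python
import math
import random

def simulated_annealing(initial, mutate, distance, steps=10_000, t0=1.0):
    """Minimize distance(candidate), accepting worse moves with
    probability exp(-delta / T), where T cools toward zero."""
    current, cur_d = initial, distance(initial)
    best, best_d = current, cur_d
    for step in range(steps):
        t = t0 * (1 - step / steps) + 1e-9          # linear cooling schedule
        candidate = mutate(current)
        cand_d = distance(candidate)
        delta = cand_d - cur_d
        if delta < 0 or random.random() < math.exp(-delta / t):
            current, cur_d = candidate, cand_d
            if cur_d < best_d:
                best, best_d = current, cur_d
    return best, best_d

# Toy example: an "image" is a short list of pixel intensities.
target = [3, 1, 4, 1, 5]

def distance(img):
    return sum((a - b) ** 2 for a, b in zip(img, target))

def mutate(img):
    img = list(img)
    i = random.randrange(len(img))
    img[i] += random.choice([-1, 1])                # tweak one pixel by +-1
    return img

random.seed(0)
best, d = simulated_annealing([0, 0, 0, 0, 0], mutate, distance)
```

For a real image the distance metric would be something like a pixel-wise or perceptual loss, and the mutation would edit the generating program rather than pixels directly.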
I get that the difference between the image representations is being optimized here, but how is it possible that changing tokens in a program is differentiable?
Is it possible that it can "dissect" some parts of the execution, perhaps at assembly level, and come up with optimizations specific to the compiled code without changing the output (I mean expected program output, not emitted binary), that modern compilers have not deterministically come up with?
After decades of compiler research and super compilers chugging away, we're sort of at a point where discovering novel optimizations with results that are more than a smidge of improvement is almost impossibly unlikely. Compilers today are really good.
That said, I think the value that something like this might have is being able to optimize the intent of the code. If it can determine that I'm sorting some numbers, it can rewrite my code to use a faster sorting algorithm that has the same functional properties. If I'm storing data that never gets used, it can stop storing it. It has a view of the code at a level above what the compiler sees, with an understanding not just of what is being done, but why.
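As a toy before/after of what such an intent-level rewrite might look like (purely hypothetical; no current compiler does this):

```python
# Before: the intent is "sort numbers", but it's expressed as an
# O(n^2) insertion sort that a compiler must preserve as written.
def sort_slow(xs):
    out = []
    for x in xs:
        i = 0
        while i < len(out) and out[i] < x:
            i += 1
        out.insert(i, x)
    return out

# After: a tool that recognizes the *intent* could rewrite the whole
# function to the library sort (O(n log n)) with identical observable
# behavior, something no peephole optimizer would ever attempt.
def sort_fast(xs):
    return sorted(xs)
```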
I agree when it comes to peephole optimizations, but there's still a lot of juice left in exploiting language guarantees (immutability, non-aliasing, data-parallelism), however most compiler developer energy is spent propping up C/C++ and consequently optimizations are developed with those languages in mind.
The application I had in mind during the research was anti-malware static analysis, but optimization is really just the flipside of obfuscation. Something I'd like to try in the future is a diffusion model that treats obfuscation as "noise" to be removed.
One thing I learned is that optimizing compilers produce very regular output. After normalizing addresses, the "vocabulary" size of basic blocks ends up pretty small, like ~2000 tokens. Certain "phrases" correlate with the original source code's semantics regardless of how much obfuscation you add on top.
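A minimal sketch of that normalization, assuming a simple textual disassembly format (the regex and placeholder token are illustrative):

```python
import re

# Replace concrete addresses/immediates in disassembled basic blocks
# with a placeholder token, so blocks that differ only in layout
# collapse to the same "vocabulary" entry.
ADDR = re.compile(r'0x[0-9a-fA-F]+')

def normalize_block(lines):
    return tuple(ADDR.sub('<ADDR>', line) for line in lines)

block_a = ["mov rax, [0x404050]", "call 0x401000", "ret"]
block_b = ["mov rax, [0x404058]", "call 0x401200", "ret"]

# After normalization the two blocks are one vocabulary entry.
vocab = {normalize_block(block_a), normalize_block(block_b)}
```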
There are people applying synthesis techniques to superoptimization. So something like this would possibly apply.
These can be treated as parameterised nodes in a tree, similar to what's happening here. It follows that there may be a possible adaptation of this to SDF composition, such that you can give it a shape and have it produce the SDF nodes + composition required to produce that shape.
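A minimal 2D sketch of SDFs as parameterised tree nodes (the primitive and combinator names here are illustrative, not from the paper):

```python
import math

# Leaf node: a circle with parameters (cx, cy, r).
def circle(cx, cy, r):
    return lambda x, y: math.hypot(x - cx, y - cy) - r

# Interior nodes: standard SDF combinators.
def union(a, b):
    return lambda x, y: min(a(x, y), b(x, y))

def subtract(a, b):  # a minus b
    return lambda x, y: max(a(x, y), -b(x, y))

# Compose a shape tree: two overlapping circles with a hole punched out.
shape = subtract(
    union(circle(0, 0, 1), circle(1, 0, 1)),
    circle(0.5, 0, 0.3),
)

inside = shape(0.0, 0.8) < 0    # inside the union, outside the hole
in_hole = shape(0.5, 0.0) > 0   # inside the punched hole, so "outside"
```

Searching over such trees (node types plus their parameters) is exactly the kind of discrete+continuous structure the paper's approach targets for 2D primitives.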
Most existing approaches to SDFs with NNs have the NN itself take on the role of the SDF (i.e. given a point, it predicts the distance), so there's a compelling opportunity here to build a system that can produce spatial representations from existing imagery without NN inference at render-time.
I imagine adding the third dimension to the problem makes it much harder, though! I'll have to give the paper a read to determine how coupled to 2D their current approach is.
I would argue that at least on a philosophical level, this is, definitionally, the process of converting raster graphics to vector graphics, as long as you buy the premise that the difference between the two is simply that vector gfx is a programmatic/imperative representation of image generation, while raster is a data structure/declarative representation of images.
In other words, simply put, raster images are just arrays, vector images are sequences of instructions. Raster images are naturally much closer to the “raw” data needed by the output mechanism, while vector images require a much more complex interpreter.
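A toy contrast makes this concrete (the mini vector "format" here is invented for illustration):

```python
# A raster image is just data: a 2D array of pixels.
raster = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
]

# A vector image is a sequence of instructions: here, two rectangles
# given as (op, x, y, width, height).
vector = [("rect", 1, 0, 1, 3), ("rect", 0, 1, 3, 1)]

# The interpreter that executes the instructions into pixels.
def rasterize(ops, w=3, h=3):
    img = [[0] * w for _ in range(h)]
    for op, x, y, ow, oh in ops:
        if op == "rect":
            for j in range(y, y + oh):
                for i in range(x, x + ow):
                    img[j][i] = 1
    return img
```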
This seems a bit off to me.
Aside from oddities such as PostScript, most vector formats are canonical examples of declarative code. The distinction is more about what is being represented than how it is represented.
The only thing I couldn't figure out to make it happen was how to connect it well to a description of the desired program or edit.
BTW my idea was to train it by corrupting programs available on GitHub (adding noise via some random valid ops, then removing it to retrieve the original program). Probably best done where commit N-1 is treated as the noised version and you move back toward commit 0.
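A minimal sketch of that noising step, assuming Python source and a single noise op (flipping arithmetic operators); a real pipeline would need a much richer set of valid edits:

```python
import ast
import random

# Each noise op must keep the program syntactically valid while
# (usually) changing its semantics; here we only flip +/- and */"/".
FLIP = {ast.Add: ast.Sub, ast.Sub: ast.Add, ast.Mult: ast.Div, ast.Div: ast.Mult}

class Noiser(ast.NodeTransformer):
    def __init__(self, p):
        self.p = p

    def visit_BinOp(self, node):
        self.generic_visit(node)                    # noise children first
        if random.random() < self.p and type(node.op) in FLIP:
            node.op = FLIP[type(node.op)]()
        return node

def noise(source, p=0.5):
    tree = Noiser(p).visit(ast.parse(source))
    return ast.unparse(tree)                        # Python 3.9+

clean = "def f(x):\n    return x * 2 + 1"
noisy = noise(clean, p=1.0)   # p=1.0 flips every eligible operator
```

Training pairs would then be (noisy, clean), with the model learning to invert the edits step by step.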
Could anyone clarify how they integrate beam search with the reverse diffusion: do they sample m > k nodes from a reverse-diffusion step and expand only the top k nodes?
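For what it's worth, here is a sketch of the scheme the question describes, i.e. sample m > k continuations per step and keep the top k (my guess at the integration, not necessarily the paper's actual algorithm):

```python
import heapq
import random

def beam_diffusion(init, denoise_sample, score, steps, k=4, m=16):
    """At each reverse-diffusion step, draw m samples per beam entry,
    then keep only the k highest-scoring candidates."""
    beam = [init]
    for _ in range(steps):
        candidates = [denoise_sample(node) for node in beam for _ in range(m)]
        beam = heapq.nlargest(k, candidates, key=score)
    return max(beam, key=score)

# Toy stand-in: "denoising" nudges an integer state toward a target of 42.
random.seed(0)
result = beam_diffusion(
    init=0,
    denoise_sample=lambda x: x + random.choice([-1, 0, 1, 2]),
    score=lambda x: -abs(x - 42),
    steps=60, k=4, m=8,
)
```

In the paper's setting the state would be a partial program tree, the sampler a learned denoiser, and the score something like reconstruction error against the target image.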
I've read several times that the author listed last is usually the most significant contributor, and the first author the least significant, due to some kind of tradition around modesty plus favourably introducing new names. (Which then of course doesn't work, if everyone knows it's happening...)
Here, you've interpreted it as the reverse (and by that I mean in the sensible way): did you not know about the tradition, or are you wrong? And how can we know either way for sure?
If you're reading papers, most of the time the last author is more meaningful to you because they're the senior researcher: you know their research interests and what kind of papers they produce. The first authors are PhD students and postdocs; they change much more often.
Where did you read that?
That's definitely not the case in machine learning.