If you do naive matrix multiplication, you get a sense that you're doing similar work multiple times, but it's hard to quantify just what that duplicated work entails. Compare it to, for example, calculating the size of the union of two sets:
Total size = size(A) + size(B) - size(intersection(A, B))
You have to take out that extra intersection amount because you've counted it twice. What if you could avoid counting it twice in the first place? That's easy, you just iterate over each set once, keeping track of the elements you've already seen.
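In code, the "count it once in the first place" version is just a single pass with a seen-set (a trivial sketch):

    def union_size(a, b):
        """Count |A union B| in one pass, never counting an element twice."""
        seen = set()
        for x in list(a) + list(b):
            seen.add(x)  # duplicates are simply absorbed
        return len(seen)

    assert union_size({1, 2, 3}, {2, 3, 4}) == 4  # = 3 + 3 - 2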
Strassen's algorithm keeps track of calculations that are needed later on. It's all reminiscent of dynamic programming.
What I find interesting is that it seems the extra savings requires complex values. There must be something going on in the complex plane that is again over-counting with the naive approach.
More details: the Winograd scheme computes (x1+y2)(x2+y1) + (x3+y4)(x4+y3) - Ai - Bj, and relies on the y2y1 term (which comes from expanding the first brackets) cancelling with the y1y2 term in Bj = y1y2 + y3y4. This is fine when working with numbers, but if you want to apply the algorithm recursively to large matrices, at the highest level of recursion you're going to work with 4x4 block matrices (where each block is a big matrix itself), and Y2Y1 != Y1Y2 for general matrices.
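To see the cancellation concretely, here is a quick numeric check for scalars (a sketch; Ai and Bj follow the comment's notation, and the whole point is that the check only passes because scalars commute):

    import random

    def winograd_dot4(x, y):
        # (x1+y2)(x2+y1) + (x3+y4)(x4+y3) - Ai - Bj
        x1, x2, x3, x4 = x
        y1, y2, y3, y4 = y
        Ai = x1*x2 + x3*x4  # reusable across a whole row of the matrix
        Bj = y1*y2 + y3*y4  # reusable across a whole column
        # Expanding leaves stray terms (y2y1 - y1y2) and (y4y3 - y3y4),
        # which vanish for commuting entries but not for matrix blocks.
        return (x1+y2)*(x2+y1) + (x3+y4)*(x4+y3) - Ai - Bj

    x = [random.random() for _ in range(4)]
    y = [random.random() for _ in range(4)]
    assert abs(winograd_dot4(x, y) - sum(a*b for a, b in zip(x, y))) < 1e-9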
Here is a website that tracks fastest (recursively applicable) matrix multiplication algorithms for different matrix sizes, and it stands at 49: https://fmm.univ-lille.fr/4x4x4.html
UPD: s/fields/rings/ and fixed equation rendering
On the other hand, it's claimed here that an algorithm using only 46 multiplications has been known since 1970: https://mathstodon.xyz/@fredrikj/114508287537669113
1. It is a standard example of the divide and conquer approach to algorithm design, not the dynamic programming approach. (I'm not even sure how you'd squint at it to convert it into a dynamic programming problem.)
2. Strassen's does not require complex valued matrices. Everything can be done in the real numbers.
In other words, the power of Strassen's algorithm comes from a strategy that's similar to / reminiscent of dynamic programming.
"The rank of a tensor depends on the field over which the tensor is decomposed. It is known that some real tensors may admit a complex decomposition whose rank is strictly less than the rank of a real decomposition of the same tensor."
https://en.wikipedia.org/wiki/Tensor_rank_decomposition#Fiel...
However, when researchers (and systems like AlphaEvolve in this context) analyze fast matrix multiplication algorithms like Strassen's, the primary goal is usually to improve the asymptotic complexity (and understand the space of these algorithms better). This complexity is determined by the number of multiplications in the field over which the matrices are defined.
* For real matrices, we count real scalar multiplications.
* For complex-valued matrices (as in the 4x4 example where AlphaEvolve found a solution with 48 scalar multiplications), "scalar multiplication" refers to a complex scalar multiplication.
The key is that these are the operations you recurse on. Additions, or the constant factor cost of implementing one field's multiplication, don't change the exponent in the `N^(log_base(multiplications))` complexity. They are constant factors.
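For instance (a quick sketch; the 2x2/7 and 4x4/48 counts are the ones discussed in this thread):

    from math import log

    # Exponent for recursively applying a base case that multiplies
    # b x b matrices using m scalar multiplications: N^(log_b(m)).
    def exponent(b, m):
        return log(m, b)

    print(exponent(2, 8))   # naive 2x2 block recursion: 3.0
    print(exponent(2, 7))   # Strassen: ~2.807
    print(exponent(4, 48))  # the 48-multiplication 4x4 scheme: ~2.7925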
Of course, for practical performance on a specific 4x4 matrix, one would absolutely dive into the real operations, additions, memory layout, etc., but making 4x4 matrix multiplication practically faster on some particular hardware was not the focus in this section. (We do improve practical implementation of large matrix multiplications on the target hardware in the “Enhancing AI training and inference” section of the blog post.)
(Disclaimer: one of the authors.)
> In roughly 75% of cases, it rediscovered state-of-the-art solutions, to the best of our knowledge.
> And in 20% of cases, AlphaEvolve improved the previously best known solutions
These sound like incredible results. I'd be curious what kind of improvements were made.
Like, was that "up to a 32.5% speedup" on some weird edge case, with negligible speedup otherwise? Would love to see the benchmarks.
It's a very impressive result: not magic, but also not cheating!
FA achieving a 32.5% speedup? Cool.
Why not submit it as a PR to the Flash Attention repo then? Can I read about it more in detail?
""" Yesterday's news about Sakana AI Labs provided an important lesson for all of us working with AI agents. Their announcement of an AI system that could supposedly optimize CUDA kernels to run 100x faster initially seemed like exactly the kind of use cases we've been hoping for in AI-assisted development.
Like many others, I was excited about it. After all, isn't this exactly what we want AI to do - help us optimize and improve our technical systems?
However, careful investigation by the community (on Twitter) revealed a different story. What really happened? The AI-generated CUDA kernel appeared to achieve incredible speedups, but the code was inadvertently reusing memory buffers containing previous results, essentially bypassing the actual computation. When properly evaluated, the kernel actually runs about 3x slower than the baseline. """
[1] https://www.linkedin.com/posts/ravid-shwartz-ziv-8bb18761_ye...
But how incremental are these advancements?
I picked one at random (B.2 -- the second autocorrelation inequality). Then, I looked up the paper that produced the previous state of the art (https://arxiv.org/pdf/0907.1379). It turns out that the authors had themselves found the upper bound by performing a numerical search using "Mathematica 6" (p.4). Not only did the authors consider this as a secondary contribution (p.2), but they also argued that finding something better was very doable, but not worth the pain:
"We remark that all this could be done rigorously, but one needs to control the error arising from the discretization, and the sheer documentation of it is simply not worth the effort, in view of the minimal gain." (p.5)
So at least in this case it looks like the advancement produced by AlphaEvolve was quite incremental (still cool!).
This is complex automation, which by definition compresses the solution into a computable process that works more efficiently than the non-automated process.
That, in fact, is the revolutionary part - you’re changing how energy is used to solve the problem.
Like even just for programming. I just had an AI instrument my app for tracing, something I wanted to do for a while, but I didn't know how to do and didn't feel like figuring out how to do it. That's not work we were likely to hire someone to do or that would ever get done if the AI wasn't there. It's a small thing, but small things add up.
LLMs are undoubtedly useful at tasks like code "optimisation" and detecting patterns or redundancies that humans might overlook, but this announcement feels like another polished, hypey blog post from Google.
What's also becoming increasingly confusing is their use of the "Alpha" branding. Originally, it was for breakthroughs like AlphaGo or AlphaFold, where there was a clear leap in performance and methodology. Now it's being applied to systems that, while sophisticated, don't really rise to the same level of impact.
edit: I missed the evaluator in my description, but an evaluation method is also applied in Co-Scientist:
"The AI co-scientist leverages test-time compute scaling to iteratively reason, evolve, and improve outputs. Key reasoning steps include self-play–based scientific debate for novel hypothesis generation, ranking tournaments for hypothesis comparison, and an "evolution" process for quality improvement."[0]
[0]: https://research.google/blog/accelerating-scientific-breakth...
"While AI Co-Scientist represents scientific hypotheses and their evaluation criteria in natural language, AlphaEvolve focuses on evolving code, and directs evolution using programmatic evaluation functions. This choice enables us to substantially sidestep LLM hallucinations, which allows AlphaEvolve to carry on the evolution process for a large number of time steps."
I don't know if I would call this the fabled "self-improving feedback loop", but it seems to have some degree of it. It also raises the question of whether AlphaEvolve was being developed for a year, or has been in production for a year. By now it makes sense to hold back on sharing what AI research gems you have discovered.
In the specific context of improving our AI hardware, for example, it's not as simple as coming up with a good idea -- hardware companies hire thousands of people to improve their designs. Prototypes need to be implemented, verified, quantified, compared thoroughly with the alternatives, then the idea is approved for production, which again leads to a cascade of implementation, verification, etc. until they can reach consumers. In order to make these improvements reach the consumer significantly faster you need to accelerate all of the steps of the very simplified pipeline mentioned earlier.
More generally, an argument can be made that we have been in that take off feedback loop for hundreds of years; it's just that the rate of improvement hasn't been as spectacular as we may have hoped for because each incremental step simply isn't that big of a deal and it takes quite a bit of time to reach the next one.
"DeepMind AI Reduces Google Data Centre Cooling Bill by 40%"
https://deepmind.google/discover/blog/deepmind-ai-reduces-go...
Not to sound metaphysical or anything, but dependency on artificial intelligence seems to be something you would find at the peak of Mount Stupid (where the Darwin Awards are kept).
I am late for a chess game, l8r sk8rs.
The use of synthetic data from prior models to create both superior models and distilled models has been going on since at least OpenAI's introduction of RLHF, and probably before that too.
That’s distinct from those prior models providing actual code to improve the next model.
>In AlphaEvolve, the evolutionary database implements an algorithm that is inspired by a combination of the MAP elites algorithm [71] and island-based population models [80, 94].
"inspired by" is doing a lot of heavy lifting in this sentence. How do you choose dimensions of variation to do MAP-elites? How do you combine these two algorithms? How loose is the inspiration? It feels like a lot of the secret sauce is in the answers to these questions, and we get a single paragraph on how the evolution procedure works, which is so vague as to tell us almost nothing.
Agreed, the dimensions/features are key. These white papers are an insult to science...
> By suggesting modifications in the standard language of chip designers, AlphaEvolve promotes a collaborative approach between AI and hardware engineers to accelerate the design of future specialized chips.
> AlphaEvolve was able to find a simple code rewrite (within an arithmetic unit within the matmul unit) that removed unnecessary bits, a change validated by TPU designers for correctness.
I speculate this could refer to the upper bits in the output of a MAC circuit being unused in a downstream connection (perhaps to an accumulation register). It could also involve unused bits in a specialized MAC circuit for a non-standard datatype.
> While this specific improvement was also independently caught by downstream synthesis tools, AlphaEvolve’s contribution at the RTL stage demonstrates its capability to refine source RTL and provide optimizations early in the design flow.
As the authors admit, this bit-level optimization was automatically performed by the synthesis tool (the software-world equivalent is dead code elimination performed by a compiler). They seem to claim it is better to perform this bit-truncation explicitly in the source RTL rather than letting synthesis handle it. I find this dubious: synthesis guarantees that the optimizations it performs do not change the semantics of the circuit, while making a change in the source RTL could change the semantics (vs the original source RTL) and requires human intervention to check semantic equivalence. The exception is when certain optimizations rely on assumptions about the values seen within the circuit at runtime: synthesis will assume the most conservative situation, where all circuit inputs are arbitrary.
I do agree that this reveals a deficiency in existing synthesis flows: they are unable to back-annotate the source RTL with the specific lines/bits that were stripped out of the final netlist, so humans can check whether synthesis did indeed perform an expected optimization.
> This early exploration demonstrates a novel approach where LLM-powered code evolution assists in hardware design, potentially reducing time to market.
I think they are vastly overselling what AlphaEvolve was able to achieve. That isn't to say anything about the potential utility of LLMs for RTL design or optimization.
You can't write an evaluation function for general "intelligence"...
> AlphaEvolve enhanced the efficiency of Google's data centers, chip design and AI training processes — *including training the large language models underlying AlphaEvolve itself*.
Singularity people have been talking for decades about AI improving itself better than humans could, and how that results in runaway compounding growth of superintelligence, and now it's here.
If it takes you a week to find a 1% speedup, and the next 0.7% speedup takes you 2 weeks to find ... well, by using the 1% speedup the next one only takes you 13.86 days. This kind of small optimization doesn't lead to exponential gains.
That doesn't mean it's not worthwhile - it's great to save power & money and reduce iteration time by a small amount. And it combines with other optimizations over time. But this is in no way an example of the kind of thing that the singularity folks envisioned, regardless of the realism of their vision or not.
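To make that concrete, here's a toy calculation with made-up numbers extending the 1%/0.7% example (each successive speedup is smaller and takes longer to find):

    days, speed = 0.0, 1.0
    gain, search_time = 0.01, 7.0
    for _ in range(50):
        days += search_time / speed  # searching goes faster as speed compounds
        speed *= 1 + gain
        gain *= 0.7                  # each successive speedup is smaller...
        search_time *= 2             # ...and takes longer to find
    print(days, speed)  # days grows without bound; speed converges near 1.034

The geometrically shrinking gains multiply out to a modest constant factor, while the search time diverges: reinvesting the speedups never produces a blow-up.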
Basically, singularity assumes that you can take information about the real-world "state", compress it into some form, and predict the state change faster than reality happens. For a subset of the world, this is definitely possible. But for reality as a whole, it seems that there is a whole bunch of processes that are computationally irreducible, so an AI would never be able to "stay ahead", so to speak. There is also computational irreversibility: for example, observing human behavior is seeing the output of a one-way hashing function of the neural processes in our brain, which hides a lot of detail and doesn't let you predict them accurately in all cases.
Also, optimization algorithms are nothing new. Even before AI, you could run a genetic algorithm or PSO on code, and given enough compute it would optimize the algorithm, including itself. The hard part that nobody has solved is abstracting this to a low enough level that it's applicable across the multiple layers that correspond to any task.
For example, say you have a model (or rather an algorithm) that has only a single interface, the ability to send ethernet packets, and it hasn't been trained on any real-world data at all. If you task it with building you a website that makes money, the same algorithm that iterates over figuring out how to send IP packets, then TCP packets, then HTTP packets should also be able to figure out what the modern world wide web looks like and what concepts like "website" and "money" are, building its knowledge graph, searching it, and interpolating on it to figure out how to solve the problem.
Some related work from a different company: https://sakana.ai/ai-cuda-engineer/
And some academic papers kind of in this space: https://arxiv.org/abs/2206.08896, https://arxiv.org/abs/2302.12170, https://arxiv.org/abs/2401.07102
Making a significant improvement to the state of the art of one particular algorithm is one thing, but I've seen new tools do that since the 80s.
I'll be convinced when LLMs start making valuable pull requests to mature FOSS projects that fix non-obvious corner cases or non-trivial bugs.
On the other side, I see excitement that the singularity is here.
If the latter were the case surely we wouldn't be reading about it in a published paper, we would already know.
[1] Blog: https://deepmind.google/discover/blog/discovering-novel-algo...
[2] Paper: https://www.nature.com/articles/s41586-022-05172-4
[3] https://arxiv.org/pdf/2210.04045
[4] https://arxiv.org/abs/2212.01175 Flip graphs for matrix multiplication
(Reposted from here, where I made a mini deep-dive into this: https://x.com/friederrrr/status/1922846803420119410?t=7jZ34P...)
As a corollary, once you add in self-play with random variation, the synthetic data problem is solved for coding, math, and some classes of scientific reasoning. No more model collapse, no more massive teams of PhDs needed for human labeling, as long as you have a reliable metric for answer quality.
This isn't just neat, it's important - as we run out of useful human-generated data, RL scaling is the best candidate to take over where pretraining left off.
I guess that's now becoming true with LLMs.
Faster LLMs -> More intelligence
Couldn't you say that, if you squint hard enough, a GA looks like a category of RL? There are certainly a lot of similarities, the main difference being how each new population of solutions is generated. I would not at all be surprised if they're using a GA/RL hybrid.
If variety is sought, why not use beam search with a nice population statistic?
They are having some success in making it work internally. Maybe only the team that built it can get it to work? But it does seem promising.
As far as I can read, the weights of the LLM are not modified. They do some kind of candidate selection via evolutionary algorithms for the LLM prompt, which the LLM then remixes. This process then iterates like a typical evolutionary algorithm.
I suppose you could consider that last part (optimizing some metric) "RL".
However, it's missing a key concept of RL which is the exploration/exploitation tradeoff.
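A rough sketch of the loop as described above (my reading, not the paper's pseudocode; `llm_remix` and `score` are placeholders, and the epsilon-greedy knob is a crude stand-in for the missing exploration/exploitation tradeoff):

    import random

    def evolve(seed_programs, llm_remix, score, steps=1000, epsilon=0.1):
        population = [(score(p), p) for p in seed_programs]
        for _ in range(steps):
            if random.random() < epsilon:                  # explore: random parent
                parent = random.choice(population)[1]
            else:                                          # exploit: best so far
                parent = max(population, key=lambda t: t[0])[1]
            child = llm_remix(parent)                      # LLM weights never change
            population.append((score(child), child))
            population.sort(key=lambda t: t[0], reverse=True)
            population = population[:100]                  # survivor selection
        return max(population, key=lambda t: t[0])[1]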
There are monopolies on the coolest sets of data in almost all industries, all the RL in the world won't do us any good if those companies doing the data hoarding are only using it to forecast outcomes that will make them more money, not what can be done to better society.
Philosophically, let's say you talk an old LLM through a new discovery. Thanks to your instruction, the LLM now has access to "new" information not in its training data. It is certainly capable of this. The problem in is that this is just laundered human intelligence.
Not saying it is that egregious, but it's a slippery slope from "well, it didn't do all these different things out of the box, unsupervised".
For this reason, hype-driven/novelty-driven sites like HN usually overestimate initial developments, because they overestimate the merit term, and then underestimate later developments - because they now overestimate the error term from their earlier experience.
Deep learning systems have exceeded the hype. In 2016 we saw potential with models like AlphaGo Zero but no one could foresee the capability of LLMs (a type of deep learning model).
They have other models with different names used for different purposes.
It optimized `initializers.normal(0.0` to `initializers.normal(0 + 1j * 0,`
I thought the results were being reviewed? Anyway, impressive results. That's why OpenAI and Elon were so frightened of Hassabis.
> Most of these discoveries are on open problems suggested to us by external mathematicians Javier Gomez Serrano and Terence Tao, who also advised on how to best formulate them as inputs to AlphaEvolve. This highlights the potential for synergistic partnerships between AI-driven discovery engines like AlphaEvolve and human mathematical expertise.
In 1996, they optimized an FPGA using a genetic algorithm. It evolved gates disconnected from the circuit that were nonetheless required.
The circuit exploited the minuscule magnetic fields from the disconnected gates rather than the logical connections.
If it goes well, I could open source it.
What are the things you would want to optimize with such a framework? (So far I've been focusing on optimizing ML training and architecture search itself). Hearing other ideas would help motivate me to open source if there's real demand for something like this.
In my case, I'd mainly be interested in mathematics: I'd provide a mathematical problem and a baseline algorithm for it and would want an open source framework to be able to improve on that.
Had mentioned the same on X: https://x.com/friederrrr/status/1922850981181784152?t=usXpK1...
...but Waksman's algorithm from 1970 [1] multiplies two 4 x 4 complex-valued matrices using only 46 multiplications (indeed, it works in any ring admitting division by 2).
Sloppy by DeepMind and by Nature to publish such a claim - did they not ask someone knowledgeable about matrix multiplication to review the work?
1. Waksman's algorithm works in any commutative ring admitting division by 2.
2. In particular, it won't work when the matrix entries are themselves matrices, which means you can't use it recursively to get an algorithm for n-by-n matrices with large n with a better exponent than you get from Strassen's algorithm.
3. The DeepMind paper is annoyingly inexplicit about whether the algorithm it reports has that property or not.
4. What they say about tensors suggests that their algorithm can be used recursively to do better than Strassen (note, though, that there are other algorithms that are substantially better for very large n, and using their algorithm recursively would very much not outperform those), but it's possible I've misunderstood.
5. They explicitly talk about complex-valued matrices, but I think they don't mean "complex numbers as opposed to matrices, so you can't do this recursively" but "complex numbers as opposed to real numbers, so our algorithm doesn't get you a 4x4 matmul using 48 real multiplications".
I am not certain about points 4 and 5. The language in the paper is a bit vague. There may be supporting material with more details but I haven't looked.
2. Correct; however, you can use Waksman as a base case and always beat Strassen (though it is not asymptotically better, of course).
5. Possible, but even so, there is already an algorithm that will work with 46 real multiplications (and some divisions by 2). The real numbers are commutative and admit division by 2.
Is this basically a merge of LLMs with genetic algorithm iteration?
The following algebraic point of view could be utter hogwash, so I might embarrass myself... but if you think about it, the "merge" operation is isomorphic to the product in a free commutative monoid (over a large number of generators, otherwise you can use Counting Sort). So sorting is all about computing a bunch of products (merges) in the optimal order. Now consider mergesort, insertion sort, Timsort.
- One lesson DeepMind drew from AlphaCode, AlphaTensor, and AlphaChip is that large‑scale pre‑training, combined with carefully chosen inductive biases, enables models to solve specialized problems at—or above—human performance.
- These systems still require curated datasets and experts who can hand‑design task‑specific pipelines.
- Conceptually, this work is an improved version of FunSearch (https://github.com/google-deepmind/funsearch/).
- In broad terms, FunSearch (and AlphaEvolve) follow three core design principles:
- Off‑the‑shelf LLMs can both generate code and recall domain knowledge. The “knowledge retrieval” stage may hallucinate, but—because the knowledge is expressed as code—we can execute it and validate the result against a custom evaluation function.
- Gradient descent is not an option for discrete code; a zeroth‑order optimizer—specifically evolutionary search—is required.
- During evolution we bias toward (1) _succinct_ programs and (2) _novel_ programs. Succinctness is approximated by program length; novelty is encouraged via a MAP‑Elites–style “novelty bias,” yielding a three‑dimensional Pareto frontier whose axes are _performance, simplicity,_ and _novelty_ (see e.g. OE‑Dreamer: https://claireaoi.github.io/OE-Dreamer/). A sketch of the Pareto filter follows below.
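(A minimal sketch of such a Pareto filter, with hypothetical 3-tuple scores; not the actual FunSearch/AlphaEvolve code:)

    def pareto_front(candidates):
        # candidates: list of (program, (performance, simplicity, novelty)),
        # where higher is better on every axis; keep the non-dominated ones.
        def dominates(a, b):
            return all(x >= y for x, y in zip(a, b)) and a != b
        return [(p, s) for p, s in candidates
                if not any(dominates(s2, s) for _, s2 in candidates)]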
Pros
- Any general‑purpose foundation model can be coupled with evolutionary search.
- A domain expert merely supplies a Python evaluation function (with a docstring explaining domain‑specific details). Most scientists I've talked with - astronomers, seismologists, neuroscientists, etc. - already maintain such evaluation functions for their own code.
- The output is an interpretable program; even if it overfits or ignores a corner case, it often provides valuable insight into the regimes where it succeeds.
Cons
- Evolutionary search is compute‑heavy and LLM calls are slow unless heavily optimized. In my projects we need ≈60k LLM calls per iteration to support a reasonable number of islands and populations. In equation discovery we offset cost by making ~99% of mutations purely random; every extra 1% of LLM‑generated mutations yields roughly a 10% increase in high‑performing programs across the population.
- Evaluation functions typically undergo many refinement cycles; without careful curation the search may converge to a useless program that exploits loopholes in the metric.
Additional heuristics make the search practical. If your evaluator is slow, overlap it with LLM calls (see the sketch below). To foster diversity, try dissimilar training: run models trained on different data subsets and let them compete. Interestingly, a smaller model (e.g., Llama‑3 8B) often outperforms a larger one (Llama‑3 70B) simply because it emits shorter programs.
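For the overlap trick, something like this thread-pool sketch works (`call_llm` and `evaluate` are placeholders, not names from the paper):

    from concurrent.futures import ThreadPoolExecutor, as_completed

    def generate_and_evaluate(prompts, call_llm, evaluate, workers=8):
        # Evaluate each program as soon as its LLM call returns, so slow
        # evaluations overlap with the LLM calls still in flight.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            gen = [pool.submit(call_llm, p) for p in prompts]
            evals = [pool.submit(evaluate, f.result()) for f in as_completed(gen)]
            return [f.result() for f in evals]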
1. Why does it need a zeroth order optimizer?
2. Most GAs I've seen use thousands of solutions. Sometimes ten thousand or more. What leads you to use 60,000 calls per iteration?
3. How do you use populations and "islands?" I never studied using islands.
4. You said the smaller models are often better for "shorter" code. That makes sense. I've seen people extend the context of model with training passes. You think it would help to similarly shrink a larger model to a smaller context instead of using the small models?
1. Because we only have blackbox access to the LLM and the evaluation function might not be differentiable.
2. We're trying to search over the space of all programs in a programming language. To cover enough of this huge search space, we need to instantiate (1) a large number of programs in each population, (2) a large number of populations, and (3) a large number of update steps for each population.
3. I have a couple of graphics motivating, conceptually, what an island/population looks like: https://trishullab.github.io/lasr-web/ . This whitepaper might also be useful: https://arxiv.org/abs/2305.01582 (a toy sketch of the island model follows after question 4).
4. This is an interesting question. I believe so. However, my observations were derived from a non-Turing-complete language (mathematical equations). There might be other ways of enforcing succinctness pressure.
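For question 3, a toy picture of the island model in code (my illustration, not LaSR's or AlphaEvolve's actual implementation; `mutate` and `fitness` are placeholders):

    import random

    def island_step(islands, mutate, fitness, step, migrate_every=10):
        for pop in islands:                   # each island evolves in isolation,
            pop.sort(key=fitness, reverse=True)  # preserving diversity
            half = len(pop) // 2
            # replace the bottom half with mutated copies of the top half
            pop[half:] = [mutate(random.choice(pop[:half]))
                          for _ in range(len(pop) - half)]
        if step % migrate_every == 0:         # occasional migration spreads
            for i, pop in enumerate(islands): # good solutions between islands
                islands[(i + 1) % len(islands)].append(max(pop, key=fitness))
        return islands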
Can you give a concrete example of this? It's hard for me to conceptualize.
Let's say the model hallucinates in two directions:
1. "There is a trigonometric relationship between variable F and x". It expresses this as ``F = -C_1*sin(x)``. You fit the constant C_1 w.r.t the dataset, execute the program, and your best fit has a high error. You can discard the program.
2. "There is an inverse linear relationship between variable F and x". Now it expresses this as ``F = -C_1*x``. You fit the constant C_1 w.r.t the dataset, execute the program, and your best fit has extremely low error. You now know for sure that you're on the right track.
Does anybody know how they can guarantee uniqueness of the searched snippet within a code block, or whether that's even possible?
Nowadays it makes much more sense to share less.
I honestly have no idea how AlphaEvolve works - does it work purely at the text level? Meaning, might I be able to come up with something like AlphaEvolve with some EC2 instances and Gemini API access?
I'm reading descriptions of agents and it just seems like the same tech deployed with the authority to write, plus a scheduler.
Packing problems are hard, and it is fun to see new interest in the area given these show up in weird places. =3
Just wait until the MBAs and politicians learn about this Adam Smith guy. A pipedream now, but maybe in the future schools will be inspired to teach about dialectical reasoning and rediscover Socrates.
[end of snark]
Sorry, I'm getting tired of ad-fueled corporations trying to get me to outsource critical thinking.
The problems in math and CS are more suitable for training LLMs?
Some of these questions change slightly, since we might end up with "unlimited resources" (i.e. instead of having e.g. 5 engineers on a team who can only get X done per sprint, we effectively have near-limitless compute to use instead) so maybe the answer is "build everything on the wish-list in 1 day" to the "what should we prioritize" type questions?
Interesting times.
My gut is that software engineers will end up as glorified test engineers, coming up with test cases (even if not actually writing the code) and asking the AI to write code until it passes.
That isn't every problem in software engineering.
> about: I believe in the creation of a machine god
Sounds about right.
The guy's making a prediction. Classifying it as some kind of religious zealotry isn't fair to his point or him.
If the cost of developing the software is 0, you can just build both.
The key ingredient for an intelligence explosion is AI accelerating development of AI.
This is it. It’s happening.
Remember, this approach only works for optimizing an already-defined behavior of a function that has a correspondingly well-defined evaluation metric.
You can't write an evaluation function for each individual piece of "intelligence", let alone for general "intelligence"...
Either way, it's pretty wild.
I'm now no longer surprised at just how consistently all the Gemini models overcomplicate coding challenges or just plain get them wrong.
Claude is just consistently spot on. A few salient comments for tricky code, instead of incessantly telling me what it's changed and what I might want to do, making incorrect assumptions when it already has the code or it's something we've discussed, or changing large amounts of unrelated code (e.g. styles). I could go on.
Shame I'm too tight to pay for Claude RN though...
You can try asking it to not do that, but I would bet it would slightly degrade code quality.
You can take the code and give it to another LLM instance and ask it to strip all comments.