- inline-for-size
We trained the inlining-for-size policy on a large internal software package containing 30k modules. The trained policy is generalizable when applied to compile other software and achieves a 3% ~ 7% size reduction.
- regalloc with 0.3% ~1.5% improvements in queries per second (QPS) on a set of internal large-scale datacenter applications
Try it Yourself
Check out the open-sourced end-to-end data collection and training solution on github and a demo that uses policy gradient to train an inlining-for-size policy.
https://github.com/google/ml-compiler-opt
https://github.com/google/ml-compiler-opt/blob/main/docs/demo/demo.md
With code, that's awesome; that's what I like to see.
1. they are usually hard to run efficiently
2. they are usually hard to explain
The former is definitely changing with low-precision formats like fp16 and useful coprocessors that can do matrix multiplications efficiently (Apple M1, Intel). The latter hasn't been developed much, and unless you're just training a model to memorize the entire space the heuristic operates in, it can be scary to trust it on unseen data.
Analogy: otherwise you're just optimising the design of a car. But optimising it for what? Speed, efficiency, reliability, price, weight, carrying capacity... You first need to know how it's expected to be used.
I guess local inlining might sometimes be an unconditional win, but even then only under specific circumstances.
(disclaimer: I know something but am not an expert)
> Better code optimization can significantly reduce the operational cost of large datacenter applications
They’re aiming to spend a bit more time compiling in order to reduce operational costs at scale going forward.
It also is not the title of the referenced page, which is “MLGO: A Machine Learning Framework for Compiler Optimization”.
This is about an LLVM extension that uses Machine Learning.
I think it would be better to change the title here.
I think a rename would help clearly outline the subject. Something like 'Google Deep-Learning-Based Compiler Achieves 3-7% Reduction in Size' would get the point across.
I had the same question as the parent commenter and had to wait until the page loaded before determining whether the article was about machine learning or the programming language.
The first meaning is extremely common; the top two results for "ML compiler" on Google are [1][2], both using it in the first sense. It's not that ML or ML-like techniques in compiler writing are new, but the first sense is definitely at least as popular an interpretation.
[1] https://huyenchip.com/2021/09/07/a-friendly-introduction-to-...
[2] https://petewarden.com/2021/12/24/why-are-ml-compilers-so-ha...
People think of compilers as AI, when they're actually very limited in terms of the number of decisions available to them.
Feedback is the lifeblood of intelligent performance; it is more than possible to fake that feedback using AI.
E.g. your error callback is, on the balance of probability, going to be called less often than (say) the core matrix-multiply loop, etc.
I also played around with using ML to optimise auto-scaling of CI instances (taking time of day and queue sizes into account).
I don't think this is a fair characterization, though I agree overall that ML has a lot of potential in compiler optimization.
I also wonder how this would look for mainlining. Should the LLVM project depend on TensorFlow now? IIRC TensorFlow itself depends on LLVM, so to avoid circular dependencies, does there have to be an ML-free version of LLVM that TensorFlow depends on, which is then used by the proper LLVM? Or can inference be converted into simple C like it was for LPCNet?

Lastly, there is the general question of integrating ML models into open source projects. Say it is merged and the old manual heuristic is deleted. What if Google one day decides they don't want to maintain the component any more? Can LLVM maintainers do any refactors of the code around it? Unless Google also shares their training infrastructure, LLVM maintainers can't re-train the model on the post-refactor data.
This bit from the article seems to be relevant:
"The TensorFlow model is embedded with XLA AOT, which converts the model into executable code. This avoids TensorFlow runtime dependency"
It’d be pretty cool to see an x86 language model in future CPUs. I have no doubt that compute will continue to scale faster than memory access and the relative cost of pipeline stalls will never go down.
It's usually done heuristically, in a fashion that scales badly with how many instructions you're considering at a time, so it's really easy to overfit to today's benchmark.
I'm very familiar with both inlining and register allocation in GCC (I even helped write one of the register allocation rewrite attempts, until Vlad went even further down the rabbit hole of writing register allocators than we did).
RA itself is historically more advanced in GCC in the amount of optimization it does, but it was still mostly a wash, because it's much easier to work with LLVM's form, and better optimizers were written.
LLVM also nowadays has a fairly advanced register allocator if you want it (PBQP); it produces near-optimal register allocations. It's not used by default, though; it hasn't been worth it.
As for inlining, the two are not usefully comparable: they are very different approaches, both are good, and during the last inliner rewrite in LLVM, large numbers of comparisons were done many times over. There isn't anything to suggest that GCC's inliner is really better (unlike RA, where the default GCC algorithm is certainly more advanced than the default LLVM RA, despite that being a deliberate decision).
We spent a lot of time optimizing inlining, RA, and other such code when we transitioned to LLVM at Google (many years ago now). When we made that transition, it was a net positive for the fleet, and that was on a very large codebase.
The truth about compilers is there are no silver bullets. Things get faster mostly through careful tuning (which ML will hopefully replace), and much more rarely due to better algorithms.
As a result, I would expect GCC to not do any better here than LLVM.
The licence doesn't help, but most of this work is open source anyway.
What do you mean about its license?
Reproducibility is a big part of Google's internal build system, and they wouldn't be able to deploy something that broke that.
Inlining by itself only decreases code size if there's provably a single call site. What's doing the work here is branch folding after the inline.
An alternative is specialising the function with respect to the predicate and rewriting call sites to use the specialisation when the predicate is known.
That's harder than inlining branchy things and hoping for the best, but it's tractable, and it bounds the code-size increase.
It seems like it's sort of built into LLVM right now, but not usable unless you build LLVM yourself with pretrained models as part of the build process?
Of course, that assumes you can profile your program with realistic "production" workloads, and it'll need two compilation passes, so having sensible ML defaults makes sense.
But it's odd not to compare this to PGO in terms of the resulting performance.
Although I do agree that these days most projects will have media that takes up the majority of the space.
The stat that seems clearly impressive for me is that their register allocator claims "0.3% ~1.5% improvements in queries per second" which is a huge cost savings for operations at the scale of Google. If you have 100 datacenters of software running you can conceivably turn one of them off. (Or more likely slow down future building and expansion plans). Of course for most people compute costs aren't a significant expense.
OK.