- inline-for-size
We trained the inlining-for-size policy on a large internal software package containing 30k modules. The trained policy is generalizable when applied to compile other software and achieves a 3% ~ 7% size reduction.
- regalloc with 0.3% ~1.5% improvements in queries per second (QPS) on a set of internal large-scale datacenter applications
Try it Yourself
Check out the open-sourced end-to-end data collection and training solution on github and a demo that uses policy gradient to train an inlining-for-size policy.
https://github.com/google/ml-compiler-opt
https://github.com/google/ml-compiler-opt/blob/main/docs/demo/demo.md
With code, that's awesome; that's what I like to see.
1. they are usually hard to run efficiently
2. they are usually hard to explain
The former is definitely changing with low-precision formats like fp16 and useful coprocessors that can do matrix multiplications efficiently (Apple M1, Intel). The latter hasn't been developed much, and unless you're just training a model to memorize the entire space the heuristic operates in, it can be scary to trust it on unseen data.
Analogy: otherwise you're just optimising the design of a car. But optimising it for what? Speed, efficiency, reliability, price, weight, carrying capacity... You first need to know how it's expected to be used.
I guess local inlining might sometimes be an unconditional win, but even then only under specific circumstances.
(disclaimer: I know something but am not an expert)
> Better code optimization can significantly reduce the operational cost of large datacenter applications
They’re aiming to spend a bit more time compiling in order to reduce operational costs at scale going forward.
It also is not the title of the referenced page, which is “MLGO: A Machine Learning Framework for Compiler Optimization”.
This is about an LLVM extension that uses Machine Learning.
I think it would be better to change the title here.
I think a rename would help clearly outline the subject. Something like 'Google Deep-Learning-Based Compiler Achieves 3-7% Reduction in Size' would get the point across.
I had the same question as the parent commenter and had to wait until the page loaded before determining whether the article was about machine learning or the programming language.
The first meaning is extremely common; the top two results for "ML compiler" on Google are [1][2], both using it in the first sense. It's not that ML or ML-like techniques in compiler writing are new, but the first sense is definitely at least as popular an interpretation.
[1] https://huyenchip.com/2021/09/07/a-friendly-introduction-to-...
[2] https://petewarden.com/2021/12/24/why-are-ml-compilers-so-ha...
People think of compilers as AI, when they're actually very limited in terms of the number of decisions available to them.
Feedback is the lifeblood of intelligent performance; it is more than possible to fake that feedback using AI.
E.g. your error callback is, on the balance of probability, going to be called less often than (say) the core matrix-multiply loop, etc.
I also played around with using ML to optimise auto-scaling of CI instances (taking time of day and queue sizes into account).
I don't think this is a fair characterization, though I agree overall that ML has a lot of potential in compiler optimization.
I also wonder how this would look for mainlining. Should the LLVM project depend on TensorFlow now? IIRC TensorFlow itself depends on LLVM, so to avoid circular dependencies, does there have to be an ML-free version of LLVM that TensorFlow depends on, which is then used by the proper LLVM? Or can inference be converted into simple C like it was for LPCNet?

Lastly, there is the general question of integrating ML models into open source projects. Say it is merged and the old manual heuristic is deleted. What if Google one day decides they don't want to maintain the component any more? Can LLVM maintainers do any refactors of the code around it? Unless Google also shares their training infrastructure, LLVM maintainers can't re-train the model on the post-refactor data.
This bit from the article seems to be relevant:
"The TensorFlow model is embedded with XLA AOT, which converts the model into executable code. This avoids TensorFlow runtime dependency"
It’d be pretty cool to see an x86 language model in future CPUs. I have no doubt that compute will continue to scale faster than memory access and the relative cost of pipeline stalls will never go down.
It's usually done heuristically, in a fashion that scales badly with how many instructions you're considering at a time, so it's really easy to overfit to today's benchmark.
I'm very familiar with both inlining and register allocation in GCC (I even helped write one of the register allocation rewrite attempts, until Vlad went even further down the rabbit hole of writing register allocators than we did).
RA itself is historically more advanced in GCC in the amount of optimization it does, but it was still mostly a wash, because it's much easier to work with LLVM's form, and better optimizers were written.
LLVM also nowadays has a fairly advanced register allocator if you want it (PBQP); it produces near-optimal register allocations. It's not used by default, though; it hasn't been worth it.
As for inlining, the two are not usefully comparable: they are very different approaches, both are good, and during the last inliner rewrite in LLVM, large numbers of comparisons were done many times over. There isn't anything to suggest that GCC's inliner is really better (unlike RA, where the default GCC algorithm is certainly more advanced than the default LLVM RA, despite that being a deliberate decision).
We spent a lot of time optimizing inlining, RA, and other such code when we transitioned to LLVM at Google (many years ago now). When we made that transition, it was a net positive for the fleet, and that was on a very large codebase.
The truth about compilers is there are no silver bullets. Things get faster mostly through careful tuning (which ML will hopefully replace), and much more rarely due to better algorithms.
As a result, I would expect GCC to not do any better here than LLVM.
The licence doesn't help, but most of this work is open source anyway.
What do you mean about its license?
Reproducibility is a big part of Google's internal build system, and they wouldn't be able to deploy something that broke that.
Inlining by itself only decreases code size if there's provably a single call site. What's doing the work here is branch folding after the inline.
An alternative is specialising the function with respect to the predicate and rewriting call sites to use the specialisation when the predicate is known.
That's harder than inlining branchy things and hoping for the best, but it's tractable, and it bounds the code-size increase.
It seems like it's sort of built into LLVM right now, but not usable unless you build LLVM yourself with pretrained models as part of the build process?
Of course, that assumes you can profile your program with realistic "production" workloads, and it'll need two compilation passes, so having sensible ML defaults makes sense.
But it's odd not to compare this to PGO in terms of the resulting performance.
Although I do agree that these days most projects will have media that takes up the majority of the space.
The stat that seems clearly impressive for me is that their register allocator claims "0.3% ~1.5% improvements in queries per second" which is a huge cost savings for operations at the scale of Google. If you have 100 datacenters of software running you can conceivably turn one of them off. (Or more likely slow down future building and expansion plans). Of course for most people compute costs aren't a significant expense.
OK.