AITemplate, a revolutionary new inference engine by Meta AI

(ai.facebook.com)

73 points · azurezyq · 3y ago · 35 comments

35 comments · 8 top-level

ipiszy · 3y ago · 13 in thread

tl;dr:

Meta is open sourcing AITemplate, an inference engine for both Nvidia and AMD GPUs. Code: https://github.com/facebookincubator/AITemplate.

AITemplate delivers much better perf (1.9x ~ 12.8x) compared to PyTorch eager on SOTA models, including Bert, ResNet, VIT and StableDiffusion.

AITemplate also delivers high perf numbers using AMD GPUs (MI-250). With AITemplate, MI-250 achieves 80% ~ 96% A100 perf on various ResNet / Bert / VIT models.

AITemplate uses sophisticated fusion techniques to optimize perf, including vertical, horizontal, and memory fusions.

btw, I'm one of the authors of AITemplate, happy to answer any questions.

Narew · 3y ago

How does AITemplate's performance compare to state-of-the-art inference engines like TVM or ONNX Runtime? Does AITemplate optimize/quantize the network?

Edit: link for TVM https://tvm.apache.org/

ipiszy · 3y ago

AITemplate only supports fp16 data types with fp16 or fp32 accumulation right now. We are working on supporting more data types and quantization.
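To make the difference between fp16 and fp32 accumulation concrete, here is a minimal sketch in plain Python. It is not AITemplate code: the helper names are hypothetical, and the rounding is emulated on the CPU with the standard `struct` module rather than run on a GPU.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot(a, b, round_acc):
    """Dot product of fp16 inputs; the accumulator is rounded after each add."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = round_acc(acc + to_fp16(x) * to_fp16(y))
    return acc


n = 4096
ones = [1.0] * n

fp16_acc = dot(ones, ones, to_fp16)  # accumulator kept in half precision
fp32_acc = dot(ones, ones, to_fp32)  # accumulator kept in single precision
print(fp16_acc, fp32_acc)
```

With a half-precision accumulator the running sum stalls at 2048, because fp16 cannot represent 2049 and the addition rounds back down; the single-precision accumulator reaches the exact 4096. This is the usual reason to prefer fp32 accumulation for long reductions, at some cost in speed.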

We don't have an official comparison between AITemplate and tvm / onnx for now, but we do have perf numbers like https://github.com/facebookincubator/AITemplate/tree/main/ex..., https://github.com/facebookincubator/AITemplate/tree/main/ex.... Feel free to run these examples on other frameworks and compare perf.

davidatbu · 3y ago

I'd love to hear about this too: especially after running the model through an onnx optimizer, like this one [0].

[0] https://github.com/daquexian/onnx-simplifier

throwaway81523 · 3y ago

Thanks, that is very helpful. Do you have to train the model differently for use with AITemplate? Could it be helpful for Leela Chess Zero (LC0)? I think LC0 has a generic PyTorch backend that is several times slower than its Nvidia-specific CUDA backend. I'm not very clueful about this stuff though.

haolu7 · 3y ago

No, you don't need to train the model differently to use it with AITemplate. Here is an intro example to do inference with AITemplate with a very simple PyTorch model: https://facebookincubator.github.io/AITemplate/tutorial/how_.... For more advanced examples, check out https://github.com/facebookincubator/AITemplate/tree/main/ex...

ipiszy · 3y ago

As @haolu7 mentioned, you can take a pre-trained model and use AITemplate to do model inference. All you need to do is rewrite the model using the AITemplate frontend and map PyTorch params to AITemplate params. Also, AITemplate has limited operator coverage compared to mature frameworks like PyTorch, so you may need to implement your own kernels if necessary (though it already supports Bert, VIT, StableDiffusion, ResNet, Detectron, and general recommendation models).
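As a rough idea of what "mapping params" can look like, here is a small sketch using plain dictionaries. The naming convention (dots replaced by underscores) and the function name are assumptions for illustration, not AITemplate's documented API; the linked tutorial is the authority on what a given model actually needs.

```python
from typing import Any, Dict


def map_param_names(pt_params: Dict[str, Any]) -> Dict[str, Any]:
    """Rename dotted PyTorch-style names to underscore-separated names.

    ASSUMPTION: the target convention replaces "." with "_". This is a
    hypothetical helper, not part of AITemplate.
    """
    mapped: Dict[str, Any] = {}
    for name, value in pt_params.items():
        new_name = name.replace(".", "_")
        if new_name in mapped:
            # Two different source names collapsed onto the same target name.
            raise ValueError(f"name collision after mapping: {new_name!r}")
        mapped[new_name] = value
    return mapped


# Hypothetical parameters; a real model would supply tensors from its state dict.
pt_params = {"dense1.weight": [[0.1, 0.2]], "dense1.bias": [0.0]}
ait_params = map_param_names(pt_params)
print(sorted(ait_params))
```

The collision check matters because names such as `a.b_c` and `a_b.c` would otherwise silently overwrite each other.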

fooblaster · 3y ago

How does the performance compare with TensorRT? I didn't see any benchmarks comparing against that. I expect it to be lower for now, but I'm excited to see what the future brings.

upbeat_general · 3y ago

Do you know of any good explanations of the techniques you used, for those who only touch PyTorch eager and occasionally TorchScript?

ipiszy · 3y ago

You could check the "AITemplate optimizations" section in the blog (https://ai.facebook.com/blog/gpu-inference-engine-nvidia-amd...), and https://github.com/facebookincubator/AITemplate#more-about-a.... The basic idea is to do aggressive kernel fusions.
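For readers who have not met kernel fusion before, the sketch below shows the idea behind vertical fusion in plain Python. It is only an analogy: AITemplate generates GPU kernels, and the function names and the scalar weight here are hypothetical.

```python
from typing import List


def unfused(xs: List[float], w: float, b: float) -> List[float]:
    """Three separate passes, each materializing an intermediate buffer."""
    scaled = [x * w for x in xs]          # pass 1: multiply, write intermediate
    biased = [s + b for s in scaled]      # pass 2: read it back, add bias
    return [max(0.0, v) for v in biased]  # pass 3: read again, apply ReLU


def fused(xs: List[float], w: float, b: float) -> List[float]:
    """One pass: multiply, bias, and ReLU per element, no intermediates."""
    return [max(0.0, x * w + b) for x in xs]


xs = [-2.0, -0.5, 0.0, 0.75, 3.0]
# Same arithmetic in the same order, so the results match exactly.
assert unfused(xs, 0.3, 0.1) == fused(xs, 0.3, 0.1)
```

On a GPU the saving comes from avoiding the round trips to memory between passes, and from launching one kernel instead of three.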

papersnake · 3y ago

Have you tested this on big models involving multi-GPU communication, or do you have any plans to?

ipiszy · 3y ago

For now it's for single GPU inference only.

pretty_dumm_guy · 3y ago

How do you verify the correctness of your fusion operations?

ipiszy · 3y ago

We have a bunch of unit tests and E2E tests that compare numerical results between AITemplate and PyTorch eager.
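A test of that kind usually boils down to running the same computation two ways and comparing within a tolerance. The sketch below shows the pattern in plain Python; it is not taken from the AITemplate test suite, and the tolerances, helper names, and the emulated fp16 path are all assumptions.

```python
import math
import struct
from typing import List


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def reference(xs: List[float], w: float, b: float) -> List[float]:
    """Full-precision baseline: ReLU(x * w + b)."""
    return [max(0.0, x * w + b) for x in xs]


def candidate_fp16(xs: List[float], w: float, b: float) -> List[float]:
    """Same computation with every intermediate rounded to fp16."""
    out = []
    for x in xs:
        y = to_fp16(to_fp16(x) * to_fp16(w))
        y = to_fp16(y + to_fp16(b))
        out.append(max(0.0, y))
    return out


def all_close(actual, expected, rtol=1e-2, atol=1e-2) -> bool:
    """True if every pair is within the relative or the absolute tolerance."""
    if len(actual) != len(expected):
        return False
    return all(
        math.isclose(a, e, rel_tol=rtol, abs_tol=atol)
        for a, e in zip(actual, expected)
    )


xs = [-2.0, -0.5, 0.0, 0.75, 3.0]
assert all_close(candidate_fp16(xs, 0.3, 0.1), reference(xs, 0.3, 0.1))
```

Exact equality is not a useful check here, since fp16 rounding alone introduces small differences; the tolerance has to be loose enough to absorb that and tight enough to catch a wrong fusion.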

haolu7 · 3y ago · 9 in thread

AITemplate-PyTorch Stable Diffusion is the fastest Stable Diffusion inference solution, pushing image generation below one second on an A100 for the first time (batch 1: 0.7s / 25 steps, 1.3s / 50 steps; batch 3: 1.6s / 25 steps, per image 0.55s; batch 16: 7.9s / 25 steps, per image 0.49s). It is 2.57x faster than Keras' XLA-based GPU compilation solution.

More benchmark numbers and repro at: https://github.com/facebookincubator/AITemplate/tree/main/ex...

Llamamoe · 3y ago

Wow. Considering that with the better samplers you can reduce steps to 10-15, this is getting close to near-instant results.

One or two more optimizations and we're gonna have live-update results.

tveita · 3y ago

This lists "OOM" for PyTorch on an RTX 3080-10GB, but I believe people have optimized the PyTorch SD model to run on even 6GiB GPUs.

Would AITemplate be able to run with those constraints?

ipiszy · 3y ago

RTX 3080-10GB should work. You could check https://github.com/facebookincubator/AITemplate/tree/main/ex..., and https://www.reddit.com/r/StableDiffusion/comments/xv7m89/met....

PresentHarmony · 3y ago

Or, to count it another way: how many pictures can it generate in one second with these parameters? It could be 1.05, 1.1, 1.5, or even 2 pictures. Thank you very much for your post! I will be very grateful for the answer!

PresentHarmony · 3y ago

Can you please elaborate: how many milliseconds does it take to generate 1 image with these wonderful improvements? I will be very grateful for your answer! Thank you very much!

PresentHarmony · 3y ago

Do I understand correctly that it takes 0.55 or 0.49 seconds to generate an image, depending on the batch size?

Thank you so much for your post! I would be very grateful for the response!

ipiszy · 3y ago

Yes, this is correct. Batch 16, 7.9s / 25 steps, per image 0.49s: it generates 16 images for each prompt within 7.9s, so it's 0.49s per image.
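The per-image and images-per-second figures asked about in this sub-thread follow directly from the quoted totals. A small sketch of the arithmetic, using only the batch sizes and timings from the post (the function names are hypothetical):

```python
# (batch size, total seconds at 25 steps) as quoted in the post above
runs = [(1, 0.7), (3, 1.6), (16, 7.9)]


def per_image_seconds(batch: int, total_seconds: float) -> float:
    """Average wall-clock time attributed to each image in the batch."""
    return total_seconds / batch


def images_per_second(batch: int, total_seconds: float) -> float:
    """Throughput: how many images are finished per second of wall-clock time."""
    return batch / total_seconds


for batch, total in runs:
    print(
        f"batch {batch:>2}: "
        f"{per_image_seconds(batch, total):.2f} s/image, "
        f"{images_per_second(batch, total):.2f} images/s"
    )
```

Batch 16 works out to about 0.49 s per image, or roughly 2 images per second, while a single image at batch 1 still takes the full 0.7 s. Batch 3 divides to about 0.53 s, slightly below the quoted 0.55 s, presumably because the quoted totals are rounded.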

PresentHarmony · 3y ago

One more question, if you don't mind. One image is generated in 0.7 seconds (25 steps), and the same single image with 50 steps is generated in 1.3 seconds. So it's much cheaper to generate more images for the same prompt. Am I right, or am I missing something? Thanks in advance for your answer.

P.S. Though it should be 1.4 seconds (0.7 × 2 = 1.4), if you assume twice the steps takes twice the time.
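One way to read the 1.3 s versus 1.4 s gap is that part of the time does not scale with the step count. The sketch below fits a straight line through the two quoted batch-1 timings; the linear model is an assumption made for illustration, not something the authors state, and the function names are hypothetical.

```python
def fit_linear(steps_a: int, time_a: float, steps_b: int, time_b: float):
    """Fit time = overhead + per_step * steps through two measurements."""
    per_step = (time_b - time_a) / (steps_b - steps_a)
    overhead = time_a - per_step * steps_a
    return overhead, per_step


# Batch-1 timings quoted in the thread: 0.7 s at 25 steps, 1.3 s at 50 steps.
overhead, per_step = fit_linear(25, 0.7, 50, 1.3)


def predict(steps: int) -> float:
    """Estimated seconds for one image under the ASSUMED linear model."""
    return overhead + per_step * steps


print(f"fixed overhead ~{overhead:.2f} s, per step ~{per_step * 1000:.0f} ms")
print(f"estimated time at 15 steps: ~{predict(15):.2f} s")
```

Under this reading, about 0.1 s is a one-off cost per image and each step adds about 24 ms, which is why doubling the steps adds 0.6 s rather than 0.7 s. With only two data points this is a back-of-the-envelope estimate, not a measurement.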

PresentHarmony · 3y ago

Thank you indeed, my friend!

ghoomketu · 3y ago · 2 in thread

For all the hate that Facebook gets, their one redeeming quality is these open source projects they have been releasing all along.

Maybe this is to attract better engineers, but all in all this has been a net positive for software development. So credit where it is due.

version_five · 3y ago

Yes, it's hard to know whether the overall contribution of these advertising companies (FB and Google mainly) is a net positive on balance, but their contribution to ML research is unmatched and has created an insane amount of value (I'd speculate rivaling their market caps, but someone can probably prove me wrong) in business and research that uses the tools they've built.

ETH_start · 3y ago

The net impact of these companies is massively positive. Facebook, with its trust-engendering social graph, enables huge numbers of businesses and social groups to exist that otherwise couldn't while Google has enabled so much information discovery that we just take for granted now.

Of course I would argue there's a better way to provide these kinds of services that concentrates power less, and that's decentralization with cryptoeconomic incentives to maintain consensus, but for their generation, they did well.

azurezyq (OP) · 3y ago · 1 in thread

https://github.com/facebookincubator/AITemplate

yinghai83 · 3y ago

Very impressive results!

devcat · 3y ago · 1 in thread

Sadly, it doesn't have an Apple GPU backend.

mbroncano · 3y ago

It mentions it is in the works

throwaway81523 · 3y ago · 1 in thread

Tldr?

theflyingelvis · 3y ago

Unfortunately your comment was too long. I didn’t read it. Try being more succinct next time.

house_road · 3y ago

It supports both Nvidia and AMD, and both get pretty good speedups. This is a great achievement!

enoch2090 · 3y ago

How would this perform compared with TensorFlow?
