Run Llama2-70B in Web Browser with WebGPU Acceleration

(webllm.mlc.ai)

9 points · ruihangl · 2y ago · 6 comments

6 comments · 2 top-level

brucethemoose2 · 2y ago · 4 in thread

Apache TVM is super cool in theory. It's fast thanks to the autotuning, and it supports tons of backends like Vulkan, Metal, WASM + WebGPU, FPGAs, weird mobile accelerators and such. It supports quantization, dynamism and other cool features.

But... It isn't used much outside MLC? And MLC's implementations are basically demos.

I dunno why. AI inference communities are dying for fast multiplatform backends without the fuss of PyTorch.

crowwork · 2y ago

Check out the latest docs (https://mlc.ai/mlc-llm/docs/). MLC started with demos and has lately evolved, with API integrations and documentation, into an inference solution that everyone can reuse for universal deployment.

brucethemoose2 · 2y ago

It's been a while since I looked into this, thanks.

As a random aside, I hope y'all publish an SDXL repo for local (non-WebGPU) inference. SDXL is too compute-heavy to split/offload to CPU like llama.cpp does, but less RAM-heavy than LLMs, and I'm thinking it would benefit from TVM's "easy" quantization.

It would be a great backend to hook into the various web UIs, maybe with the secondary model loaded on an IGP.

junrushao1994 · 2y ago

I don't think TVM has advertised its full capabilities much, at least in the past few years: for example, high-perf codegen for dynamic shapes without auto-tuning, or auto-tuning-based codegen. That might be one of the reasons it doesn't get a lot of visibility.

brucethemoose2 · 2y ago

I think this is true of AI compilation in general. Torch-MLIR, AITemplate and really everything listed here fly under the radar:

https://github.com/merrymercy/awesome-tensor-compilers#open-...

ruihangl · OP · 2y ago

Running purely in the web browser. Generating 6.2 tok/s on an Apple M2 Ultra with 64GB of memory.