
0 points · behnamoh · 1y ago · 0 comments

> 1.58bit quantization

Of course we can run any model if we quantize it enough, but I think the OP was talking about the unquantized version.



danielhanchen · 1y ago · 1 in thread

Oh, you can still run them unquantized! See https://docs.unsloth.ai/basics/llama-4-how-to-run-and-fine-t... where we show you can offload all MoE layers to system RAM and leave the non-MoE layers on the GPU. The speed is still pretty good!

You can do it via `-ot ".ffn_.*_exps.=CPU"`
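To make the pattern concrete, here is a rough sketch of which tensors it would select. It assumes the part before `=` is applied as an unanchored regex search over tensor names and the part after `=` names the target device. The tensor names below are hypothetical examples in a GGUF-style naming scheme, not taken from a real model file, and `placement` is a helper made up for illustration.

```python
import re

# Regex half of the -ot argument above; "CPU" after the "=" is the target device.
pattern = re.compile(r".ffn_.*_exps.")

# Hypothetical tensor names (assumed naming scheme, for illustration only).
tensor_names = [
    "blk.0.attn_q.weight",          # attention weights
    "blk.0.ffn_gate_exps.weight",   # MoE expert weights
    "blk.0.ffn_up_exps.weight",
    "blk.0.ffn_down_exps.weight",
    "blk.0.ffn_gate_shexp.weight",  # shared expert: no "_exps" in the name
    "output.weight",
]

def placement(name: str) -> str:
    """Return "CPU" if the override pattern matches the name, else "GPU"."""
    return "CPU" if pattern.search(name) else "GPU"

for name in tensor_names:
    print(f"{name:32s} -> {placement(name)}")
```

Under these assumptions only the `*_exps` expert tensors land in system RAM, and everything else stays on the GPU.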

Thanks, I'll try it! I guess "mixing" GPU+CPU would hurt the perf tho.
