- Maximum intelligence per GB of VRAM (you don't have much)
- Dense models can use MTP (multi-token prediction) to get an almost 2x decode speedup (i.e., a 27B dense model with MTP decodes at about the same speed as a MoE model with 14B active params would). This matters because local LLM use rarely has parallel streams to batch together.
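
The 27B-vs-14B comparison above can be sanity-checked with a rough bandwidth-bound estimate. This is only a sketch: it assumes decode speed is limited by reading every active weight once per forward pass, and it ignores KV-cache reads and the extra compute of the MTP heads. The 1000 GB/s bandwidth, 4-bit weights (0.5 bytes per param), and 1.9 tokens per pass are illustrative assumptions, not measurements, and the function name is made up for this example.

```python
def decode_tokens_per_second(active_params_b, bytes_per_param,
                             bandwidth_gb_s, tokens_per_pass=1.0):
    """Rough bandwidth-bound decode estimate.

    active_params_b: active parameters in billions
    bytes_per_param: e.g. 0.5 for 4-bit quantized weights
    bandwidth_gb_s:  memory bandwidth in GB/s
    tokens_per_pass: average tokens produced per forward pass (1.0 without MTP)
    """
    # Billions of params times bytes per param gives GB read per forward pass.
    gb_per_pass = active_params_b * bytes_per_param
    return bandwidth_gb_s / gb_per_pass * tokens_per_pass


# Assumed setup: 1000 GB/s GPU, 4-bit weights, MTP yielding ~1.9 tokens per pass.
BANDWIDTH = 1000.0
BYTES_PER_PARAM = 0.5

dense_plain = decode_tokens_per_second(27, BYTES_PER_PARAM, BANDWIDTH)
dense_mtp = decode_tokens_per_second(27, BYTES_PER_PARAM, BANDWIDTH, tokens_per_pass=1.9)
moe_14b_active = decode_tokens_per_second(14, BYTES_PER_PARAM, BANDWIDTH)

print(f"27B dense, no MTP: {dense_plain:.0f} tok/s")
print(f"27B dense, MTP:    {dense_mtp:.0f} tok/s")
print(f"MoE, 14B active:   {moe_14b_active:.0f} tok/s")
```

Under these assumptions the dense model with MTP and the 14B-active MoE land within a few percent of each other. Batching would change the picture, but a single local stream gets no help from it.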

When running on large unified-memory machines like Strix Halo or DGX Spark, MoE models are usually best:
- By throwing RAM at the problem, you can match the intelligence of a smaller dense model while using fewer active params per token, which compensates for the slower memory.
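
The RAM-for-speed trade can be sketched the same way. The numbers here are assumptions for illustration: roughly 256 GB/s of bandwidth and 128 GB of unified memory (in the ballpark of these machines), 4-bit weights, and a hypothetical MoE with 120B total and 12B active params compared against a 27B dense model. Whether such a MoE actually matches the dense model's quality is not something this arithmetic can show.

```python
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    total_params_b: float   # all weights that must sit in memory, in billions
    active_params_b: float  # weights read per decoded token, in billions

    def weight_gb(self, bytes_per_param: float) -> float:
        # Memory footprint is set by total params.
        return self.total_params_b * bytes_per_param

    def decode_tps(self, bandwidth_gb_s: float, bytes_per_param: float) -> float:
        # Bandwidth-bound decode speed is set by active params only.
        return bandwidth_gb_s / (self.active_params_b * bytes_per_param)


# Assumed unified-memory box and quantization (illustrative, not measured).
MEMORY_GB = 128.0
BANDWIDTH_GB_S = 256.0
BYTES_PER_PARAM = 0.5

dense = Model("27B dense", total_params_b=27, active_params_b=27)
moe = Model("hypothetical 120B MoE, 12B active", total_params_b=120, active_params_b=12)

for m in (dense, moe):
    size = m.weight_gb(BYTES_PER_PARAM)
    tps = m.decode_tps(BANDWIDTH_GB_S, BYTES_PER_PARAM)
    fits = "fits" if size <= MEMORY_GB else "does not fit"
    print(f"{m.name}: {size:.1f} GB weights ({fits}), ~{tps:.0f} tok/s")
```

The MoE costs several times the memory but reads less than half as many weights per token, which is the trade that suits a machine with plenty of RAM and modest bandwidth.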