Makes sense. I ultimately need to train the weights, so I was focusing on GPTQ; I'll try out ggml and see if the latency is better. I have some flexibility on whether I run inference and training on the same model instance. What context length were you using? I was maxing out at ~2048 tokens, which may also explain the apparent latency.