I'm grateful for anyone's contributions to anything, but I kinda shake my head about ollama. The reason stuff like this happens is that they do the absolute minimum necessary to get the latest model running, not working.
I make a llama.cpp wrapper myself, and it's somewhat frustrating putting in the effort for everything from big, obvious UX things, like erroring when the context is too small for your input instead of just making you think the model is crap, to long-haul engineering commitments, like integrating new models with llama.cpp's new tool-calling infra and testing them to make sure they, well, actually work.
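The context-size check is simple in spirit. Here's a minimal sketch of the idea; `count_tokens`, `check_fits`, and `ContextTooSmallError` are hypothetical names, and the whitespace-split tokenizer is a placeholder for the model's real one:

```python
class ContextTooSmallError(ValueError):
    """Raised instead of silently truncating the prompt."""


def count_tokens(prompt: str) -> int:
    # Placeholder: whitespace split. A real wrapper would call the
    # model's own tokenizer (e.g. via llama.cpp) here instead.
    return len(prompt.split())


def check_fits(prompt: str, n_ctx: int, n_reserve: int = 256) -> int:
    """Fail loudly when the prompt leaves no room for generation.

    n_reserve is the headroom kept for the model's output tokens.
    """
    n_prompt = count_tokens(prompt)
    if n_prompt + n_reserve > n_ctx:
        raise ContextTooSmallError(
            f"prompt is {n_prompt} tokens but context is only {n_ctx}; "
            f"raise n_ctx or shorten the input"
        )
    return n_prompt
```

The point isn't the arithmetic, it's surfacing the failure to the user instead of letting a silently truncated prompt make the model look broken.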
I keep telling myself that this sort of effort pays off a year or two down the road, once all that day-to-day differentiation in effort adds up. I hope :/