undefined | Better HN

0 pointscodemog18d ago0 comments

Can someone explain how a 27B model (quantized no less) ever be comparable to a model like Sonnet 4.0 which is likely in the mid to high hundreds of billions of parameters?

Is it really just more training data? I doubt it’s architecture improvements, or at the very least, I imagine any architecture improvements are marginal.

0 comments

alecco18d ago

AFAIK post-training and distillation techniques advanced a lot in the past couple of years. SOTA big models get new frontier and within 6 months it trickles down to open models with 10x less parameters.

And mind the source pre-training data was not made/written for training LLMs, it's just random stuff from Internet, books, etc. So there's a LOT of completely useless an contradictory information. Better training texts are way better and you can just generate & curate from those huge frontier LLMs. This was shown in the TinyStories paper where GPT-4 generated children's stories could make models 3 orders of magnitude smaller achieve quite a lot.

This is why the big US labs complain China is "stealing" their work by distilling their models. Chinese labs save many billions in training with just a bunch of accounts. (I'm just stating what they say, not giving my opinion).

otabdeveloper418d ago

There's diminishing returns bigly when you increase parameter count.

The sweet spot isn't in the "hundreds of billions" range, it's much lower than that.

Anyways your perception of a model's "quality" is determined by careful post-training.

codemogOP18d ago

Interesting. I see papers where researchers will finetune models in the 7 to 12b range and even beat or be competitive with frontier models. I wish I knew how this was possible, or had more intuition on such things. If anyone has paper recommendations, I’d appreciate it.

stavros18d ago

They're using a revolutionary new method called "training on the test set".

1 more reply

zozbot23418d ago

More parameters improves general knowledge a lot, but you have to quantize more in order to fit in a given amount of memory, which if taken to extremes leads to erratic behavior. For casual chat use even Q2 models can be compelling, agentic use requires more regularization thus less quantized parameters and lowering the total amount to compensate.

spwa418d ago

The short answer is that there are more things that matter than parameter count, and we are probably nowhere near the most efficient way to make these models. Also: the big AI labs have shown a few times that internally they have way more capable models

girvo18d ago

Considering the full fat Qwen3.5-plus is good, but barely Sonnet 4 good in my testing (but incredibly cheap!) I doubt the quantised versions are somehow as good if not better in practice.

rustyhancock18d ago

I think it depends on work pattern.

Many do not give Sonnet or even Opus full reign where it really pushes ahead of over models.

If you're asking for tightly constrained single functions at a time it really doesn't make a huge difference.

I.e. the more vibe you do the better you need the model especially over long running and large contexts. Claude is heading and shoulders above everyone else in that setting.

girvo18d ago

>I.e. the more vibe you do the better you need the model especially over long running and large contexts

For sure, but the coolest thing about qwen3.5-plus is the 1mil context length on a $3 coding plan, super neat. But the model isn't really powerful enough to take real advantage of it I've found. Still super neat though!

stavros18d ago

When you say Sonnet 4, do you mean literally 4, or 4.6?

girvo18d ago

It's not as capable as Sonnet 4.6 in my usage over the past couple days, through a few different coding harnesses (including my own for-play one[0], that's been quite fun).

[0] https://github.com/girvo/girvent/

1 more reply

revolvingthrow18d ago

It doesn’t. I’m not sure it outperforms chatgpt 3

BoredomIsFun18d ago

You are not being serrious, are you? even 1.5 years old Mistral and Meta models outperform ChatGPT 3.

gunalx18d ago

3 not 3.5? I think I would even prefer the qwen3.5 0.8b over GPT 3.

j / k navigate · click thread line to collapse

0 comments

alecco18d ago

otabdeveloper418d ago

There's diminishing returns bigly when you increase parameter count.

The sweet spot isn't in the "hundreds of billions" range, it's much lower than that.

Anyways your perception of a model's "quality" is determined by careful post-training.

codemogOP18d ago

stavros18d ago

They're using a revolutionary new method called "training on the test set".

1 more reply

zozbot23418d ago

spwa418d ago

girvo18d ago

Considering the full fat Qwen3.5-plus is good, but barely Sonnet 4 good in my testing (but incredibly cheap!) I doubt the quantised versions are somehow as good if not better in practice.

rustyhancock18d ago

I think it depends on work pattern.

Many do not give Sonnet or even Opus full reign where it really pushes ahead of over models.

If you're asking for tightly constrained single functions at a time it really doesn't make a huge difference.

I.e. the more vibe you do the better you need the model especially over long running and large contexts. Claude is heading and shoulders above everyone else in that setting.

girvo18d ago

>I.e. the more vibe you do the better you need the model especially over long running and large contexts

stavros18d ago

When you say Sonnet 4, do you mean literally 4, or 4.6?

girvo18d ago

It's not as capable as Sonnet 4.6 in my usage over the past couple days, through a few different coding harnesses (including my own for-play one[0], that's been quite fun).

[0] https://github.com/girvo/girvent/

1 more reply

revolvingthrow18d ago

It doesn’t. I’m not sure it outperforms chatgpt 3

BoredomIsFun18d ago

You are not being serrious, are you? even 1.5 years old Mistral and Meta models outperform ChatGPT 3.

gunalx18d ago

3 not 3.5? I think I would even prefer the qwen3.5 0.8b over GPT 3.

j / k navigate · click thread line to collapse