Better HN

0 points · Miraste · 2mo ago · 0 comments

What? 35B-A3B is not nearly as smart as 27B.

8 comments · 3 top-level

ekianjo · 2mo ago · 3 in thread

Yeah, the 27B feels like something completely different. On long-context tasks it performs WAY better than the 35B-A3B.

Der_Einzige · 2mo ago

I've been telling analysts/investors for a long time that dense architectures aren't "worse" than sparse MoEs and to continue to anticipate the see-saw of releases on those two sub-architectures. Glad to continuously be vindicated on this one.

For those who don't believe me: go take a look at the logprobs of a MoE model and a dense model and let me know if you can notice anything. Researchers sure did.
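One way to act on that suggestion is to reduce each model's logprobs to a single number you can compare. The comment doesn't say what to look for, so this is only a minimal sketch under assumptions: it supposes you already have, for each generated token, a list of top-k logprobs from each model (how you obtain them depends on your serving stack), and it computes the mean entropy of the renormalized top-k distribution. The function names are hypothetical.

```python
import math
from typing import List

def topk_entropy(logprobs: List[float]) -> float:
    """Shannon entropy (in nats) of a top-k distribution, renormalized to sum to 1."""
    probs = [math.exp(lp) for lp in logprobs]
    total = sum(probs)
    probs = [p / total for p in probs]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(per_token_logprobs: List[List[float]]) -> float:
    """Average top-k entropy over all token positions in one generation."""
    return sum(topk_entropy(lps) for lps in per_token_logprobs) / len(per_token_logprobs)
```

Running `mean_entropy` on the same prompts for a dense and a MoE model gives one rough point of comparison. Because the distribution is truncated to top-k, the number understates the true entropy.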

reissbaker · 2mo ago

Dense is (much) worse in terms of training budget. At inference time, dense is somewhat more intelligent per bit of VRAM, but much slower, so for a given compute budget it's still usually worse in terms of intelligence-per-dollar even ignoring training cost. If you're willing to spend more you're typically better off training and running a larger sparse model rather than training and running a dense one.

Dense is nice for local model users because they only need to serve a single user and VRAM is expensive. For the people training and serving the models, though, dense is really tough to justify. You'll see small dense models released to capitalize on marketing hype from local model fans but that's about it. No one will ever train another big dense model: Llama 3.1 405B was the last of its kind.
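The VRAM-versus-speed tradeoff can be made concrete with back-of-the-envelope arithmetic. This is a rough sketch, not a benchmark: it assumes 16-bit weights (2 bytes per parameter), uses the common approximation of about 2 FLOPs per active parameter per generated token, and ignores attention cost, KV cache, and memory bandwidth. The parameter counts are read off the model names in this thread.

```python
def weight_memory_gb(total_params_b: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight memory in GB: every parameter must be resident, active or not."""
    return total_params_b * bytes_per_param

def flops_per_token_g(active_params_b: float) -> float:
    """Rough forward-pass cost in GFLOPs per token: ~2 FLOPs per active parameter."""
    return 2.0 * active_params_b

# Parameter counts in billions, taken from the model names (assumption: names are exact).
models = {
    "dense 27B": {"total_b": 27.0, "active_b": 27.0},
    "MoE 35B-A3B": {"total_b": 35.0, "active_b": 3.0},
}

for name, m in models.items():
    print(f"{name}: ~{weight_memory_gb(m['total_b']):.0f} GB weights at 16-bit, "
          f"~{flops_per_token_g(m['active_b']):.0f} GFLOPs/token")
```

Under these assumptions the MoE needs about 30% more memory for weights but roughly 9x less compute per token, which is the shape of the tradeoff described above: dense wins per bit of VRAM, sparse wins per unit of compute.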

naasking · 2mo ago

MoE isn't inherently better, but I do think it's still an underexplored space. When your sparse model can do 5 runs on the same prompt in the time a dense model takes to generate one, that opens up all sorts of interesting possibilities.
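One of those possibilities is majority voting over several samples. The sketch below is illustrative only: `generate` is a hypothetical stand-in for whatever function calls your model, and it assumes answers can be compared as exact strings.

```python
from collections import Counter
from typing import Callable, List

def majority_vote(generate: Callable[[str], str], prompt: str, n: int = 5) -> str:
    """Sample n answers for the same prompt and return the most common one."""
    answers: List[str] = [generate(prompt) for _ in range(n)]
    # most_common sorts by count; ties keep first-seen order.
    return Counter(answers).most_common(1)[0][0]

# Usage with a canned stand-in for a model (hypothetical outputs):
canned = iter(["42", "41", "42", "42", "40"])
print(majority_vote(lambda _prompt: next(canned), "What is 6 * 7?"))  # prints 42
```

Whether five cheap samples beat one expensive one depends on the task. Voting helps most when answers are short and checkable, and much less for free-form text.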

stratos123 · 2mo ago · 2 in thread

One interesting thing about Qwen3 is that looking at the benchmarks, the 35B-A3B models seem to be only a bit worse than the dense 27B ones. This is very different from Gemma 4, where the 26B-A4B model is much worse on several benchmarks (e.g. Codeforces, HLE) than 31B.

zozbot234 · 2mo ago

> This is very different from Gemma 4, where the 26B-A4B model is much worse on several benchmarks (e.g. Codeforces, HLE) than 31B.

Wouldn't you totally expect that, since 26B-A4B is lower on both total and active params? The more sensible comparison would pit Qwen 27B against Gemma 31B and Gemma 26B-A4B against Qwen 35B-A3B.

Hugsun · 2mo ago

They're comparing Qwen's MoE vs dense (smaller difference) against Gemma's MoE vs dense (bigger difference). Your proposed alternative misses the point.

zkmon · 2mo ago

Yes.
