Cohere's First Model for Developers

(cohere.com)

146 points · hmokiguess · 13d ago · 41 comments

25 comments · 8 top-level

zuzululu · 8d ago · 5 in thread

Wasn't aware that Cohere was still around but this release doesn't exactly instill confidence.

greyb · 8d ago

> Wasn't aware that Cohere was still around but this release doesn't exactly instill confidence.

It's being kept alive because the Canadian government is desperate to have a local frontier lab and is willing to inject funding and force its adoption in government services, but leadership at Cohere is known to be weak in Canadian tech circles, and they are pivoting to an enterprise-first market around production RAG rather than anything close to frontier work.

I'm glad they're doing open weight releases but they're not viable in the long run. It is embarrassing sharing similar spaces with them, but I'll try this release out in OpenCode and re-think afterwards.

suddenlybananas · 8d ago

It's embarrassing? Awfully harsh!

daijj · 8d ago

Mulling over applying there to work. Hearing a bunch of mixed reviews where some people also complain about leadership but the day to day seems to be quite good. Any reason big US investors haven't put any money into it? (besides the fact that it's Canadian?)

redwood · 8d ago

Aren't they focused on embeddings and strong there?

kadoban · 8d ago

Really? Why not? From the benchmarks at least it's a pretty decent small model.

cyanydeez · 10d ago · 4 in thread

looks like it's just qwen 3.6 coder.

lumost · 8d ago

It's worse at code compared to qwen 3.6 coder.

stymaar · 8d ago

How can it be worse than something that doesn't exist?

SubiculumCode · 8d ago

Do you mean it's based on qwen 3.6 coder?

daemonologist · 8d ago

There is no "coder" version of Qwen 3.6; I think they just mean it's a coding-focused model of similar size and performance (to Qwen 3.6 35B-A3B).

Regular Qwen 3.6 benchmarks slightly better and has much wider software support though, so this is probably of interest only to organizations which disallow models trained in China.

matt_daemon · 8d ago · 3 in thread

> Hardware (minimum): 1× H100 @ FP8

Cool to see this but seems like it would be pretty expensive to run

anon373839 · 8d ago

This is a 30B-parameter model with 3B active. It should run performantly on a Mac with > 48GB RAM at 8-bit precision.
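
As a rough check on that sizing, here is a minimal sketch of the weights-only memory footprint at a few precisions. The 30B figure comes from the comment; the helper name is made up for illustration, and the estimate deliberately ignores KV cache, activations, and runtime overhead.

```python
def weight_memory_gb(total_params: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 30e9  # "30B parameter model" from the comment; all experts must be resident

for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(TOTAL_PARAMS, bits):.0f} GB of weights")
```

At 8-bit that is about 30 GB for weights, so a machine with more than 48 GB of unified memory leaves headroom for the KV cache and the OS. Note that the 3B active parameters affect speed, not this footprint.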

ltononro · 8d ago

Well that is like 3 USD/hour if you run it on a rented GPU

yencabulator · 8d ago

4-bit quantized 30B-A3B MoE models can run at something like 21 tokens/sec on a several-year-old AMD CPU.
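
A back-of-the-envelope sketch of why a figure like that is plausible: single-stream decoding tends to be limited by how fast the active weights can be read from memory. The function name and the 50 GB/s bandwidth below are assumptions (roughly dual-channel DDR4 territory), not measurements from the thread, and the bound ignores KV cache reads and compute.

```python
def decode_tokens_per_sec_upper_bound(
    active_params: float, bits_per_param: float, mem_bandwidth_gb_s: float
) -> float:
    """Upper bound on decode speed if every token must read all active weights once."""
    bytes_per_token = active_params * bits_per_param / 8
    return mem_bandwidth_gb_s * 1e9 / bytes_per_token


ACTIVE_PARAMS = 3e9          # "A3B": about 3B parameters active per token
ASSUMED_BANDWIDTH_GB_S = 50  # hypothetical desktop memory bandwidth

bound = decode_tokens_per_sec_upper_bound(ACTIVE_PARAMS, 4, ASSUMED_BANDWIDTH_GB_S)
print(f"bandwidth-bound ceiling: ~{bound:.0f} tokens/sec")
```

Under those assumptions the ceiling is in the low 30s of tokens/sec, so an observed rate around 21 tokens/sec would sit comfortably below it.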

amunozo · 8d ago · 2 in thread

Are these models trained from scratch or do they necessarily need distillation from bigger models to be competitive? Usually a small model like this is part of a family with a bigger model. In the first case, does anybody know how the economics of training this 30B-A3B model compare with training models the size of DeepSeek V4 Pro or Flash (1.6T and 200-something B parameters, with far fewer activated)?

namr2000 · 8d ago

You don't have to train from scratch but you can. Distillation ends up being somewhere in the ballpark of 1000x faster to train [1]. It also comes with the huge advantage of not needing to create RLHF datasets, since you can just copy the behavior of the teacher model. This saves an enormous amount of labeling money at the cost of making the model behave similarly to the teacher. If you are training from scratch, you can look at LLM scaling laws to figure out roughly the compute budget you need to optimally train a model [2].

Based on [2] a 30B model needs something like 2e+23 FLOPs to train from scratch whereas a 1.6T model needs something like 1e+27 FLOPs to train. So DeepSeek V4 Pro was roughly 5000x more expensive to train than this model. I'm not totally sure how MoE affects scaling laws, so these numbers might be different in reality, but it gives you a good ballpark estimate of the difference in training scale.

[1] https://arxiv.org/abs/2505.12781
[2] https://arxiv.org/abs/2203.15556
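
For anyone who wants to redo that arithmetic, here is a minimal sketch using the common rule of thumb associated with [2]: training compute C ≈ 6·N·D, with a compute-optimal token count D of roughly 20 tokens per parameter. Both constants are approximations, the function name is illustrative, and the sketch treats both models as dense, which the comment already flags as a caveat for MoE.

```python
def chinchilla_train_flops(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Estimate training FLOPs as 6 * N * D, with D = tokens_per_param * N."""
    n_tokens = tokens_per_param * n_params
    return 6.0 * n_params * n_tokens


small = chinchilla_train_flops(30e9)    # 30B total parameters
large = chinchilla_train_flops(1.6e12)  # 1.6T total parameters

print(f"30B:   ~{small:.1e} FLOPs")
print(f"1.6T:  ~{large:.1e} FLOPs")
print(f"ratio: ~{large / small:.0f}x")
```

With these constants the estimates come out near 1e23 and 3e26 FLOPs, a ratio of a few thousand, which is in the same rough ballpark as the figures quoted above. Because D scales with N, the ratio is simply the square of the parameter ratio; models trained on more tokens than this rule suggests will cost more.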

amunozo · 7d ago

Thank you for taking the time, this is a very useful and complete answer.

moojacob · 8d ago · 2 in thread

I was a fan of Cohere's general purpose LLM. Command A I think? Before they came out with their reasoning model.

More competition is better.

SubiculumCode · 8d ago

I always forget the VRAM requirements on these MoE things

sipjca · 8d ago

FWIW, because of the relatively few activated params, offloading to system RAM is quite feasible; you can see plenty of people doing this on r/LocalLLaMA with Qwen 3.6 35B-A3B.
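
A small sketch of the arithmetic behind that: all weights have to live somewhere, but each token only reads the active slice. The 35B/3B sizes come from the comment; the 4-bit quantization, the 8 GB VRAM budget, and the function name are hypothetical, and real runtimes choose which tensors to offload rather than splitting by raw size.

```python
def offload_plan(
    total_params: float, active_params: float, bits_per_param: float, vram_budget_gb: float
) -> dict:
    """Split weights between VRAM and system RAM and report per-token weight reads."""
    bytes_per_param = bits_per_param / 8
    weights_gb = total_params * bytes_per_param / 1e9
    in_vram_gb = min(weights_gb, vram_budget_gb)
    return {
        "weights_gb": weights_gb,
        "in_vram_gb": in_vram_gb,
        "in_system_ram_gb": weights_gb - in_vram_gb,
        "read_per_token_gb": active_params * bytes_per_param / 1e9,
    }


# Assumed setup: 35B total, 3B active, 4-bit weights, 8 GB of usable VRAM.
plan = offload_plan(35e9, 3e9, 4, 8)
for key, value in plan.items():
    print(f"{key}: {value:.1f}")
```

In this example about 9.5 GB of weights spill into system RAM, yet each token touches only around 1.5 GB of weights in total, which is why the slowdown from offloading is tolerable for sparse MoE models.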

AbuAssar · 8d ago · 1 in thread

Strange, I already submitted the same URL 6 days ago:

https://news.ycombinator.com/item?id=48475095

mkl · 8d ago

What's strange? Yours got no comments, so another attempt seems okay. It's pretty random what gets to the front page when.

montroser · 8d ago

Well, this is certainly not benchmaxxed, I'll give it that. And props for being honest about how far behind Qwen 3.6 MoE this model is.

But yeah, it's not the best look to have to stretch and say it's "competitive" with other models in its weight class, when it offers not much else that's useful or novel.

tonyrice · 8d ago

I'm excited to see more OSS models

