Cerebras Inference now 3x faster: Llama3.1-70B breaks 2,100 tokens/s

(cerebras.ai)

147 points · campers · 1y ago · 84 comments

obviyus · 1y ago · 4 in thread

Wonder if they'll eventually release Whisper support. Groq has been great for transcribing 1hr+ calls at a significantly lower price than OpenAI ($0.04/hr vs. OpenAI's $0.36/hr).

Arn_Thor · 1y ago

Whisper runs so well locally on any hardware I’ve thrown at it, why run it in the cloud?

swores · 1y ago

Does it run well on CPU? I've used it locally but only with my high end (consumer/gaming) GPU, and haven't got round to finding out how it does on weaker machines.

obviyus · 1y ago

That's pretty much exactly how I started. Ran whisper.cpp locally for a while on a 3070Ti. It worked quite well when n=1.

For our use case, we may get 1 audio file at a time, we may get 10. Of course queuing them is possible but we decided to prioritize speed & reliability over self hosting.

BrunoJo · 1y ago

https://Lemonfox.ai is another alternative to OpenAI's Whisper API if you need support for word-level timestamps and diarization.

asabla · 1y ago · 4 in thread

Damn, that's some impressive speeds.

At that rate it doesn't matter if the first try produces an unwanted answer; you'll be able to run it once or twice more in quick succession.

I hope their hardware stays relevant as this field continues to evolve.

tjoff · 1y ago

The biggest time sink for me is validating answers so not sure I agree on that take.

Fast iteration is a killer feature, for sure, but at this time I'd rather focus on quality for it to be worth the effort.

vineyardmike · 1y ago

If you're using an LLM as a compressed version of a search index, you'll be constantly fighting hallucinations. Respectfully, you're not thinking big-picture enough.

There are LLMs today that are amazing at coding, and when you allow it to iterate (eg. respond to compiler errors), the quality is pretty impressive. If you can run an LLM 3x faster, you can enable a much bigger feedback loop in the same period of time.

There are efforts to enable LLMs to "think" by using Chain-of-thought, where the LLM writes out reasoning in a "proof" style list of steps. Sometimes, like with a person, they'd reach a dead-end logic wise. If you can run 3x faster, you can start to run the "thought chain" as more of a "tree" where the logic is critiqued and adapted, and where many different solutions can be tried. This can all happen in parallel (well, each sub-branch).

Then there are "agent" use cases, where an LLM has to take actions on its own in response to real-world situations. Speed really impacts user-perception of quality.
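
The iterate-on-compiler-errors loop described above can be sketched roughly as follows. This is a minimal illustration, not any particular tool's implementation: `generate` is a hypothetical stand-in for an LLM call, `fake_llm` is a canned stub used only so the example runs offline, and Python's built-in `compile()` plays the role of the compiler.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signature: (prompt, feedback from the last failed attempt) -> source code.
Generator = Callable[[str, Optional[str]], str]


def iterate_until_compiles(
    generate: Generator, prompt: str, max_attempts: int = 3
) -> Tuple[Optional[str], int]:
    """Ask for code, feed compiler errors back, stop at the first candidate that compiles."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        source = generate(prompt, feedback)
        try:
            compile(source, "<candidate>", "exec")  # syntax check only, nothing is executed
        except SyntaxError as err:
            feedback = f"{err.msg} (line {err.lineno})"
            continue
        return source, attempt
    return None, max_attempts


def fake_llm(prompt: str, feedback: Optional[str]) -> str:
    """Canned stub (assumption): broken on the first try, fixed once it sees an error."""
    if feedback is None:
        return "def add(a, b) return a + b"  # missing colon
    return "def add(a, b):\n    return a + b"


source, attempts = iterate_until_compiles(fake_llm, "write an add function")
print(f"compiled after {attempts} attempt(s)")
```

With a model that generates several times faster, each pass through this loop costs less wall-clock time, so more attempts fit in the same budget.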

jeswin · 1y ago

> The biggest time sink for me is validating answers so not sure I agree on that take.

But you're assuming that it'll always be validated by humans. I'd imagine that most validation (and subsequent processing, especially going forward) will be done on machines.

croes · 1y ago

Exactly, validating and rewriting the prompt are the real time-consuming tasks.

odo1242 · 1y ago · 4 in thread

What made it so much faster based on just a software update?

anon291 · 1y ago

Ex-Cerebras engineer here. The chip is very powerful and there is no 'one way' to do things. Rearchitecting data flow, changing up data layout, etc. can lead to significant performance improvements. That's just my informed speculation. There's likely more perf somewhere.

campers (OP) · 1y ago

  The first implementation of inference on the Wafer Scale Engine utilized only a fraction of its peak bandwidth, compute, and IO capacity. Today’s release is the culmination of numerous software, hardware, and ML improvements we made to our stack to greatly improve the utilization and real-world performance of Cerebras Inference.
 
  We’ve re-written or optimized the most critical kernels such as MatMul, reduce/broadcast, element wise ops, and activations. Wafer IO has been streamlined to run asynchronously from compute. This release also implements speculative decoding, a widely used technique that uses a small model and large model in tandem to generate answers faster.

germanjoey · 1y ago

They said in the announcement that they've implemented speculative decoding, so that might have a lot to do with it.

A big question is what they're using as their draft model; there's ways to do it losslessly, but they could also choose to trade off accuracy for a bigger increase in speed.

It seems they also support only a very short sequence length. (1k tokens)

bubblethink · 1y ago

Speculative decoding does not trade off accuracy. You reject the speculated tokens if the original model does not accept them, kind of like branch prediction. All these providers and third parties benchmark each other's solutions, so if there is a drop in accuracy, someone will report it. Their sequence length is 8k.
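
The accept/reject rule can be sketched in a few lines. This is a toy illustration under stated assumptions, not Cerebras's implementation: both "models" are made-up deterministic functions, decoding is greedy, and the target checks draft tokens one at a time where a real system would verify them in a single batched forward pass. Sampling-based decoding needs a more careful acceptance test than the simple equality check shown here.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # greedy next-token function over a token context


def target_model(ctx: List[int]) -> int:
    """Toy stand-in for the large model (hypothetical rule)."""
    return (ctx[-1] * 3 + len(ctx)) % 10


def draft_model(ctx: List[int]) -> int:
    """Toy stand-in for the small model: usually agrees with the target, but not always."""
    return 0 if ctx[-1] == 7 else (ctx[-1] * 3 + len(ctx)) % 10


def greedy_decode(model: Model, prompt: List[int], n: int) -> List[int]:
    out = list(prompt)
    for _ in range(n):
        out.append(model(out))
    return out[len(prompt):]


def speculative_decode(
    target: Model, draft: Model, prompt: List[int], n: int, k: int = 4
) -> Tuple[List[int], int]:
    """Return (generated tokens, number of draft tokens the target accepted)."""
    out = list(prompt)
    produced = 0
    accepted = 0
    while produced < n:
        # 1. The draft model cheaply proposes up to k tokens.
        ctx = list(out)
        proposal = []
        for _ in range(min(k, n - produced)):
            token = draft(ctx)
            proposal.append(token)
            ctx.append(token)
        # 2. The target checks each proposed token. Its own token is always
        #    the one that is kept, so the output cannot drift from the target.
        for token in proposal:
            expected = target(out)
            out.append(expected)
            produced += 1
            if expected != token:
                break  # mismatch: discard the rest of the proposal
            accepted += 1
    return out[len(prompt):], accepted


tokens, accepted = speculative_decode(target_model, draft_model, [1], 20)
print(tokens == greedy_decode(target_model, [1], 20), accepted)
```

Because every emitted token is the target model's own choice, a poor draft model only reduces the speedup; it does not change the output.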

andrewstuart · 1y ago · 4 in thread

Could someone please bring Microsoft's Bitnet into the discussion and explain how its performance relates to this announcement, if at all?

https://github.com/microsoft/BitNet

"bitnet.cpp achieves speedups of 1.37x to 5.07x on ARM CPUs, with larger models experiencing greater performance gains. Additionally, it reduces energy consumption by 55.4% to 70.0%, further boosting overall efficiency. On x86 CPUs, speedups range from 2.37x to 6.17x with energy reductions between 71.9% to 82.2%. Furthermore, bitnet.cpp can run a 100B BitNet b1.58 model on a single CPU, achieving speeds comparable to human reading (5-7 tokens per second), significantly enhancing the potential for running LLMs on local devices. "

eptcyka · 1y ago

It is an inference engine for 1bit LLMs, not really comparable.

BoorishBears · 1y ago

The novelty of the inexplicable bitnet obsession has worn off I think.

qwertox · 1y ago

IDK, they remind me of Sigma-Delta ADCs [0], which are single bit ADCs but used in high resolution scenarios.

I believe we'll get to hear more interesting things about Bitnet in the future.

[0] https://en.wikipedia.org/wiki/Delta-sigma_modulation
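
For anyone unfamiliar with how a one-bit stream can carry a high-resolution value, here is a minimal sketch of a first-order delta-sigma modulator. It only illustrates the ADC side of the analogy and says nothing about how BitNet works; the function name and the input value are made up for the example.

```python
from typing import List


def delta_sigma_bits(x: float, n: int) -> List[float]:
    """Encode a constant input x in [-1, 1] as n one-bit samples (+1.0 or -1.0)."""
    integrator = 0.0
    feedback = 0.0  # previous output bit, fed back and subtracted from the input
    bits = []
    for _ in range(n):
        integrator += x - feedback  # accumulate the error between input and output
        bit = 1.0 if integrator >= 0 else -1.0  # one-bit quantizer
        bits.append(bit)
        feedback = bit
    return bits


bits = delta_sigma_bits(0.3, 1000)
average = sum(bits) / len(bits)  # a crude low-pass filter: the plain mean
print(f"mean of {len(bits)} one-bit samples: {average:.3f}")
```

Each sample is only +1 or -1, but the running average tracks the input more closely as more samples are taken, which is the oversampling trade the comment alludes to.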

Tepix · 1y ago

We have yet to see a large model trained using it, have we?

simonw · 1y ago · 3 in thread

It turns out someone has written a plugin for my LLM CLI tool already: https://github.com/irthomasthomas/llm-cerebras

You need an API key - I got one from https://cloud.cerebras.ai/ but I'm not sure if there's a waiting list at the moment - then you can do this:

    pipx install llm # or brew install llm or uv tool install llm
    llm install llm-cerebras
    llm keys set cerebras
    # paste key here

Then you can run lightning fast prompts like this:

    llm -m cerebras-llama3.1-70b 'an epic tail of a walrus pirate'

Here's a video of that running, it's very speedy: https://static.simonwillison.net/static/2024/cerebras-is-fas...

croes · 1y ago

It has a waiting list

londons_explore · 1y ago

The "AI overview" in google search seems to be a similar speed, and the resulting text of similar quality.

simonw · 1y ago

I wonder which of their models they use. Might even be Gemini 1.5 Flash 8B which is VERY quick.

I just tried that out with the same prompt and it's fast, but not as fast as Cerebras: https://static.simonwillison.net/static/2024/gemini-flash-8b...

GavCo · 1y ago · 3 in thread

When Meta releases the quantized 70B it will give another > 2X speedup with similar accuracy: https://ai.meta.com/blog/meta-llama-quantized-lightweight-mo...

YetAnotherNick · 1y ago

You don't need quantization aware training on larger models. 4 bit 70b and 405b models exhibit close to zero degradation in output with post training quantization[1][2].

[1]: https://arxiv.org/pdf/2409.11055v1 [2]: https://lmarena.ai/
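
For readers new to the term: in its simplest form, post-training quantization rounds already-trained weights onto a small integer grid. The sketch below assumes symmetric round-to-nearest with one scale for the whole tensor. Real 4-bit schemes (per-group scales, GPTQ, AWQ and similar) are more elaborate, the weight values are made up, and nothing here reproduces the cited results.

```python
from typing import List, Tuple


def quantize_int4(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to signed 4-bit codes in [-8, 7] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0  # largest magnitude maps to +/-7
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]


weights = [0.42, -0.13, 0.07, -0.91, 0.55]  # illustrative values only
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, f"scale={scale:.4f}", f"worst error={worst_error:.4f}")
```

The per-weight error is bounded by half the scale. Whether that translates into negligible degradation in model output is the empirical question the linked sources address.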

WanderPanda · 1y ago

I wonder why that is. Because they are trained with dropout?

ipsum2 · 1y ago

Probably not. Cerebras chip only has 16bit and 32bit operators.

anonzzzies · 1y ago · 3 in thread

Demo, API?

selcuka · 1y ago

Demo: https://inference.cerebras.ai/

API: https://cloud.cerebras.ai/

aliljet · 1y ago

That's odd, attempting a prompt fails because auth isn't working.

bestest · 1y ago

I filled out a lengthy prompt in the demo. submitted it. an auth window pops up. I don't want to login. I want the demo. such a repulsive approach.

maz1b · 1y ago · 2 in thread

Cerebras really has impressed me with their technicality and their approach in the modern LLM era. I hope they do well, as I've heard they are en-route to IPO. It will be interesting to see if they can make a dent vs NVIDIA and other players in this space.

madaxe_again · 1y ago

Apparently so. You can also buy in via various PE outfits before IPO, if you so desire. I did.

Max-20 · 1y ago

Which one did you use? I am also interested in doing that.

fancyfredbot · 1y ago · 2 in thread

Wow, software is hard! Imagine an entire company working to build an insanely huge and expensive wafer scale chip and your super smart and highly motivated machine learning engineers get 1/3 of peak performance on their first attempt. When people say NVIDIA has no moat I'm going to remember this - partly because it does show that they do, and partly because it shows that with time the moat can probably be crossed...

exe34 · 1y ago

make it work, make it work right(ish), now make it fast.

fancyfredbot · 1y ago

Fast and wrong is easy!

a2128 · 1y ago · 2 in thread

I wonder at what point does increasing LLM throughput only start to serve negative uses of AI. This is already 2 orders of magnitude faster than humans can read. Are there any significant legitimate uses beyond just spamming AI-generated SEO articles and fake Amazon books more quickly and cheaply?

Workaccount2 · 1y ago

The way things are going it looks like tokens/s is going to play a big role. O1 preview devours tokens and now Anthropic computer use is devouring them too. Video generation is extremely token heavy too.

It sort of is starting to look like you can linearly boost utility by exponentially scaling token usage per query. If so we might see companies slowing on scaling parameters and instead focusing on scaling token usage.

adwn · 1y ago

How about just serving more clients in parallel? I don't see why human reading-speed should pose any kind of upper bound.

And then there are use cases like OpenAI's o1, where most tokens aren't even generated for the benefit of a human, but as input for itself.
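
For a rough sense of scale, two figures already quoted in this thread can be put side by side: the 2,100 tokens/s headline and the 5-7 tokens/s "human reading" range from the BitNet README. This is back-of-the-envelope arithmetic only, and the function name is made up for the example.

```python
import math

TOKENS_PER_S = 2100         # headline figure from the announcement title
HUMAN_READING_TPS = (5, 7)  # "human reading" range quoted from the BitNet README


def speedup_over_reader(tokens_per_s: float, reader_tps: float) -> float:
    """How many times faster generation is than one person reading."""
    return tokens_per_s / reader_tps


for reader_tps in HUMAN_READING_TPS:
    ratio = speedup_over_reader(TOKENS_PER_S, reader_tps)
    print(
        f"{reader_tps} tok/s reader: {ratio:.0f}x faster, "
        f"{math.log10(ratio):.2f} orders of magnitude"
    )
```

Read the other way round, the same ratio is roughly how many readers a single stream could keep supplied if its output were shared between them, which is the parallel-clients point above.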

majke · 1y ago · 2 in thread

I wonder if there is a token/watt metric. Afaiu cerebras uses plenty of power/cooling.

accrual · 1y ago

I found this on their product page, though just for peak power:

> At 16 RU, and peak sustained system power of 23kW, the CS-3 packs the performance of a room full of servers into a single unit the size of a dorm room mini-fridge.

It's pretty impressive looking hardware.

https://cerebras.ai/product-system/
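
A tokens-per-watt figure reduces to tokens per joule, since a watt is a joule per second. The sketch below only shows the formula. Plugging in the 23 kW peak figure and the 2,100 tokens/s headline is a loud assumption: the thread does not say how many systems serve the model, and the headline number is not stated to be aggregate system throughput, so the printed values should not be read as a real efficiency measurement.

```python
def tokens_per_joule(tokens_per_s: float, power_watts: float) -> float:
    """Throughput divided by power draw (W = J/s) gives tokens per joule."""
    return tokens_per_s / power_watts


def joules_per_token(tokens_per_s: float, power_watts: float) -> float:
    """The inverse view: energy spent per generated token."""
    return power_watts / tokens_per_s


# Assumed inputs, for illustration only (see the caveats above).
PEAK_POWER_W = 23_000  # "peak sustained system power of 23kW" for one CS-3
TOKENS_PER_S = 2100    # headline speed from the announcement title

print(f"{tokens_per_joule(TOKENS_PER_S, PEAK_POWER_W):.3f} tokens/J")
print(f"{joules_per_token(TOKENS_PER_S, PEAK_POWER_W):.1f} J/token")
```

A meaningful comparison with other hardware would need total throughput across all concurrent users and measured rather than peak power.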

menaerus · 1y ago

Weighing 800kg (!). Like, what the heck.

neals · 1y ago · 1 in thread

So what is inference?

jonplackett · 1y ago

Inference just means using the model, rather than training it.

As far as I know Nvidia still has a monopoly on the training part.

d4rkp4ttern · 1y ago

For those looking to easily build on top of this or other OpenAI-compatible LLM APIs -- you can have a look at Langroid[1] (I am the lead dev): you can easily switch to cerebras (or groq, or other LLMs/Providers). E.g. after installing langroid in your virtual env, and setting up CEREBRAS_API_KEY in your env or .env file, you can run a simple chat example[2] like this:

    python3 examples/basic/chat.py -m cerebras/llama3.1-70b

Specifying the model and setting up basic chat is simple (and there are numerous other examples in the examples folder in the repo):

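    import langroid as lr
    import langroid.language_models as lm
    # Select the Cerebras-hosted model through Langroid's OpenAI-compatible config.
    llm_config = lm.OpenAIGPTConfig(chat_model="cerebras/llama3.1-70b")
    agent = lr.ChatAgent(
        lr.ChatAgentConfig(llm=llm_config, system_message="Be helpful but concise")
    )
    # Wrap the agent in a task and start the interactive chat loop.
    task = lr.Task(agent)
    task.run()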

[1] https://github.com/langroid/langroid

[2] https://github.com/langroid/langroid/blob/main/examples/basi...

[3] Guide to using Langroid with non-OpenAI LLM APIs: https://langroid.github.io/langroid/tutorials/local-llm-setu...
