Cerebras launches inference for Llama 3.1; benchmarked at 1846 tokens/s on 8B

(twitter.com)

95 points · _micah_h · 1y ago · 42 comments

32 comments · 9 top-level

freediver1y ago· 8 in thread

Yep, it is fast. Now, what exactly Llama 8B is useful for is another matter - what are some good use cases?

One scenario I can think of is roleplaying - but I would assume that the slow streaming speed was kind of a feature there.

seldo1y ago

For agentic use cases, where you might need several round-trips to the LLM to reflect on a query, improve a result, etc., getting fast inference means you can do more round-trips while still responding in reasonable time. So basically any LLM use-case is improved by having greater speed available IMO.

freediver1y ago

The problem with this is tok/sec does not tell you what time to first token is. I've seen (with Groq) where this is large for large prompts, nullifying the advantage of faster tok/sec.
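A rough way to see how time to first token (TTFT) and tok/sec interact in the round-trip scenario above is a simple latency model: one call costs TTFT plus output tokens divided by throughput. This is only a sketch. The 1846 tokens/s figure comes from the headline; the TTFT values, the 200-token output, and the 5-second budget are hypothetical numbers chosen for illustration, not measurements of any provider.

```python
def response_time_s(ttft_s: float, output_tokens: int, tokens_per_s: float) -> float:
    """Wall-clock time for one call: wait for the first token, then stream the rest."""
    return ttft_s + output_tokens / tokens_per_s


def round_trips_within(budget_s: float, ttft_s: float,
                       output_tokens: int, tokens_per_s: float) -> int:
    """Number of sequential LLM calls that fit inside a latency budget."""
    return int(budget_s // response_time_s(ttft_s, output_tokens, tokens_per_s))


if __name__ == "__main__":
    budget = 5.0   # hypothetical: seconds the user is willing to wait
    out = 200      # hypothetical: tokens generated per call

    # Fast decode, small TTFT (TTFT value is an assumption)
    print(round_trips_within(budget, ttft_s=0.2, output_tokens=out, tokens_per_s=1846))
    # Same decode speed, but a large TTFT (e.g. a long prompt) eats the budget
    print(round_trips_within(budget, ttft_s=2.0, output_tokens=out, tokens_per_s=1846))
    # Small TTFT, but a slow decode speed
    print(round_trips_within(budget, ttft_s=0.2, output_tokens=out, tokens_per_s=50))
```

Under these assumed numbers the three cases allow 16, 2, and 1 round trips respectively, which is the point both comments make: fast decoding buys more round trips, but only while TTFT stays small.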

rgbrgb1y ago

Speed is useful for batch tasks or doing a bunch of serial tasks quickly. E.g. "take these 1000 pitch decks and give me 5 bullets on each", "run this prompt 100 times and then pick the best response", "detect which of these 100k comments mention the SF Giants".

drdaeman1y ago

8B is not exactly great for roleplaying if we set the bar at all high. It is just not sophisticated enough, as it has very limited "reasoning"-like capabilities and can normally draw sensible conclusions only about very basic things (like if it's raining, maybe the character will get wet). It can and will hallucinate about stuff like inventories or rules - and it's not a context length thing. If there are multiple NPCs, things get worse, as they all start to get mixed up.

70B does significantly better in this regard. Nowhere close to perfection, but the frequency of WTFs about the LLM's output is [subjectively] drastically lower.

Speed can be useful in RP if we run multiple LLM-based agents (like "plot", "goal checker", "inventory", "validation", "narrator") that function-call each other to achieve some goal.

wkat42421y ago

These wafers only have 44GB of RAM though. Very curious why the quantity is so low considering the chips are absolutely massive. It's SRAM though so very fast, comparable to cache in a modern CPU. But I assume being fast and loading the whole model there is the point.
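To put the 44GB figure in perspective, here is a back-of-the-envelope check of how much memory the weights alone need. It is a sketch under stated assumptions: 16-bit weights (the thread does not say what precision is used), decimal gigabytes, and no allowance for KV cache or activations, so real requirements would be higher.

```python
import math

SRAM_GB = 44  # on-wafer SRAM figure quoted in the comment above


def weights_gb(params_billion: float, bits_per_weight: int) -> float:
    """Memory for the weights only, in decimal GB."""
    return params_billion * bits_per_weight / 8


def wafers_needed(params_billion: float, bits_per_weight: int,
                  sram_gb: float = SRAM_GB) -> int:
    """Minimum number of 44GB wafers needed to hold the weights entirely in SRAM."""
    return math.ceil(weights_gb(params_billion, bits_per_weight) / sram_gb)


if __name__ == "__main__":
    for name, size in [("8B", 8), ("70B", 70)]:
        gb = weights_gb(size, 16)  # 16-bit weights are an assumption
        print(f"Llama 3.1 {name}: {gb:.0f} GB of weights, "
              f"at least {wafers_needed(size, 16)} wafer(s)")
```

With these assumptions the 8B model (16 GB) fits on a single wafer, while the 70B model (140 GB) would have to be split across at least four.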

halJordan1y ago

What kind of answer are you looking for? Just start asking it questions. The constant demand for a magic silver bullet use case applicable to every person in the country is wild. If you have to ask, you're not using it.

What exact use case did google.com enable you to do that made it worthwhile for everyone to immediately start using? It let you access nytimes.com? Access amazon.com? No, it let you ask off the wall, asinine, long tail questions no one else asked.

bottlepalm1y ago

Surveillance states and intelligence agencies.

Or maybe an MMO with a town of NPCs.

benopal641y ago

Why can't the MMO with a town of NPCs have an intelligence agency too?

phkahler1y ago· 6 in thread

The winner will be one of two approaches: 1) Getting great performance using regular DRAM - system memory. 2) Bringing the compute to the RAM chips - DRAM is accessed 64Kb per row (or more?) and at ~10ns per read you can use small/slow ALUs along the row to do MAC operations. Not sure how you program that though.

Current "at home" inference tends to be limited by how much RAM your graphics card has, but system RAM scales better.

eth0up1y ago

I'll probably get stoned for asking here, but... since you seem knowledgeable on the subject:

I just got llama3.1-8b (standard and instruct). However, I cannot do anything with it on my current hardware. Can you recommend the best AI model that I: 1) can self host 2) run on 16GB ram with no dedicated graphics card and an old intel i5 3) use on Debian without installing a bunch of exo-repo mystery code?

Any recommendation, directly or semi related would be appreciated - I'm doing my 'research' but haven't made much progress nor had any questions answered.

smokel1y ago

Running LLMs on that kind of hardware will be very slow (expect responses with only a few words per second, which is probably pretty annoying).

LM Studio [1] makes it very easy to run models locally and play with them. Llama 3.1 will only run in quantized form with 16GB RAM, and that cripples it quite badly, in my opinion.

You may try Phi-3 Mini, which has only 3.8B weights and can still do fun things.

[1] https://lmstudio.ai/
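The "few words per second" expectation can be sanity-checked with a memory-bandwidth bound: generating one token requires reading roughly every weight once, so tokens/s cannot exceed memory bandwidth divided by model size. The sketch below assumes dual-channel DDR3-1600 at its theoretical peak of 25.6 GB/s (a guess at what an old i5 might have, not something stated in the thread) and 4-bit quantization; real throughput will be lower than this upper bound.

```python
def est_tokens_per_s(params_billion: float, bits_per_weight: int,
                     mem_bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed when every weight is read once per token."""
    model_gb = params_billion * bits_per_weight / 8
    return mem_bandwidth_gb_s / model_gb


if __name__ == "__main__":
    bandwidth = 25.6  # GB/s, assumed: dual-channel DDR3-1600 theoretical peak

    # Llama 3.1 8B at 4-bit, and Phi-3 Mini (3.8B) at 4-bit
    for name, size in [("Llama 3.1 8B", 8.0), ("Phi-3 Mini", 3.8)]:
        print(f"{name}: at most {est_tokens_per_s(size, 4, bandwidth):.1f} tokens/s")
```

Under these assumptions the bound is about 6.4 tokens/s for the 8B model and about 13.5 tokens/s for Phi-3 Mini, which is consistent with the advice to try the smaller model on this hardware.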

arcanemachiner1y ago

Setting up Ollama via Docker was the easiest way for me to get up and running. Not 100% sure if it fits your constraints, but highly recommended.

ein0p1y ago

+1. For inference especially, compute is abundant and basically free in terms of energy. Almost all of the energy is spent on memory movement. The logical solution is to not move unaggregated data.

mikewarot1y ago

Completely eliminating the separation between RAM and compute is why FPGAs are so fast: they do most of the computing as a series of lookup tables (LUTs), and optimize for latency and utilization with fancy switching fabrics.

The downside of the switching fabrics is that optimizing a design to fit an FPGA can sometimes take days.

rfoo1y ago

The winner, unfortunately, will be on cloud inference.

mikewarot1y ago· 6 in thread

Why is it so gosh darned slow? If you've got enough transistors to hold 44 gigabytes of RAM, you've got enough to have the whole model stored on-chip with no need for off-chip transfers.

I'd expect tokens out at 1 GHz aggregate. Anything less than 1 MHz is a joke.... ok, not a joke, but surprisingly slow.

twothreeone1y ago

Even if they could generate tokens at that speed on the chip (which maybe they can in theory?), you need to get user tokens onto the chip and the resulting model tokens off again and transport them to the user as well. This means at some point the I/O becomes the bottleneck, not the compute. I also suspect it will get faster still; from the announcement it didn't sound like it's "optimal" yet.

cma1y ago

User tokens onto the chip and output tokens out are tiny.

chessgecko1y ago

On-die communication isn’t free. A lot of things here are sequential, and within matrix multiplies the cores have to transfer outputs and memory loads have to be distributed. It’s really fast, but not one-cycle fast.

mikewarot1y ago

You could add a series of latches, use the magic of graph coloring to eliminate any timing issues, and pipeline the thing sufficiently to get a GHz of throughput, even if it takes many cycles to make it all the way through the pipe.

Personally, I'd put all the parameters in NOR flash, then cycle through the row lines sequentially to load the parameters into the MAC. You could load all the inputs in parallel as fast as the dynamic power limits of the chip allow. If you use either DMA or a hardware ring buffer to push all the tokens through the layers, you could keep the throughput going with various sizes of models, etc.

Obviously with only one MAC you couldn't have a single stream at a GHz, but you could have 4000 separate streams of 250,000 tokens/second.

GaggiX1y ago

It only needs to compute about a trillion floating-point operations per token, and each layer relies on the previous one.

I wonder why it doesn't output a billion tokens per second.

ein0p1y ago

The coarse estimate of compute in transformers is about as many MACs as there are weights, or twice as many flops (because multiplication and addition are counted as separate operations). So for llama 70b that’s about 70b MACs per token, which is manageable. What’s far less manageable is reading the entire model into RAM N times a second
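The estimate above can be written out as arithmetic. The sketch below applies the "one MAC per weight, two FLOPs per MAC" rule and then computes how much weight-read bandwidth a single unbatched stream would need at the headline 1846 tokens/s. The assumptions are mine: round parameter counts (8e9 and 70e9), 16-bit weights, and batch size 1 so that every weight is read once per generated token.

```python
def flops_per_token(params: float) -> float:
    """Coarse transformer estimate: one MAC per weight, two FLOPs per MAC."""
    return 2 * params


def weight_read_bandwidth_tb_s(params: float, bytes_per_weight: int,
                               tokens_per_s: float) -> float:
    """Bandwidth needed to re-read all weights once per token, in TB/s."""
    return params * bytes_per_weight * tokens_per_s / 1e12


if __name__ == "__main__":
    # Compute per token for a 70B model: 70e9 MACs, i.e. 140e9 FLOPs
    print(f"70B: {flops_per_token(70e9):.3g} FLOPs per token")

    # Bandwidth for one 8B stream at the headline speed (16-bit weights assumed)
    bw = weight_read_bandwidth_tb_s(8e9, 2, 1846)
    print(f"8B at 1846 tokens/s: {bw:.1f} TB/s of weight reads")
```

Under these assumptions the 8B figure works out to roughly 29.5 TB/s, which illustrates why the memory traffic, rather than the arithmetic, is the hard part.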

wkat42421y ago· 2 in thread

Wow one chip taking up a whole wafer. I bet their yields are low, though I assume they're not using the bleeding edge process but a slightly older one that's totally worked out.

Still, the price of one of these would be nuts if they sold them. Upwards of 1 million?

Havoc1y ago

Guessing it’s set up in a way where they can just disable dead cores

twothreeone1y ago

Process defects can be located and routed around statically on the chip, it's described e.g. here: https://youtu.be/8i1_Ru5siXc?t=810

bkitano191y ago· 1 in thread

Time to first token is just as important to know for many use cases, yet people rarely report it.

Gcam1y ago

See here for our TTFT metric benchmarks: https://artificialanalysis.ai/models/llama-3-1-instruct-70b/...

russ1y ago

Here’s an AI voice assistant we built this weekend that uses it:

https://x.com/dsa/status/1828481132108873979?s=46&t=uB6padbn...

ein0p1y ago

8b models won’t even need a server a year from now. Basically the only reason to go to the server a year or two from now will be to do what edge devices can’t do: general purpose chat, long context (multimodal especially), data augmented generation that relies on pre-existing data sources in the cloud, etc. And on the server it’s very expensive to run batch size 1. You want to maximize the batch size while also keeping an eye on time to first token and time per token. Basically 20-25 tok/sec generation throughput is a good number for most non-demo workloads. TTFT for median prompt size should ideally be well under 1 sec.

But I’m happy they got this far. It’s an ambitious vision, and it’s extra competition in a field where it’s severely lacking.

ChrisArchitect1y ago

[dupe]

More discussion on official post: https://news.ycombinator.com/item?id=41369705

cheptsov1y ago

Very interested in playing with their hardware and cloud. Also I wonder if it’s possible to try cloud without contacting their sales.
