Sohu – first specialized chip (ASIC) for transformer models

(twitter.com)

62 points · rkwasny · 1y ago · 21 comments

20 comments · 9 top-level

yarri · 1y ago · 4 in thread

Details from their technical memo at https://www.etched.com/announcing-etched

## How can we fit so much more compute on the silicon?

The NVIDIA H200 has 989 TFLOPS of FP16/BF16 compute without sparsity. This is state-of-the-art (more than even Google’s new Trillium chip), and the GB200 launching in 2025 has only 25% more compute (1,250 TFLOPS per die).

Since the vast majority of a GPU’s area is devoted to programmability, specializing on transformers lets you fit far more compute. You can prove this to yourself from first principles:

It takes 10,000 transistors to build a single FP16/BF16/FP8 multiply-add circuit, the building block for all matrix math. The H100 SXM has 528 tensor cores, and each has $4 \times 8 \times 16$ FMA circuits. Multiplying tells us the H100 has 2.7 billion transistors dedicated to tensor cores.

*But an H100 has 80 billion transistors! This means only 3.3% of the transistors on an H100 GPU are used for matrix multiplication!*
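The memo's arithmetic can be checked with a short sketch. Every figure below is taken from the quoted memo (10,000 transistors per multiply-add circuit, 528 tensor cores, $4 \times 8 \times 16$ circuits per core, 80 billion transistors total); none is independently verified here, and the function names are made up for illustration.

```python
# Figures as quoted in the memo excerpt; treated as assumptions, not verified.
TRANSISTORS_PER_FMA = 10_000           # per FP16/BF16/FP8 multiply-add circuit
TENSOR_CORES = 528                     # H100 SXM, per the memo
FMAS_PER_TENSOR_CORE = 4 * 8 * 16      # 512 circuits per tensor core
TOTAL_TRANSISTORS = 80_000_000_000     # whole H100 die, per the memo


def tensor_core_transistors() -> int:
    """Transistors spent on multiply-add circuits under the memo's assumptions."""
    return TRANSISTORS_PER_FMA * TENSOR_CORES * FMAS_PER_TENSOR_CORE


def tensor_core_fraction() -> float:
    """Share of the die's transistors that those circuits represent."""
    return tensor_core_transistors() / TOTAL_TRANSISTORS


if __name__ == "__main__":
    print(f"{tensor_core_transistors() / 1e9:.2f} billion transistors")
    print(f"{tensor_core_fraction():.2%} of the die")
```

The product comes to roughly 2.7 billion transistors, a little under 3.4% of 80 billion, so the memo's 3.3% appears to be rounded down.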

This is a deliberate design decision by NVIDIA and other flexible AI chips. If you want to support all kinds of models (CNNs, LSTMs, SSMs, and others), you can’t do much better than this.

By only running transformers, we can fit way more FLOPS on our chip, without resorting to lower precisions or sparsity.

## Isn’t memory bandwidth the bottleneck on inference?

For modern models like Llama-3, no!

torginus · 1y ago

Honestly, all this math sounds a bit fishy to me. An H200 has about 5 TB/s of bandwidth. If we assume a pure matrix-multiply workload, we need to fetch 2 FP16 values (4 bytes) per operation, which means we are capped at 1.25 TFLOPs. Even in the best-case scenario, where one of the operands is cached and the other is an FP8, we are only at 5 TFLOPs, which is way less than what the H200 can do.

I don't get how throwing more ALUs at the problem would make things better, it's very much bandwidth constrained.

That's why Groq exists which has a ton of SRAM on chip.
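The bound in this comment can be reproduced with a small sketch. It assumes, as the comment does, that every operation fetches its operands from memory with no reuse; the 5 TB/s figure is the comment's own approximation and the function name is hypothetical.

```python
def max_ops_per_second(bandwidth_bytes_per_s: float, bytes_fetched_per_op: float) -> float:
    """Upper bound on operations per second when every operand comes from memory."""
    return bandwidth_bytes_per_s / bytes_fetched_per_op


HBM_BANDWIDTH = 5e12  # "about 5 TB/s", as stated in the comment

# Two FP16 operands (2 bytes each) fetched for every operation.
no_reuse = max_ops_per_second(HBM_BANDWIDTH, 2 * 2)

# Best case from the comment: one operand cached, the other a 1-byte FP8 value.
best_case = max_ops_per_second(HBM_BANDWIDTH, 1)

if __name__ == "__main__":
    print(f"no reuse:  {no_reuse / 1e12:.2f} trillion ops/s")
    print(f"best case: {best_case / 1e12:.2f} trillion ops/s")
```

Under these assumptions the cap is 1.25 trillion operations per second, or 5 trillion in the best case. The bound only holds if each fetched value is used once.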

aifath · 1y ago

Matmul is cubic compute, but quadratic memory.

For [M, K] @ [K, N], the read is O(MK + NK) and the compute is O(MNK). A quick estimate for compute/bandwidth is min(M, N, K). M is the batch size, so they can just blow that up to get nice-looking numbers. On Llama 70B, min(N, K) is 3584 and 7168 for matmuls 1 and 2.

Groq needs a ton of SRAM because they optimized for batch size 1 latency, so M is very small.
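A rough sketch of the reuse argument: count multiply-adds per input element read for an [M, K] @ [K, N] product, ignoring output writes as the comment does. The layer size below is a hypothetical round number, not a figure from any particular model, and the function name is made up for illustration.

```python
def macs_per_element_read(m: int, n: int, k: int) -> float:
    """Multiply-adds per input element read for [m, k] @ [k, n] (output writes ignored)."""
    macs = m * n * k                 # cubic compute
    elements_read = m * k + k * n    # quadratic memory traffic
    return macs / elements_read


if __name__ == "__main__":
    N = K = 4096  # hypothetical weight-matrix dimensions
    for batch in (1, 8, 64, 512):
        ratio = macs_per_element_read(batch, N, K)
        print(f"batch {batch:>3}: {ratio:7.1f} multiply-adds per element read")
```

The ratio simplifies to MN / (M + N), which never exceeds min(M, N). At batch size 1 it sits just below one multiply-add per element read, and it grows almost linearly with the batch while M stays much smaller than N.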

boznz · 1y ago

nothing about power consumption

yarri · 1y ago

This is a datacenter chip. HVAC requirements are more interesting IMO; they seem to be targeting air-cooled edge deployments with that card. They’ll probably wind up with a baseboard design similar to the early v4i TPUs.

https://ieeexplore.ieee.org/document/9499913

modeless · 1y ago · 3 in thread

Blog post: https://www.etched.com/announcing-etched

I don't see a lot of detail on the actual architecture. Hard to evaluate if it makes sense. The idea of specializing in transformers has merit I think. It seems like this is only for inference, not training. Although inference accelerators are definitely important and potentially lucrative, I find it hard to get excited about them. To me the exciting thing is accelerating progress in capabilities, which means training.

The title of the blog post is "Etched is Making the Biggest Bet in AI", and it is a pretty big bet to make an ASIC only for transformers. Even something like 1.5-bit transformers could come along and make their hardware blocks obsolete in six months. Actually I would love to see an ASIC implementation of 1.5 bit transformers, it could probably be far more efficient than even this chip.

airstrike · 1y ago

I know very little about hardware, but could they presumably take whatever they learn from this chip and transfer some of that knowledge into a 1.5bit implementation?

wmf · 1y ago

Of course, but designing the next chip takes two years or more. If you're always two years behind, you're dead.

1 more reply

bilater · 1y ago

I actually think it's a huge deal if this works, as it unlocks use cases like real-time AI video and instant natural voice agents, as well as probably a bump in reasoning (either one-shot or for multi-step processes) through techniques like generating 100s of answers and picking one.

Yes, ultimately the big W is smarter models, but this is a huge step up on this (local or global?) optimum.

nick238 · 1y ago · 2 in thread

Switching over all the ASIC capacity that was going to AntMiners from Crypto to AI?

wmf · 1y ago

Bitmain doesn't use much capacity.

baobabKoodaa · 1y ago

gwern · 1y ago · 1 in thread

Some background: https://www.lesswrong.com/posts/cB2Rtnp7DBTpDy3ii/memory-ban...

Chamix · 1y ago

Forgive me if I'm missing your existing realization (I did a quick check of your HN, reddit, twitter, LW), but I think the big deal with Sohu (wrt Etched) is that they have pivoted from "all model parameters hard-etched onto the chip" to "only transformer (matmul etc.) ops etched onto the chip".

Sohu does not have the LLaMA 70b weights directly lithographed onto the silicon, as you seem to be implying by linking to that 6-month-old post.

Seems like a sensible pivot; I'd imagine they're rather up to date on the pulse of dynamically updated nets potentially being a major feature in upcoming frontier models, as you've recently been commentating on. However, I'm not deep enough in it to be sure how much this removes their differentiation vs other AI accelerator startups.

airstrike · 1y ago · 1 in thread

Related: "AI startup Etched raises $120 million to develop specialized chip"

https://www.reuters.com/technology/artificial-intelligence/a...

The CEO was also on Bloomberg Technology today talking about their strategy a bit. There's an article but I didn't find a video of the interview after quick googling:

https://www.bloomberg.com/news/articles/2024-06-25/ai-chip-s...

samspenc · 1y ago

Bloomberg has posted the interview video on their channel now https://www.youtube.com/watch?v=zh6REnqwXe4

mikewarot · 1y ago

So I've read through the blog posts, etc., and there are no actual details on how it works. I suspect it's a huge matrix multiply/accumulate chip with registers for all the weights and biases that get loaded once per section of the model, so that you only spend I/O on getting the tokens in and out, and on doing the softmax, etc. on a CPU.

I've considered similar thoughts with my toy BitGrid model... except I'd actually take the weights and compile them to boolean logic, since it would improve utilization. Program the chip, throw parameters in one side, get them out (later) on the other.

lukaslezevicius · 1y ago

Would the most likely impact of this be a decrease in the cost of inference if these chips are manufactured at scale?

jhylau · 1y ago

but can they run RAG on these things? you can't just run a pure pre-trained LLM as that will have limited use cases.

Bluestein · 1y ago

> Etched has partnered with Taiwan Semiconductor Manufacturing Co. (2330.TW) to fabricate the chips. Uberti said the company needs the series-A funding to defray the costs of sending its designs to TSMC

... if it wasn't already, TSMC is going to become pivotal. Ergo, Taiwan. Ergo, stability in the region ...
