Transformers on Chips

(etched.ai)

94 points · vasinov · 2y ago · 62 comments

guberti · 2y ago · 6 in thread

Founder here!

We're still in stealth, but I'll be able to share details and performance figures soon.

Our first product is a bet on transformers. If we're right, there's enormous upside - being transformer-specific lets you get an order of magnitude more compute than more flexible accelerators (GPUs, TPUs).

We're hiring - if the EV makes sense for you, reach out at gavin @ etched.ai

mmaunder · 2y ago

You’re still in stealth but you’re asking us to meet your supercomputer and are sharing benchmarks that are unfalsifiable. Better show and tell a lot more real soon.

jeswin · 2y ago

You seem to be downvoted because of the lack of details. Upvoted and thanks for commenting.

1. Do you have a working prototype?

2. Are the pictures real (or close), or entirely CGI?

That would win over a lot of people here on HN.

WhitneyLand · 2y ago

Hey Gavin, I regret how much skepticism you faced here especially from me.

I believe honest feedback is important but that it should be given in the most productive way possible.

I always want to see fellow entrepreneurs succeed here, and will definitely keep an open mind as you release more details. Best of luck!

abrichr · 2y ago

Very promising, excited to learn more!

Any thoughts on State Space Models?

Eg:

https://github.com/havenhq/mamba-chat

https://arxiv.org/abs/2311.18257

paulgerhardt · 2y ago

Curious what approach you’re using. I did some work replicating this paper on an arty7 fpga: https://arxiv.org/abs/2210.08277 - any similarities?

joshspankit · 2y ago

What’s your projected “model to chip” turnaround?

masterofsome · 2y ago · 5 in thread

Isn't this kinda pigeonholing yourself to one neural network architecture? Are we sure that transformers will take us to the promised land? Chip design is a pretty expensive and time consuming process, so if a new architecture comes out that is sufficiently different from the current transformer model wouldn't they have to design a completely new chip? The compute unit design is probably similar from architecture to architecture, so maybe I am misunderstanding...

quickthrower2 · 2y ago

It’s a bet. Probably a good one to make. The upside of being the ones who have an AI chip (not a graphics chip larping as an AI chip) is huge. It will run faster and more cheaply. You get to step all over OpenAI, or get a multi-billion-dollar deal to supply Microsoft data centres. Or these ship on every new laptop etc. You get to be the next unicorn ($1tn company). So that is a decent bet for investors, assuming the team can deliver. Yes, the danger is that there is a new architecture that runs on a CPU and that for practical purposes whoops attention’s ass. In which case investors can throw some money at ASICifying that.

dontwearitout · 2y ago

Yep, transformers showed up in 2017, nearly 7 years ago, and they still wear the crown. Maybe some new architecture will come to dominate eventually, but I would love a low-cost PCIe board that could run 80B transformer models today.

GaggiX · 2y ago

Well, GPT-4 runs on a transformer architecture, and even if for unknown reasons GPT-4 is the upper limit of what you can achieve with transformer models, having hardware specialized to run the architecture extremely fast would always be very useful for many tasks (the tasks GPT-4 can already handle at least).

Me1000 · 2y ago

This was my first thought too. Even if transformers turn out to be the holy grail for LLMs, people are still interested in diffusion models for image generation.

I think we’re about to see a lot of interesting specialized silicon for neural nets in the coming years, but locking yourself into a specific kind of model seems a little too specialized right now.

dontwearitout · 2y ago

Diffusion models could actually be implemented with transformers, hypothetically. Their training and inference procedures are what make diffusion models unique, not the model architecture.

WhitneyLand · 2y ago · 4 in thread

I am not buying this at all. But I’m not a hardware guy so maybe someone can help with why this is not true:

- Crypto hardware needed SHA-256, which is basically tons of bitwise operations. That’s way simpler than the tons of matrix ops transformers need.

- NVidia wasn’t focused on crypto acceleration as a core competency. They are focused on this, and are already years down the path.

- One of the biggest bottlenecks is memory bandwidth. That is also not cheap or simple to do.

- Say they do have a great design. What process are they going to build it on? There are some big customers out there waiting for TSMC space already.

Maybe they have IP and it’s more of a patent play.

(I mention crypto only as an example of custom hardware competing with a GPU)

treesciencebot · 2y ago

> One of the biggest bottlenecks is memory bandwidth. That is also not cheap or simple to do.

This is precisely why people are trying to put logic into memory instead of just making the logic chips simpler. Compute being 10x faster doesn't mean much when you want real-time, near-zero latency in current (and potentially future) ML workloads. Memory bandwidth at low batch sizes is much more important, and even though this chip comes with HBM3E (which is cutting edge), that by itself won't make this faster than H200/MI300X.

zaptrem · 2y ago

Iirc Ethereum ASICs were also memory-bandwidth bound. With KV caching, transformers are just lots and lots of matrix-vector multiplication and are bound by loading the huge weight matrices onto the cores.
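To put rough numbers on that claim, here is a minimal sketch. It assumes a hypothetical 70B-parameter model with 16-bit weights on a hypothetical accelerator with 3 TB/s of memory bandwidth and 1 PFLOP/s of peak compute. None of these figures come from Etched or any vendor; they are placeholders. It also ignores KV-cache reads and assumes two FLOPs (one multiply-add) per weight per generated token.

```python
def decode_step_bounds(n_params, bytes_per_param, mem_bw, peak_flops):
    """Lower bounds (seconds) for generating one token for one sequence.

    Assumes every weight is read from memory once per step and used in
    one multiply-add (2 FLOPs), as in a matrix-vector product.
    """
    t_memory = n_params * bytes_per_param / mem_bw
    t_compute = 2 * n_params / peak_flops
    return t_memory, t_compute


# Illustrative placeholder numbers, not measurements of any real chip.
t_memory, t_compute = decode_step_bounds(
    n_params=70e9, bytes_per_param=2, mem_bw=3e12, peak_flops=1e15
)
print(f"memory-bound time:  {t_memory * 1e3:.2f} ms")
print(f"compute-bound time: {t_compute * 1e3:.2f} ms")
print(f"ratio: {t_memory / t_compute:.0f}x")
```

Under these assumptions, reading the weights takes a few hundred times longer than the arithmetic, which is the sense in which single-stream decoding is memory-bandwidth bound.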

smugma · 2y ago

https://www.eetimes.com/harvard-dropouts-raise-5-million-for...

“Uberti cites bitcoin mining chips as an example of a successful specialized ASIC offering.”

The founder also references crypto, so your comparison is an apt rebuttal to an argument you didn’t know they were making.

Overall, the article gives a small bit of detail, which is infinitely more than can be gleaned from the website.

wokwokwok · 2y ago

You are not the only one who is skeptical.

Nvidia has devoted an astronomical amount of effort to supporting AI as their “next big thing”.

…and here is some information-free landing page showing perf which is an order of magnitude above what Nvidia is offering.

…but no numbers. You can get called out for numbers.

A vague infographic is much safer.

When things seem too good to be true, they usually are.

I'd guess it's some custom hardware with some cherry-picked metric, but frankly the whole thing screams scam.

If it was that easy, Amazon, Google, etc would have already done it with their proven ability to make new silicon.

qeternity · 2y ago · 4 in thread

Yeah I call BS on this. This does nothing to address the main issue with autoregressive transformer models (memory bandwidth).

GPU compute units are mostly sitting idle these days, waiting for on-chip cache to receive data from VRAM.

This does nothing to solve that.

dramlord · 2y ago

You can amortize memory loading with large continuous batching. I imagine more compute would help for certain workloads like speculative decoding.

qeternity · 2y ago

Batching helps throughput and anyone running in production will be doing batching.

But it's not free, and still comes at a cost of per-stream latency.

Speculative decoding seems less effective in practice than in theory.
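The trade-off in this exchange can be sketched with a back-of-envelope, roofline-style model. This is a minimal sketch under stated assumptions: the default model size, bandwidth, and peak compute are hypothetical placeholders rather than figures for any real chip, weights are read once per decode step regardless of batch size, and KV-cache traffic is ignored.

```python
def batched_step(batch, n_params=70e9, bytes_per_param=2,
                 mem_bw=3e12, peak_flops=1e15):
    """Return (step_time_s, tokens_per_s) for one decode step at a batch size.

    Weights are read once per step regardless of batch size; compute
    scales with the batch. KV-cache traffic is ignored (an assumption).
    """
    t_memory = n_params * bytes_per_param / mem_bw
    t_compute = 2 * n_params * batch / peak_flops
    step_time = max(t_memory, t_compute)
    return step_time, batch / step_time


for batch in (1, 64, 512, 2048):
    step_time, throughput = batched_step(batch)
    print(f"batch={batch:5d}  step={step_time * 1e3:7.2f} ms  "
          f"throughput={throughput:9.0f} tok/s")
```

In this model, small batches raise throughput at no cost in step time because the weight read dominates. Once the batch is large enough for compute to dominate, throughput stops improving and every stream waits longer per token, which is the per-stream latency cost mentioned above.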

pavelstoev · 2y ago

Not exactly idle but only at around 30% utilization on average (measured on a ~900 GPU cluster over ~25 days)

sp332 · 2y ago

If it's at 30% utilization then it's "mostly idle".

andy99 · 2y ago · 3 in thread

There is a lot going on in the LLM / AI chip space. Most of the big players, like Cerebras and Untether, are focusing on general-purpose AI chips. This, which I understand to be more like an ASIC, is an interesting market: these chips give up flexibility but can presumably be made more cheaply. There is also Positron AI in this space, mentioned here: https://news.ycombinator.com/item?id=38601761

I'm only peripherally aware of ASICs for bitcoin mining; I have no idea about the economics or cycle times. It would be interesting to see a comparison between bitcoin mining chips and AI.

One thing I wonder about is that all of AI is very forward looking, i.e., anticipating there will be applications to warrant building more infrastructure. It may be a tougher sell to convince someone they need to buy a transformer inference chip now as opposed to something more flexible they'll use in an imagined future.

Zenst · 2y ago

Only one certainty: HBM makers will be doing nicely in the current climate, as all these AI processing options are using it in larger and larger volumes. Those will be the unnoticed winners in this rush.

lumost · 2y ago

In the cloud, these chips will compete head to head with GPUs. If they are able to pull off a 10x price/performance win without excessive porting work… it’ll take off in a heartbeat.

Zenst · 2y ago

Like ASIC Bitcoin miners did. There are parallels here in how it might just pan out.

andy_xor_andrew · 2y ago · 3 in thread

interesting how MCTS decoding is called out. that seems entirely like a software aspect, which doesn't depend on a particular chip design?

and on the topic of MCTS decoding, I've heard lots of smart people suggest it, but I've yet to see any serious implementation of it. it seems like such an obviously good way to select tokens, you'd think it would be standard in vllm, TGI, llama.cpp, etc. But none of them seem to use it. Perhaps people have tried it and it just doesn't work as well as you would think?

fnbr · 2y ago

It’s very difficult to implement, and requires training the network to use it.

I worked at DeepMind on projects that used MCTS. Even with access to the AlphaZero source code, it was very difficult to write another implementation that got the same results as the original.

andy_xor_andrew · 2y ago

I'm really curious about this part:

> and requires training the network to use it.

I thought one of the benefits of MCTS was, if you already have your value network, then a general MCTS implementation can walk the tree of values created by that network. And so no special update to the model is necessary. But I'm probably wrong about this.

(also, it boosts my confidence to hear that even folks at DeepMind find MCTS difficult to implement :D Because I tried to implement a simple MCTS a few years back for a very small toy project. I was following a step-by-step explanation of how it worked, and even still, it was super difficult, and very prone to subtle bugs)

fizx · 2y ago

Doesn't MCTS imply that you'd have to generate a whole tree of tokens? Instead of maybe a 200 token response, you'd have to generate several thousand tokens as you explore the tree?
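A rough count supports this estimate. This is a minimal sketch, assuming an AlphaZero-style search where each simulation costs one model evaluation (a value estimate rather than a full rollout) and the search tree is rebuilt for every emitted token. The function name and the simulation counts are made-up placeholders.

```python
def mcts_model_evaluations(response_tokens, simulations_per_token):
    """Model forward passes needed to emit a response with per-token search.

    Assumes one evaluation per simulation and no reuse of the search
    tree between emitted tokens (both simplifying assumptions).
    """
    return response_tokens * simulations_per_token


for sims in (1, 16, 64):
    total = mcts_model_evaluations(response_tokens=200, simulations_per_token=sims)
    print(f"{sims:3d} simulations/token -> {total} evaluations")
```

Even at 16 simulations per token, a 200-token response costs 3,200 evaluations, which is in line with the "several thousand" figure. Reusing the subtree under the chosen token would lower the count somewhat.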

29athrowaway · 2y ago · 3 in thread

What about Transformers on FPGAs?

chabons · 2y ago

AI models on FPGAs have been tried before, for instance: https://www.microsoft.com/en-us/research/project/project-cat....

They haven't been able to compete with GPUs on perf/watt. In general you end up just designing some AI accelerator for the FPGA (because the models are too big to map onto a single device all at once), but it's hard to beat purpose-built tensor and vector HW on a GPU when you're running soft logic.

mikewarot · 2y ago

FPGAs are designed to fight latency as much as possible. To do this, they have networks of switches to shuttle bits across the chip and keep delays to the bare minimum, in order for synchronous logic to be able to run at the highest possible clock rates for signals that traverse the entire chip.

To meet this goal, there's a huge amount of effort required to compile a program written in Verilog, VHDL, etc. into a set of bits that can be used to program all of the switching logic and lookup tables in the chip. I'm led to believe it can sometimes take a day or more per compile.

The second factor optimized for in FPGAs is utilization, trying to use 100% of the available resources of the chip. This is never achieved in practice.

Because everything is optimized for speed, it's not very power efficient.

---

Generally, FPGAs aren't the right architecture for neural networks. If you could load all of the weights into the LUTs, and leave them there, you'd get the type of speedups you want, but those scales of FPGA just don't exist.

imtringued · 2y ago

> I'm led to believe it can sometimes take a day or more per compile.

This is true and misleading at the same time. Filling a large FPGA takes time, but if you are working with a small FPGA the turnaround time can be 15 minutes.

OJFord · 2y ago · 2 in thread

Where did this come from? There is absolutely nothing clickable except 'contact us' which just reloads the same page? There's almost zero information here?

Osiris · 2y ago

Maybe you have JS disabled? It’s one of those fancy animate-as-you-scroll websites.

OJFord · 2y ago

No, I see the animation as I scroll. Very little information though, and no links as far as I can tell to more anywhere. The one clickable element to contact them seems broken.

duskwuff · 2y ago · 1 in thread

Title was a bit of a letdown. I was hoping for a discussion of silicon planar transformers (like, the electrical component), which are of increasing interest in RF ICs. :)

Taniwha · 2y ago

Yeah me too, they really ought to explain themselves better

bigdict · 2y ago · 1 in thread

Product page like this... they haven't even designed the chip. Complete vaporware.

Animats · 2y ago

Oh. Not that you can tell from the web site.

ilaksh · 2y ago · 1 in thread

Wow. I wish I could get a computer or VM/VPS with this. Or rent part of one. Use it with quantized models and llama.cpp.

Seems like a big part of using these systems effectively is thinking of ways to take advantage of batching. I guess the normal thing is just to handle multiple users' requests simultaneously. But maybe another one could be moving from working with agents to agent swarms.

zitterbewegung · 2y ago

I don’t see them doing direct sales; it looks like a cloud offering.

For training, the big part of using these things isn’t batching; it’s mainly designing the network, cleaning the data, and then training it to get results. Training involves batching, but it’s already baked into the libraries.

For inference, you take the trained model, which is huge, load it into memory, and have it predict output. This architecture is designed not to need quantization: you lower precision when you want to use less memory, while this has a huge amount of memory. To handle multiple users’ requests you don’t need batching; a message queue with multiple receivers, each with a copy of the latest trained model, would work.

adriangrigore · 2y ago · 1 in thread

I don't have a good scrollwheel, not easy to browse the site. :(

williadc · 2y ago

Spacebar worked pretty well for me.

krasin · 2y ago

My comment is about the general idea (LLM transformers on a chip), not this particular company, as I have no insight into the latter.

Such a chip (with support for LoRA finetuning) would likely be the enabler for the next-gen robotics.

Right now, there is a growing corpus of papers and demos that show what's possible, but these demos are often a talk-to-a-datacenter ordeal, which is not suitable for any serious production use: too high latency, too much dependency on the Internet.

With a low-latency, cost- and energy-efficient way to run finetuned LLMs locally (and keep finetuning based on the specific robot experience), we can actually make something useful in the real world.

rvz · 2y ago

This only tells me we are at peak AI hype, given that products like this have to dress up ASICs as 'Transformers on Chips' or 'Transformer Supercomputer'.

As always, no technical reports or in-depth benchmarks other than an unlabelled chart comparing against Nvidia H100s with little context and marketing jargon to the untrained eye.

It seems that this would tie you into a specific neural net implementation (i.e., llama.cpp as an ASIC) and would require a hardware design change to support another.

teaearlgraycold · 2y ago

Could probably go even faster burning GPT-4's weights right into the silicon. No need to even load weights into memory.

Granted, that eliminates the ability to update the model. But if you already have a model you like that's not a problem.

nojvek · 2y ago

How expensive will this be?

100T models on one chip with MCTS search.

That is some impressive marketing.

I’ll believe it when I see it.

Great to see so many hardware startups.

Future is deffo accelerated neural nets on hardware.

jwenig · 2y ago

Given that you believe the transformer is the future, this could flip the state of latency & cost to run these models overnight.

mynameisnoone · 2y ago

Nonfunctional requirement: Decepticon logo in the chip art. It can't hurt and always adds 10 HP.

jadbox · 2y ago

Wake me up when I can buy GPT-4 etched into a dedicated chip to use as a realtime personal copilot.
