We're still in stealth, but I'll be able to share details and performance figures soon.
Our first product is a bet on transformers. If we're right, there's enormous upside - being transformer-specific lets you get an order of magnitude more compute than more flexible accelerators (GPUs, TPUs).
We're hiring - if the EV makes sense for you, reach out at gavin @ etched.ai
1. Do you have a working prototype?
2. Are the pictures real (or close), or entirely CGI?
That would win over a lot of people here on HN.
I believe honest feedback is important but that it should be given in the most productive way possible.
I always want to see fellow entrepreneurs succeed here, and will definitely keep an open mind as you release more details. Best of luck!
Any thoughts on State Space Models?
Eg:
I think we’re about to see a lot of interesting specialized silicon for neural nets in the coming years, but locking yourself into a specific kind of model seems a little too specialized right now.
- Crypto hardware needed SHA256 which is basically tons of bitwise operations. That’s way simpler than the tons of matrix ops transformers need.
- Nvidia wasn’t focused on crypto acceleration as a core competency. They are focused on this, and are already years down the path.
- One of the biggest bottlenecks is memory bandwidth. That is also not cheap or simple to do.
- Say they do have a great design. What process are they going to build it on? There are some big customers out there waiting for TSMC space already.
Maybe they have IP and it’s more of a patent play.
(I mention crypto only as an example of custom hardware competing with a GPU)
This is precisely why people are trying to put logic into memory instead of just making the logic chips simpler. Compute being 10x faster doesn't mean much when you want real-time, near-zero latency in current (and potentially future) ML workloads. Memory bandwidth at low batch sizes is much more important, and even though this chip comes with HBM3E (which is cutting edge), that by itself won't make this faster than H200/MI300X.
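To put rough numbers on the bandwidth point, here is a back-of-envelope sketch. It assumes a simple roofline model: each decode step reads all the weights from memory once (shared across the batch) and costs about 2 FLOPs per parameter per token, with KV-cache traffic ignored. The function name and every figure (70B parameters, 2 bytes per weight, 4.8 TB/s, 1000 TFLOP/s) are illustrative assumptions, not specs for any particular chip.

```python
def decode_step_time(params, bytes_per_param, mem_bw, flops, batch):
    """Estimate one decode step under a simple roofline model.

    Returns (step_time, memory_time, compute_time) in seconds.
    Assumes weights are read once per step and ~2 FLOPs/param/token.
    """
    memory_time = params * bytes_per_param / mem_bw
    compute_time = 2 * params * batch / flops
    return max(memory_time, compute_time), memory_time, compute_time


# Illustrative, assumed numbers (not vendor figures).
PARAMS = 70e9      # model parameters
BYTES = 2          # bytes per weight (16-bit)
BW = 4.8e12        # memory bandwidth, bytes/s
FLOPS = 1e15       # peak compute, FLOP/s

for batch in (1, 8, 64, 512):
    step, mem_t, cmp_t = decode_step_time(PARAMS, BYTES, BW, FLOPS, batch)
    bound = "memory" if mem_t >= cmp_t else "compute"
    print(f"batch={batch:4d}  step={step * 1e3:7.2f} ms  ({bound}-bound)")
```

Under these assumptions, multiplying `FLOPS` by 10 leaves the batch-1 step time unchanged, because the weight read dominates; extra compute only pays off once the batch is large enough to become compute-bound.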
“Uberti cites bitcoin mining chips as an example of a successful specialized ASIC offering.”
The founder also references crypto, so your comparison is an apt rebuttal to an argument you didn’t know they were making.
Overall, the article gives a small bit of detail, which is infinitely more than can be gleaned from the website.
Nvidia has devoted an astronomical amount of effort to supporting AI as their “next big thing”.
…and here is some information-free landing page showing perf which is an order of magnitude above what nvidia is offering.
…but no numbers. You can get called out for numbers.
A vague infographic is much safer.
When things seem too good to be true, they usually are.
I guess some custom hardware with some cherry picked metric here, but frankly the whole thing screams scam.
If it was that easy, Amazon, Google, etc would have already done it with their proven ability to make new silicon.
GPU compute units are mostly sitting idle these days waiting for chip cache to receive data from VRAM.
This does nothing to solve that.
But it's not free, and still comes at a cost of per-stream latency.
Speculative decoding seems less effective in practice than in theory.
I'm only peripherally aware of ASICs for bitcoin mining; I have no idea about the economics or cycle times. It would be interesting to see a comparison between bitcoin mining chips and AI.
One thing I wonder about is that all of AI is very forward looking, ie anticipating there will be applications to warrant building more infrastructure. It may be a tougher sell to convince someone they need to buy a transformer inference chip now as opposed to something more flexible they'll use in an imagined future.
And on the topic of MCTS decoding, I've heard lots of smart people suggest it, but I've yet to see any serious implementation of it. It seems like such an obviously good way to select tokens, you'd think it would be standard in vllm, TGI, llama.cpp, etc. But none of them seem to use it. Perhaps people have tried it and it just doesn't work as well as you would think?
I worked at DeepMind on projects that used MCTS. Even with access to the AlphaZero source code, it was very difficult to write another implementation that got the same results as the original.
> and requires training the network to use it.
I thought one of the benefits of MCTS was, if you already have your value network, then a general MCTS implementation can walk the tree of values created by that network. And so no special update to the model is necessary. But I'm probably wrong about this.
(also, it boosts my confidence to hear that even folks at DeepMind find MCTS difficult to implement :D Because I tried to implement a simple MCTS a few years back for a very small toy project. I was following a step-by-step explanation of how it worked, and even still, it was super difficult, and very prone to subtle bugs)
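For what "a general MCTS walking the tree with an existing network" might look like for token selection, here is a minimal sketch using an AlphaZero-style PUCT rule. It is not the AlphaZero implementation, and it says nothing about whether an untrained-for-search network gives good results. `Node`, `puct_select`, `mcts`, and `toy_evaluate` are hypothetical names; `toy_evaluate` is a stand-in for a frozen policy/value network, with a made-up prior that prefers "a" and a made-up value that only rewards sequences starting with "b".

```python
import math


class Node:
    def __init__(self, prior):
        self.prior = prior        # policy probability of this token
        self.visits = 0
        self.value_sum = 0.0
        self.children = {}        # token -> Node

    def value(self):
        return self.value_sum / self.visits if self.visits else 0.0


def puct_select(node, c_puct=1.5):
    """Return the (token, child) pair maximizing Q + U."""
    sqrt_n = math.sqrt(node.visits)

    def score(item):
        child = item[1]
        u = c_puct * child.prior * sqrt_n / (1 + child.visits)
        return child.value() + u

    return max(node.children.items(), key=score)


def mcts(root_seq, evaluate, n_sims=200, max_len=4):
    """Pick the next token by search. evaluate(seq) -> (priors, value)."""
    root = Node(prior=1.0)
    for _ in range(n_sims):
        node, seq, path = root, list(root_seq), [root]
        # Selection: descend until we hit an unexpanded node.
        while node.children:
            token, node = puct_select(node)
            seq.append(token)
            path.append(node)
        # Expansion and evaluation with the frozen network.
        priors, value = evaluate(seq)
        if len(seq) - len(root_seq) < max_len:
            for token, p in priors.items():
                node.children[token] = Node(p)
        # Backup: single-agent setting, so no sign flip.
        for visited in path:
            visited.visits += 1
            visited.value_sum += value
    # Choose the most-visited root child.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]


def toy_evaluate(seq):
    """Hypothetical network: prior likes 'a', value likes a leading 'b'."""
    priors = {"a": 0.7, "b": 0.3}
    value = 1.0 if seq and seq[0] == "b" else 0.0
    return priors, value


print("greedy pick:", max(toy_evaluate([])[0], key=toy_evaluate([])[0].get))
print("search pick:", mcts([], toy_evaluate))
```

In this toy setup the search overrides the greedy choice because the value signal disagrees with the prior. Note the cost: each simulation is a full network evaluation, so 200 simulations per token is 200 forward passes per token.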
They haven't been able to compete with GPUs on perf/watt. In general you end up just designing some AI accelerator for the FPGA (because the models are too big to map onto a single device all at once), but it's hard to beat purpose-built tensor and vector HW on a GPU when you're running soft logic.
To meet this goal, there's a huge amount of effort required to compile a program written in Verilog, VHDL, etc. into a set of bits that can be used to program all of the switching logic and lookup tables in the chip. I'm led to believe it can sometimes take a day or more per compile.
The second factor optimized for in FPGAs is utilization, trying to use 100% of the available resources of the chip. This is never achieved in practice.
Because everything is optimized for speed, it's not very power efficient.
---
Generally, FPGAs aren't the right architecture for neural networks. If you could load all of the weights into the LUTs, and leave them there, you'd get the type of speedups you want, but those scales of FPGA just don't exist.
This is true and misleading at the same time. Filling a large FPGA takes time, but if you are working with a small FPGA the turnaround time can be 15 minutes.
Seems like a big part of using these systems effectively is thinking of ways to take advantage of batching. I guess the normal thing is just to handle multiple users' requests simultaneously. But maybe another one could be moving from working with agents to agent swarms.
For training, the big part of using these things isn’t batching; it’s mainly designing the network, cleaning the data, and then training it to get results. Training involves batching, but it’s already baked into the libraries.
For inference, you take the trained model, which is huge, load it into memory, and have it predict output. This architecture is designed not to need quantization: lower precision is for when you want to use less memory, while this has a huge amount of memory. To handle multiple users’ requests you don’t do batching; a message queue with multiple receivers, each with a copy of the latest trained model, would work.
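The "message queue with multiple receivers" setup described above could be sketched roughly like this, using only the Python standard library. `fake_model`, `worker`, and `serve` are hypothetical names, and `fake_model` is just a stand-in for a loaded model replica; a real deployment would use separate processes or machines rather than threads.

```python
import queue
import threading


def fake_model(prompt):
    # Hypothetical stand-in for one loaded copy of the trained model.
    return prompt.upper()


def worker(worker_id, requests, results):
    model = fake_model  # each receiver would hold its own model copy
    while True:
        item = requests.get()
        if item is None:  # sentinel: no more work for this receiver
            requests.task_done()
            break
        req_id, prompt = item
        # One request at a time per receiver: no batching involved.
        results[req_id] = (worker_id, model(prompt))
        requests.task_done()


def serve(prompts, n_workers=3):
    requests = queue.Queue()
    results = {}
    threads = [
        threading.Thread(target=worker, args=(i, requests, results))
        for i in range(n_workers)
    ]
    for t in threads:
        t.start()
    for req_id, prompt in enumerate(prompts):
        requests.put((req_id, prompt))
    for _ in threads:
        requests.put(None)  # one sentinel per receiver
    for t in threads:
        t.join()
    # Return outputs in the original request order.
    return [results[i][1] for i in range(len(prompts))]


print(serve(["hello", "which chip?", "how fast?"]))
```

The tradeoff with this design is that every receiver needs its own full copy of the weights and processes one request per pass, whereas batching shares a single weight read across many requests.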
Such a chip (with support for LoRA finetuning) would likely be the enabler for the next-gen robotics.
Right now, there is a growing corpus of papers and demos that show what's possible, but these demos are often a talk-to-a-datacenter ordeal, which is not suitable for any serious production use: too high latency, too much dependency on the Internet.
With a low-latency, cost- and energy-efficient way to run finetuned LLMs locally (and keep finetuning based on the specific robot experience), we can actually make something useful in the real world.
As always, no technical reports or in-depth benchmarks other than an unlabelled chart comparing against Nvidia H100s with little context and marketing jargon to the untrained eye.
It seems that this would tie you into a specific neural net implementation (i.e. llama.cpp as an ASIC) and would require a hardware design change to support another.
Granted, that eliminates the ability to update the model. But if you already have a model you like that's not a problem.
100T models on one chip with MCTS search.
That is some impressive marketing.
I’ll believe it when I see it.
Great to see so many hardware startups.
Future is deffo accelerated neural nets on hardware.