My guess is that they realized that just selling hardware is a lot harder than running it themselves. Deploying this level of compute is non-trivial, with very high rates of failure, as well as huge supply chain issues. If you have to sell the hardware and support people buying it, that is a world of trouble.
> no-one wants to take the risk of buying a whole bunch of hardware
I do!
Nobody has stated it yet, but this is probably great news for tenstorrent.
Disclosure: building a cloud compute provider starting with AMD MI300x, and eventually any other high end hardware that our customers are asking for.
SemiAnalysis did some cost estimates, and I did some of my own, but you’re likely paying somewhere in the $12 million range for the equipment to serve a single query using Llama-70B. Compare that to a couple of GPUs, and it’s easy to see why they are struggling to sell hardware: they can’t scale down.
Since they didn’t use HBM, you need to stitch enough cards together to get the memory to hold your model. It takes a lot of 256 MB cards to get to 64 GB, and there isn’t a good way to try the tech out, since a single rack really can’t serve an LLM.
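To put a rough number on "a lot of cards", here is a back-of-the-envelope sketch in Python. It uses only the two figures quoted above (256 MB per card, 64 GB target) and assumes binary units and that the weights are the only thing that has to fit; KV cache, activations, and any duplication across cards are ignored, so treat the result as a lower bound. The function name is made up for illustration.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3

def cards_needed(total_bytes: int, per_card_bytes: int) -> int:
    """Smallest number of cards whose combined memory covers total_bytes."""
    return math.ceil(total_bytes / per_card_bytes)

# Figures taken from the comment above: 256 MB per card, 64 GB to hold the model.
cards = cards_needed(64 * GB, 256 * MB)
print(cards)
```

Under those assumptions this comes out to 256 cards just to hold the weights, which is why you can't meaningfully try this out with one or two cards.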
The cloud provider path sounds riskier since that’s two capital intensive businesses, chip design and production and running a cloud service provider.
It does seem like an odd move in that case. I liken this to a company like Bitmain. Why sell the miners when you could just run them yourselves? Well, fact is that they do both. But in this case, Groq is turning off the sales. Who knows, maybe it just ends up being a temporary thing until they can sort all of the pieces out.
Since then, one of the co-founders blocked me on Twitter for pointing out that, despite their claims, they were not the first to get MI300x into production. Neither were we: ElioVP gets that trophy, then Lamini, then GigaIO, making us 4th and them 5th. I could go on and on with weird stuff I've seen them do, but it just isn't productive here.
Anyway, I think we have some overlap, since we are both among the few startups on the planet that actually have MI300x. But beyond that, I believe strongly that this space is large enough for multiple players and I don't see a need to be weird with each other. Apparently, I'm not on the same page though.
¯\_(ツ)_/¯
What is the difference between this and having to sell the cloud access and supporting the people who buy a subscription?
Margins.
Pricing for cloud compute is much higher and servicing and management for the provider is much cheaper.
If I sold hardware directly, I'd often be on the hook for support contracts, which can get pricey with hardware and distract from shipping future-facing product features, since customers who purchase directly have longer upgrade windows due to logistical overhead.
Knowledge/training.
If you're shipping a brand new hardware arch, exposed as raw hardware, then you're on the hook for training everyone in the world and fixing all their weird edge case uses.
I.e. are you willing to invest in Intel/AMD/Nvidia-scale QA and support?
If you're exposing a PaaS (or even IaaS), then you have some levers you can tweak / mask behind the scenes, so only your team need be experts at low-level operations.
For a fast-paced company, the latter model makes a lot more sense, at least until hardware+software stabilizes.
Good thing that I'm a glutton for punishment.
(also, +100 to valuing honesty and transparency)
I'll let you know once I get my hands on them again. There really isn't enough public information about them at all. So far, my friends at ElioVP [0] have published a blog post. Still with not enough detail for my taste, but I'm pretty sure he is limited by what he can talk about. Luckily, I am not.
I mention in another comment below that my current goal is to get a bunch of people to perform testing on them and then publish blog posts along with open source code. This way, we can start a repository of CI/CD tests to see how things improve with time. ROCm 6.1 is rumored to be quite an improvement.
[0] https://www.evp.cloud/post/diving-deeper-insights-from-our-l...
https://www.reddit.com/r/LocalLLaMA/comments/1bpgrdf/wanted_...
I've got about a dozen people signed up. Just working through some hardware issues right now (see above about high rate of failures), and hope to have this resolved next week, so that I can get people onto them and doing their testing.
5 yr old silicon (14 nm!!) and no hbm.
Their secret sauce seems to be an ahead-of-time compiler that statically lays out entire computation, enabling zero contention at runtime. Basically, they stamp out all non-determinism.
(the way I understood it => it's still cost effective at scale due to throughput increase this brings)
No doubt fast SRAM helps, but from a computation pov, imho it's that they've statically planned the computation and eliminated all locks.
Short explainer here: https://www.youtube.com/watch?v=H77tV1KcWIE (Based on their paper).
So they attacked the italicized portion and simplified the hardware. Mostly by eliminating memory-layer non-determinism / using time-sync'd global memory instructions as part of the ISA(?).
This apparently reduced the difficulty of the compiler problem to something manageable (but no doubt still "fun")... and voila, performance.
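For anyone who wants a feel for what "statically planned" means, here is a toy sketch in Python. It is not Groq's compiler or ISA; the unit names, op names, and latencies are all invented. The only point it illustrates is that if every op has a fixed, known latency, a compiler can assign each op a unit and a start cycle ahead of time, so nothing has to arbitrate or lock at runtime.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    name: str
    unit: str          # functional unit the op runs on (hypothetical names)
    latency: int       # fixed cycle count, known at compile time
    deps: tuple = ()   # names of ops whose results this op consumes

def static_schedule(ops):
    """Assign every op a fixed start cycle before anything runs.

    Ops must be listed in dependency order. Each op starts at the first
    cycle where all its inputs are ready and its unit is idle.
    """
    finish = {}      # op name -> cycle at which its result is available
    unit_free = {}   # unit name -> first cycle at which the unit is idle
    schedule = {}    # op name -> (unit, start cycle)
    for op in ops:
        ready = max((finish[d] for d in op.deps), default=0)
        start = max(ready, unit_free.get(op.unit, 0))
        schedule[op.name] = (op.unit, start)
        finish[op.name] = start + op.latency
        unit_free[op.unit] = start + op.latency
    return schedule

# A made-up four-op program: two loads feeding a matmul, then an activation.
program = [
    Op("load_a", "mem", 2),
    Op("load_b", "mem", 2),
    Op("matmul", "matrix_unit", 4, deps=("load_a", "load_b")),
    Op("relu", "vector_unit", 1, deps=("matmul",)),
]
schedule = static_schedule(program)
for name, (unit, start) in schedule.items():
    print(f"cycle {start:2d}: {name} on {unit}")
```

The resulting table is the whole "runtime": each unit just does what the table says at the given cycle. A real compiler also has to place data and route it between chips, which is presumably where the "fun" comes in.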
eDRAM is essentially a tradeoff between SRAM and DRAM, offering much greater density than SRAM at the cost of somewhat worse throughput and latency.
There were a couple of POWER CPUs that used eDRAM as L3 cache, but it seems to have fallen out of favor.
I just benchmarked some perf for some of my larger context window queries last week and groq's API took 1.6 seconds versus 1.8 to 2.2 for OpenAI GPT-3.5-turbo. So, it wasn't much faster. I almost emailed their support to see if I was doing something wrong. Would love to hear any details about your workload or the complexity of your queries.
In most cases, overall response time is dominated by output, since it is ~100x slower per token than input.
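A quick sketch of why output dominates, in Python. The ~100x ratio is the figure from the comment above; the prompt and completion sizes are made-up examples, and the model is deliberately crude (it ignores network latency, queueing, and batching).

```python
def output_share(n_input: int, n_output: int, output_cost_ratio: float = 100.0) -> float:
    """Fraction of total generation time spent on output tokens.

    Time is measured in units of "one input token", so each output token
    costs output_cost_ratio units.
    """
    input_time = n_input * 1.0
    output_time = n_output * output_cost_ratio
    return output_time / (input_time + output_time)

# Hypothetical request: a 2000-token prompt producing a 200-token answer.
share = output_share(n_input=2000, n_output=200)
print(f"{share:.1%}")
```

Even with a prompt ten times longer than the answer, about 91% of the time goes to output under this model, so a benchmark with short outputs won't show much of a gap between providers.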
I believe certain companies would kill for 20% performance improvements on their main product.
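For reference, here is what the timings quoted above (1.6 s vs. 1.8 to 2.2 s) work out to as a relative reduction in response time. This is a Python sketch using only those three numbers; the helper name is made up.

```python
def time_saved(baseline_s: float, candidate_s: float) -> float:
    """Relative reduction in response time versus the baseline."""
    return (baseline_s - candidate_s) / baseline_s

groq_s = 1.6
low = time_saved(1.8, groq_s)    # against the fastest GPT-3.5-turbo run
high = time_saved(2.2, groq_s)   # against the slowest GPT-3.5-turbo run
print(f"{low:.0%} to {high:.0%} less time per request")
```

That is roughly an 11% to 27% reduction, so "20%" is a fair midpoint for that particular workload.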
they probably bought NVDA stock :)
That said I think their arch is super interesting. I just think that demo was way too hype when the actual system is pretty impractical.
For a minimum 100 wafers = 10k chips, Groq may have paid $100M = $10k/chip purely in amortizing design costs.
Chip design (software + engineer time) and fabrication setup (lithography masks) grow exponentially [1][2] with smaller nodes, e.g., maybe $100M for Groq's current 14nm chips to ~$500M for their planned 4nm tapeout. Once you reach mass production (>>1000 wafers, which have ~150 large chips each), wafers are $10k each. On top of this, it takes ~1 year to design then have prototypes made. (These same issues still exist on older slower nodes, albeit not as bad.)
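The amortization arithmetic can be sketched in Python as follows. The inputs are the guesses from this comment ($100M of design and mask cost, $10k per wafer); 100 chips per wafer is what the "100 wafers = 10k chips" figure implies, although the ~150 per wafer mentioned above would lower the numbers somewhat. The $10k wafer price is quoted for mass production, so applying it to a 100-wafer run is an optimistic assumption. The function name is invented.

```python
def cost_per_chip(nre_usd: float, wafers: int, chips_per_wafer: int, wafer_usd: float) -> float:
    """One-time design cost spread over every chip, plus the chip's share of a wafer."""
    chips = wafers * chips_per_wafer
    return nre_usd / chips + wafer_usd / chips_per_wafer

# Minimum run from the comment: 100 wafers, i.e. 10k chips.
small_run = cost_per_chip(100e6, wafers=100, chips_per_wafer=100, wafer_usd=10_000)
# Hypothetical mass-production run: 10,000 wafers, i.e. 1M chips.
large_run = cost_per_chip(100e6, wafers=10_000, chips_per_wafer=100, wafer_usd=10_000)
print(small_run, large_run)
```

At 10k chips the design cost is essentially the whole price ($10,000 of the $10,100); at 1M chips it falls to $100 per chip and the wafer cost matters just as much.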
This could be reduced somewhat if chip design software were cheaper and margins were lower, but maybe 20% of this cost is due to fundamental manufacturing difficulty.
(disclosure: I don't work with recent tech nodes myself; this is my best guess)
[1] https://www.semianalysis.com/p/the-dark-side-of-the-semicond... [2] https://www.extremetech.com/computing/272096-3nm-process-nod...
Think about the amount of money being dumped into "AI" at this point. If you've got the technology and people to make stuff faster/better/cheaper, finding investors to dump money into your chip making business is probably not as hard as it was 2 years ago.
Groq is making this change for reasons other than the expense of taping out chips.
can’t comment on specifics, but imo our hardware team punches above its weight class in terms of # of people and time spent in design.
Their hardware was never for people at home, but for cloud providers.
A 7B model would then be able to run on about 60 LPUs. Even at $20,000 per card that would be only $1.2 million and I highly doubt the cost is actually that high, that's just what DigiKey says the cost of an LPU is, if you're trying to buy just one :)
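A quick sanity check on those numbers in Python. The card count and price are the ones quoted above; the 16-bit weight size is my assumption, not something stated in the thread.

```python
cards = 60
price_per_card_usd = 20_000
total_usd = cards * price_per_card_usd

params = 7e9
bytes_per_param = 2  # assumption: 16-bit weights
implied_mb_per_card = params * bytes_per_param / cards / 1e6

print(total_usd, round(implied_mb_per_card))
```

That gives $1.2 million, and about 233 MB of weights per card, which is in the same ballpark as the ~256 MB per card mentioned elsewhere in the thread.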
Tenstorrent also looks incredibly Python-specific (as in, everything including their SMI seems mostly Python-based) which doesn't seem promising?
> what do you mean I can’t just drop a CUDA docker image in?
A hardware startup that sells cloud access to its hardware. :-)
What? How does this make sense?
Groq is still under a 30 request per minute rate-limit, which drops to 10 requests per minute if you have all day usage.
Billing has been "coming soon" this whole time, and while they've built out hype-enabling features like function calling, somehow they can't set up a Stripe webhook to collect money for realistic rate limits.
They couldn't scream "we can't service the tiniest bit of our demand" any louder at this point.
_
Edit: For anyone looking for fast inference without the smoke and mirrors, I've been using Fireworks.ai in production and it's great. 200 tk/s - 300 tk/s is closer to Groq than it is to OpenAI and co.
And as a bonus they support PEFT with serverless pricing.
I just have free API access with no ability to add a credit card.
The story telling site alone averaged 27k requests a day this week, so about double what their current request limit is, and honestly not even that popular of a site.
You can't run much more than a toy project on their current rate limits.
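For anyone checking the math, here it is as a small Python sketch. The 10 and 30 requests-per-minute limits and the 27k requests/day figure are the ones mentioned in this thread; it assumes traffic is spread perfectly evenly over the day, which real traffic never is, so the real headroom is worse than this.

```python
MINUTES_PER_DAY = 24 * 60

def daily_cap(requests_per_minute: int) -> int:
    """Most requests you could send in a day at a constant per-minute limit."""
    return requests_per_minute * MINUTES_PER_DAY

sustained_cap = daily_cap(10)   # limit once you have all-day usage
burst_cap = daily_cap(30)       # nominal limit
ratio = 27_000 / sustained_cap  # the storytelling site's traffic vs. the cap

print(sustained_cap, burst_cap, round(ratio, 2))
```

The sustained limit works out to 14,400 requests a day, so 27k a day is about 1.9x over it.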
First, the system-of-chips architectures that everyone is talking about will increase the total SRAM available, keeping more model state in very fast memory and avoiding trips to slow memory.
Secondly, anyone serious about their data (enterprises) won't be okay with making API calls to Groq. Anyone serious about their data who also has a lot of scale (consumer internet) won't be okay with making expensive API calls to Groq at scale either.
Their cloud is attractive only if I can use their API for experiments and toy apps, so I can keep developing in this direction while the rest of the major industry players' system-of-chips architectures catch up and solve the SRAM size and manufacturing process bottlenecks. Once that's solved, I get more powerful compute for fewer $$ to deploy on-prem.
So, this cloud strategy is short-lived. I see another pivot on the horizon.
Linked article:
If customers come with requests for high volumes of chips for very large installations, Groq will instead propose partnering on data center deployment... and yet, they're still leading the field.
I think it's a bit early to think the field is getting commoditized yet.
Do you have any idea how fast Groq is? Go try it. Consistently over 400 t/s for most of the models that they support, and extremely low latency.
Remember that EU -> US adds ~150 ms of unavoidable latency, for example. Your comparison is then a local H100 vs. Groq + 150 ms latency to first token.
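To see how much that 150 ms matters, here is a rough Python sketch. The 400 t/s figure is the Groq number quoted in this thread; the 100 t/s for a local H100 is purely a placeholder assumption, so plug in your own measurement. It also ignores every other contribution to time-to-first-token (prompt processing, queueing).

```python
def total_time_s(n_tokens: float, tokens_per_s: float, extra_latency_s: float = 0.0) -> float:
    """Time to stream n_tokens at a constant rate, plus a fixed up-front delay."""
    return extra_latency_s + n_tokens / tokens_per_s

def break_even_tokens(extra_latency_s: float, remote_tps: float, local_tps: float) -> float:
    """Output length at which the faster-but-farther endpoint catches up."""
    if remote_tps <= local_tps:
        raise ValueError("the remote endpoint never catches up")
    return extra_latency_s / (1 / local_tps - 1 / remote_tps)

# 150 ms network penalty, 400 t/s remote, 100 t/s local (local rate is an assumption).
n = break_even_tokens(0.150, remote_tps=400, local_tps=100)
print(round(n))
```

With those placeholder rates the break-even is about 20 output tokens: shorter answers finish sooner locally, longer ones finish sooner remotely despite the extra latency.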
I want to use it, but it's been very unreliable. I have been using Claude 3 and thinking about together.ai with Mixtral.
> “There might need to be a new term, because by the end of next year we’re going to deploy enough LPUs that compute-wise, it’s going to be the equivalent of all the hyperscalers combined,” he said. “We already have a non-trivial portion of that.”
Really? Does anyone seriously believe they are going to be the equivalent of all hyperscalers in compute next year? (Where Meta alone is at 1 million H100 equivalents.) In the same article where they say it's too hard for them to sell chips? And when they literally don't have a setup to even accept a credit card today?
https://www.tomshardware.com/news/no-sram-scaling-implies-on...
IIRC the last big jump for SRAM density was at 7nm, so they do still have that card to play, but progress has slowed to a crawl beyond that. TSMC 3nm SRAM is barely denser than TSMC 7nm SRAM.
I think one major challenge they'll face is that their architecture is incredibly fast at running the ~10-100B parameter open-source models, but starts hitting scaling issues with state-of-the-art models. They need 10k+ chips for a GPT-4-class model, but their optical interconnect only supports a few hundred chips.
[1] https://www.zach.be/p/why-is-everybody-talking-about-groq