100x defect tolerance: How we solved the yield problem

(cerebras.ai)

331 points · jwan584 · 1y ago · 179 comments


96 comments · 30 top-level

ajb1y ago· 17 in thread

So they massively reduce the area lost to defects per wafer, from 361 to 2.2 square mm. But from the figures in this blog, this is massively outweighed by the fact that they only get 46222 sq mm useable area out of the wafer, as opposed to 56247 that the H100 gets - because they are using a single square die instead of filling the circular wafer with smaller square dies, they lose 10,025 sq mm!

Not sure how that's a win.

Unless the rest of the wafer is useable for some other customer?
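The area comparison above can be checked with a few lines of arithmetic. This is only a sketch: the two usable-area figures are the ones quoted in this comment, and the 300 mm wafer diameter is an assumption (the standard size for leading-edge processes), not a number taken from the article.

```python
import math

# Figures quoted in the comment above (mm^2).
WSE_USABLE_MM2 = 46_222
H100_USABLE_MM2 = 56_247

# Assumption: a standard 300 mm wafer.
WAFER_DIAMETER_MM = 300.0


def wafer_area_mm2(diameter_mm: float) -> float:
    """Area of a circular wafer."""
    return math.pi * (diameter_mm / 2) ** 2


wafer_area = wafer_area_mm2(WAFER_DIAMETER_MM)
area_gap = H100_USABLE_MM2 - WSE_USABLE_MM2

if __name__ == "__main__":
    print(f"wafer area:        {wafer_area:,.0f} mm^2")
    print(f"WSE utilisation:   {WSE_USABLE_MM2 / wafer_area:.1%}")
    print(f"H100 utilisation:  {H100_USABLE_MM2 / wafer_area:.1%}")
    print(f"area gap:          {area_gap:,} mm^2")
```

The gap between the two layouts is far larger than the 361 mm^2 vs 2.2 mm^2 defect loss, which is the point being made.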

nine_k1y ago

It's a win because they have to test one chip, and don't have to spend resources on connecting the chiplets. The latter costs a lot (though it has other advantages). I suspect that a chiplet-based device with a total of 900k cores would simply not be viable due to the size constraints.

If their routing around the defects is automated enough (given the highly regular structure), it may be a massive economy of efforts on testing and packaging the chip.

ungreased06751y ago

Why does it have to be a square? There’s no need to worry about interchangeable third-party heat sink compatibility. Is it possible to make it an irregular polygon instead of square?

kristjansson1y ago

Additional wafer area would be a marginal increase in performance (+~20% core count, best case) but increases the complexity of their design, and requires they figure out how to package/connect/house/etc. a non-standard shape. A wafer scale chip is already a huge tech risk, why spend more novelty budget on nonessential weirdness?

Scaevolus1y ago

Why does their chip have to be rectangular, anyways? Couldn't they cut out a (blocky) circle too?

Qwertious1y ago

You need a rectilinear polygon that tessellates, and has the fewest sides possible to minimize the number of cuts necessary. And it would probably help the cutting if the shape is entirely convex, so that cuts can overshoot a bit without damaging anything.

That suggests a rectangle is the only possible shape.

2 more replies

nine_k1y ago

Rather, I wonder why they even need to cut off the extra space instead of putting something there. I suppose that the structure of the device is highly rectangular from the logical PoV, so there's nothing useful to put there. I suspect smaller unrelated chips can be produced on these areas along the way.

guyzero1y ago

I've never cut a wafer, but I assume cutting is hard and single straight lines are the easiest.

1 more reply

yannyu1y ago

The cost driver for fabbing out wafers is the number of layers and the number of usable devices per wafer. Higher layer count increases cost and tends to decrease yield, and more robust designs with higher yields increase usable devices per wafer. If circles or other shapes could help with either of those, they would likely be used. Generally the end goal is to have the most usable devices per wafer, so they'll be packed as tightly as possible on the wafer so as to have the highest potential output.

1 more reply

olejorgenb1y ago

Is the wafer itself so expensive? I assume they don't pattern the unused area, so the process should be quicker?

addaon1y ago

> I assume they don't pattern the unused area

I’m out of date on this stuff, so it’s possible things have changed, but I wouldn’t make that assumption. It is (used to be?) standard to pattern the entire wafer, with partially-off-the-wafer dice around the edges of the circle. The reason is that etching behavior depends heavily on the surrounding area: the amount of silicon or copper or whatever etched in your neighborhood affects the speed of etching for you, which affects line width. For a single mask used for the whole wafer, that means you either need more margin on your parameters (equivalent to running on an older process) or accept a higher defect rate near the edge of the die (which you get anyway, since you can only take “similar neighborhood” so far). This goes as far as, for hyper-optimized things like SRAM arrays, leaving an unused row and column at each border of the array.

1 more reply

yannyu1y ago

> I assume they don't pattern the unused area, so the process should be quicker?

The primary driver of time and cost in the fabrication process is the number of layers for the wafers, not the surface area, since all wafers going through a given process are the same size. So you generally want to maximize the number of devices per wafer, because a large part of your costs will be calculated at the per-wafer level, not a per-device level.

2 more replies

pulvinar1y ago

There's also no reason they couldn't pattern that area with some other suitable commodity chips. Like how sawmills and butchers put all cuts to use.

1 more reply

ajb1y ago

Good question. I think the wafer has a cost per area which is fairly significant, but I don't have any figures. There has historically been a push to utilise them more efficiently, e.g. by building fabs that can process larger wafers. Although mask exposure would be per processed area, I think that there is also some proportion of processing time which is per wafer, so the unprocessed area would have an opportunity cost relating to that.

kristjansson1y ago

AIUI Wafer marginal cost is lower than you'd expect. I had $50k in my head, quick google indicates[1] maybe <$20k at AAPL volumes? Regardless seems like the economics for Cerebras would strongly favor yield over wafer area utilization.

[1] https://www.tomshardware.com/tech-industry/tsmcs-wafer-prici...

georgeburdell1y ago

They probably pattern at least next nearest neighbors for local uniformity. That’s just litho though. The rest of the process is done all at once on the wafer

sroussey1y ago

It’s a win if you can use the wafer as opposed to throwing it away.

kristjansson1y ago

A win is a manufacturing process that results in a functioning product. Wafers, etc. aren't so scarce as to demand every mm2 be used on every one every time.

ChuckMcM1y ago· 12 in thread

I think this is an important step, but it skips over that 'fault tolerant routing architecture' means you're spending die space on routes vs transistors. This is exactly analogous to using bits in your storage for error correcting vs storing data.

That said, I think they do a great job of exploiting this technique to create a "larger"[1] chip. And like storage, it benefits from every core being the same, and you don't need to get to every core directly (pin limiting).

In the early 2000's I was looking at a wafer scale startup that had the same idea but they were applying it to an FPGA architecture rather than a set of tensor units for LLMs. Nearly the exact same pitch, "we don't have to have all of our GLUs[2] work because the built in routing only uses the ones that are qualified." Xilinx was still aggressively suing people who put SERDES ports on FPGAs so they were pin limited overall but the idea is sound.

While I continue to believe that many people are going to collectively lose trillions of dollars ultimately pursuing "AI" at this stage, I appreciate that the amount of money people are willing to put at risk here allows folks to try these "out of the box" kinds of ideas.

[1] It is physically more cores on a single die but the overall system is likely smaller, given the integration here.

[2] "Generic Logic Unit" which was kind of an extended LUT with some block RAM and register support.

dogcomplex1y ago

Of course many people are going to collectively lose trillions, AI's a very highly hyped industry with people racing into it without an intellectual edge and any temporary achievement by any one company will be quickly replicated and undercut by another using the same tools. Economic success of the individuals swarming on a new technology is not a guarantee whatsoever, nor is it an indicator of the impact of the technology.

Just like the dotcom bubble, AI is gonna hit, make a few companies stinking rich, and make the vast majority (of both AI-chasing and legacy) companies bankrupt. And it's gonna rewire the way everything else operates too.

idiotsecant1y ago

>it's gonna rewire the way everything else operates too.

This is the part that I think a lot of very tech literate people don't seem to get. I see people all the time essentially saying 'AI is just autocomplete' or pointing out that some vaporware ai company is a scam so surely everyone is.

A lot of it is scams and flash in the pan. But a few of them are going to transform our lives in ways we probably don't even anticipate yet, for good and bad.

1 more reply

ithkuil1y ago

Dollars are not lost; they are just very indirectly invested into gpu makers (and energy providers)

girvo1y ago

> Xilinx was still aggressively suing people who put SERDES ports on FPGAs

This so isn't important to your overall point, but where would I begin to look into this? Sounds fascinating!

ChuckMcM1y ago

Well this was the patent they were threatening with as I recall (https://patents.google.com/patent/US20030023912A1/en) and there was this one too: https://patents.google.com/patent/US5576554A/en

Basically the "secret sauce" of the startup recruiting me was that they were going to do wafer scale FPGAs that could be tiled together to build arbitrarily complex systems like military phased array radars and such. All very hush hush but apparently they had recruited some key talent from Xilinx which was annoying Xilinx.

nroize1y ago

Not OP but I was curious too. Here's all I could find that seemed related: https://www.businesswire.com/news/home/20200121005582/en/Xil...

enragedcacti1y ago

Any thoughts on why they are disabling so many cores in their current product? I did some quick noodling based on the 46/970000 number and the only way I ended up close to 900,000 was by assuming that an entire row or column would be disabled if any core within it was faulty. But doing that gave me a ~6% yield as most trials had active core counts in the high 800,000s
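A rough Monte Carlo sketch of the kind of "noodling" described above. Everything about the model is an assumption made for illustration, not Cerebras's actual scheme: the cores are treated as a 985 x 985 grid (about 970,000), the defect count per wafer is drawn from a Poisson distribution with mean 46, and each defect disables both its entire row and its entire column.

```python
import math
import random

SIDE = 985               # assumption: ~970,000 cores laid out as 985 x 985
MEAN_DEFECTS = 46        # defects per wafer, the figure quoted from the article
TARGET_ACTIVE = 900_000  # advertised active core count


def sample_poisson(mean: float, rng: random.Random) -> int:
    """Draw a Poisson-distributed count using Knuth's multiplication method."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def active_after_defects(defects, side: int = SIDE) -> int:
    """Cores left when every defect disables its whole row and whole column."""
    dead_rows = {row for row, _ in defects}
    dead_cols = {col for _, col in defects}
    return (side - len(dead_rows)) * (side - len(dead_cols))


def simulate_yield(trials: int, rng: random.Random) -> float:
    """Fraction of simulated wafers that keep at least TARGET_ACTIVE cores."""
    good = 0
    for _ in range(trials):
        n = sample_poisson(MEAN_DEFECTS, rng)
        defects = [(rng.randrange(SIDE), rng.randrange(SIDE)) for _ in range(n)]
        if active_after_defects(defects) >= TARGET_ACTIVE:
            good += 1
    return good / trials


if __name__ == "__main__":
    print(f"simulated yield: {simulate_yield(2000, random.Random(0)):.1%}")
```

Under this model, a wafer with exactly 46 defects in distinct rows and columns keeps 939 x 939 = 881,721 cores, which matches the "high 800,000s" described; only wafers that happen to draw noticeably fewer defects clear 900,000.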

ChuckMcM1y ago

I could guess that it helps with heat dissipation/management. But I don't know. That guess is from looking at the list of patents[1] they have.

[1] https://patents.justia.com/assignee/cerebras-systems-inc

projektfu1y ago

They did mention that they stash extra cores to enable the re-routing. Those extra cores are presumably unused when not routed in.

1 more reply

__Joker1y ago

"While I continue to believe that many people are going to collectively lose trillions of dollars ultimately pursuing "AI" at this stage"

Can you please explain more why you think so ?

Thank you.

mschuster911y ago

It's a hype cycle with many of the hypers and deciders having zero idea about what AI actually is and how it works. ChatGPT, while amazing, is at its core a token predictor, it cannot ever get to an AGI level that you'd assume to be competitive to a human, even most animals.

And just as every other hype cycle, this one will crash down hard. The crypto crashes were bad enough but at least gamers got some very cheap GPUs out of all the failed crypto farms back then, but this time so much more money, particularly institutional money, is flowing around AI that we're looking at a repeat of Lehman's once people wake up and realize they've been scammed.

6 more replies

ChuckMcM1y ago

I would guess you're not asking a serious question here but if you were feel free to contact me, it's why I put my email address in my profile.

2 more replies

NickHoff1y ago· 11 in thread

Neat. What about power density?

An H100 has a TDP of 700 watts (for the SXM5 version). With a die size of 814 mm^2 that's 0.86 W/mm^2. If the cerebras chip has the same power density, that means a cerebras TDP of 39.8 kW.

That's a lot. Let's say you cover the whole die area of the chip with water 1 cm deep. How long would it take to boil the water starting from room temperature (20 degrees C)?

amount of water = (die area of 46225 mm^2) * (1 cm deep) * (density of water) = 462 grams

energy needed = (specific heat of water) * (80 kelvin difference) * (462 grams) = 154 kJ

time = 154 kJ / 39.8 kW = 3.9 seconds

This thing will boil (!) a centimeter of water in 4 seconds. A typical consumer water cooler radiator would reduce the temperature of the coolant water by only 10-15 C relative to ambient, and wouldn't like it (I presume) if you pass in boiling water. To use water cooling you'd need some extreme flow rate and a big rack of radiators, right? I don't really know. I'm not even sure if that would work. How do you cool a chip at this power density?
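For anyone who wants to re-run the estimate above, here is a small sketch of the same arithmetic. The specific heat and density of water are textbook values, and the premise that the wafer runs at the H100's power density is the comment's own assumption, not a measured figure.

```python
H100_TDP_W = 700.0
H100_DIE_MM2 = 814.0
WSE_DIE_MM2 = 46_225.0

WATER_DEPTH_MM = 10.0            # 1 cm of water over the whole die
WATER_DENSITY_G_PER_MM3 = 0.001  # 1 g/cm^3
WATER_SPECIFIC_HEAT = 4.186      # J/(g*K), textbook value
DELTA_T_K = 80.0                 # 20 C -> 100 C


def scaled_power_w() -> float:
    """Wafer power if it ran at the H100's power density (hypothetical)."""
    return H100_TDP_W / H100_DIE_MM2 * WSE_DIE_MM2


def water_mass_g() -> float:
    """Mass of a 1 cm layer of water covering the die."""
    return WSE_DIE_MM2 * WATER_DEPTH_MM * WATER_DENSITY_G_PER_MM3


def seconds_to_reach_boiling() -> float:
    """Time to heat the water to 100 C, ignoring all heat losses."""
    energy_j = WATER_SPECIFIC_HEAT * DELTA_T_K * water_mass_g()
    return energy_j / scaled_power_w()


if __name__ == "__main__":
    print(f"power: {scaled_power_w() / 1000:.1f} kW")
    print(f"water: {water_mass_g():.0f} g")
    print(f"time to 100 C: {seconds_to_reach_boiling():.1f} s")
```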

Paul_Clayton1y ago

The enthalpy of vaporization of water (at standard pressure) is listed by Wikipedia[1] as 2.257 kJ/g, so boiling 462 grams would require an additional 1.04 MJ, adding 26 seconds. Cerebras claims a "peak sustained system power of 23kW" for the CS-3 16 Rack Unit system[2], so clearly the power density is lower than for an H100.

[1] https://en.wikipedia.org/wiki/Enthalpy_of_vaporization#Other... [2] https://cerebras.ai/product-system/
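The correction above is easy to reproduce. The sketch below reuses the 39.8 kW hypothetical from the parent comment for the vaporization time, then compares power densities using the 23 kW system figure. Treating the whole 23 kW as if it were dissipated in the wafer is an assumption that overstates the chip's share, so the real density is lower still.

```python
WATER_MASS_G = 462.0
LATENT_HEAT_J_PER_G = 2257.0   # enthalpy of vaporization at standard pressure
HYPOTHETICAL_POWER_W = 39_800.0

WSE_DIE_MM2 = 46_225.0
CS3_SYSTEM_POWER_W = 23_000.0  # "peak sustained system power" quoted above
H100_DENSITY_W_PER_MM2 = 700.0 / 814.0


def extra_seconds_to_vaporize() -> float:
    """Additional time to turn water already at 100 C into steam."""
    return WATER_MASS_G * LATENT_HEAT_J_PER_G / HYPOTHETICAL_POWER_W


def wse_density_upper_bound() -> float:
    """W/mm^2 if the whole system power went through the wafer (upper bound)."""
    return CS3_SYSTEM_POWER_W / WSE_DIE_MM2


if __name__ == "__main__":
    print(f"extra time to vaporize: {extra_seconds_to_vaporize():.0f} s")
    print(f"WSE density (upper bound): {wse_density_upper_bound():.2f} W/mm^2")
    print(f"H100 density:              {H100_DENSITY_W_PER_MM2:.2f} W/mm^2")
```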

twic1y ago

On a tangent: has anyone built an active cooling system which operates in a partial vacuum? At half atmospheric pressure, water boils at around 80 C, which i believe is roughly the operating temperature for a hard-working chip. You could pump water onto the chip, have it vapourise, taking away all that heat, then take the vapour away and condense it at the fan end.

This is how heat pipes work, i believe, but heat pipes aren't pumped, they rely entirely on heat-driven flow. I would have thought there were pumped heat pipes. Are they called something else?

It's also not a refrigerator, because those use a pump to pressurise the coolant in its gas phase, whereas here you would only be pumping the water.
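The "around 80 C at half an atmosphere" figure can be sanity-checked with the Antoine equation. The constants below are an assumption on my part: they are the commonly tabulated set for water between roughly 1 and 100 C, with pressure in mmHg and temperature in Celsius.

```python
import math

# Assumed Antoine constants for water (about 1-100 C, mmHg, Celsius).
ANTOINE_A = 8.07131
ANTOINE_B = 1730.63
ANTOINE_C = 233.426

ATMOSPHERE_MMHG = 760.0


def boiling_point_c(pressure_mmhg: float) -> float:
    """Temperature at which water's vapor pressure equals the given pressure."""
    return ANTOINE_B / (ANTOINE_A - math.log10(pressure_mmhg)) - ANTOINE_C


if __name__ == "__main__":
    for fraction in (1.0, 0.5, 0.25):
        pressure = fraction * ATMOSPHERE_MMHG
        print(f"{fraction:.2f} atm -> boils at {boiling_point_c(pressure):.1f} C")
```

With these constants, half an atmosphere gives a boiling point a little under 82 C, consistent with the estimate above.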

3 more replies

buildbot1y ago

A Very Fancy cooling engine: https://www.eetimes.com/powering-and-cooling-a-wafer-scale-d...

jwan584OP1y ago

A good talk on how Cerebras does power & cooling (8min) https://www.youtube.com/watch?v=wSptSOcO6Vw&ab_channel=Appli...

throwup2381y ago

The machine that actually holds one of their wafers is almost as impressive as the chip itself. Tons of water cooling channels and other interesting hardware for cooling.

flopsamjetsam1y ago

Minor correction, the keynote video says ~20 kW

lostlogin1y ago

If rack mounted, you are ending up with something like a reverse power station.

So why not use it as an energy source? Spin a turbine.

kristjansson1y ago

If you let the chip actually boil enough water to run a turbine you're going to have a hard time keeping the magic smoke inside. Much better to run at reasonable temps and try to recover energy from the waste heat.

1 more reply

renhanxue1y ago

There's a bunch of places in Europe that use waste heat from datacenters in district heating systems. Same thing with waste heat from various industrial processes. It's relatively common practice.

sebzim45001y ago

If my very stale physics is accurate then even with perfect thermodynamic efficiency you would only recover about a third of the energy that you put into the chips.
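The bound being recalled here is the Carnot limit, 1 - T_cold / T_hot with temperatures in kelvin. A quick sketch, where the 20 C ambient and the example hot-side temperatures are illustrative assumptions:

```python
KELVIN_OFFSET = 273.15


def carnot_efficiency(hot_c: float, cold_c: float) -> float:
    """Maximum fraction of heat convertible to work between two reservoirs."""
    hot_k = hot_c + KELVIN_OFFSET
    cold_k = cold_c + KELVIN_OFFSET
    if hot_k <= cold_k:
        raise ValueError("hot side must be warmer than cold side")
    return 1.0 - cold_k / hot_k


def hot_side_for_efficiency(efficiency: float, cold_c: float) -> float:
    """Hot-side temperature (C) needed to reach a given Carnot efficiency."""
    cold_k = cold_c + KELVIN_OFFSET
    return cold_k / (1.0 - efficiency) - KELVIN_OFFSET


if __name__ == "__main__":
    for hot_c in (80.0, 100.0):
        print(f"{hot_c:.0f} C -> {carnot_efficiency(hot_c, 20.0):.1%}")
    print(f"one third needs {hot_side_for_efficiency(1 / 3, 20.0):.0f} C")
```

With a 20 C cold side, a third of the energy would need a hot side near 167 C; at chip-friendly temperatures of 80 to 100 C the ideal limit is closer to 17-21%, so "about a third" is if anything generous.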

1 more reply

bentcorner1y ago

I'm aware of the efficiency losses but I think it would be amusing to use that turbine to help power the machine generating the heat.

1 more reply

IshKebab1y ago· 5 in thread

TSMC also have a manufacturing process used by Tesla's Dojo where you can cut up the chips, throw away the defective ones, and then reassemble working ones into a sort of wafer scale device (5x5 chips for Dojo). Seems like a more logical design to me.

bee_rider1y ago

Is this similar to a chiplet design? Chiplets have been a thing for a while, so I assume Cerebras avoided them on purpose.

IshKebab1y ago

I don't think so - chiplets are much smaller and I think the process is different.

ryao1y ago

I had been under the impression that Nvidia had done something similar here, but they did not talk about deploying the space saving design and instead only talked about the server rack where all of the chips on the mega wafer normally are.

https://www.sportskeeda.com/gaming-tech/what-nvlink72-nvidia...

wmf1y ago

That shield is just a prop that looks nothing like the real product. The NVL72 rack doesn't use any wafer-scale-like packaging.

1 more reply

mhh__1y ago

Amazing. I clicked a button in the azure deployment menu today...

exabrial1y ago· 4 in thread

I have a dumb question. Why isn't silicon sold in cubes instead of cylinders?

amelius1y ago

The silicon ingots have a rotating production process that results in cylinders, not bricks.

exabrial1y ago

fascinating, I figured it was something like that. maybe we should produce hexagonal, instead of square, chip designs

kryptiskt1y ago

Crystalline silicon is produced with the Czochralski process (https://en.wikipedia.org/wiki/Czochralski_method), which produces a round ingot. So you'd have to cut away perfectly fine silicon to make something squarish.

bigmattystyles1y ago

no matter how you orient a circle on a plane, it's the same

Neywiny1y ago· 3 in thread

Understanding that there's inherent bias by them being competitors of the other companies, but still this article seems to make some stretches. If you told me you had an 8% core defect rate reduced 100x, I'd assume you got to close to 99% enablement. The table at the end shows... Otherwise.

They also keep flipping between cores, SMs, dies, and maybe other block sizes. At the end of the day I'm not very impressed. They seemingly have marginally better yields despite all that effort.

sfink1y ago

I think you're missing the point. The comparison is not between 93% and 92%. The comparison is between what they're getting (93%) and what you'd get if you scaled up the usual process to the core size they're using (0%). They are doing something different (namely: a ~whole wafer chip) that isn't possible without massively boosting the intra-chip redundancy. (The usual process stops working once you no longer have any extra dies to discard.)

> Despite having built the world’s largest chip, we enable 93% of our silicon area, which is higher than the leading GPU today.

The important part is building the largest chip. The icing on the top is that the enablement is not lower. Which it would be without the routing-to-spare-cores magic sauce.

And the differing terminology is because they're talking about differing things? You could call an SM a core, but it kind of contains (heterogeneous) cores itself. (I've no idea whether intra-SM cores can be redundant to boost yield.) A die is the part you break off and build a computer out of, it may contain a bunch of cores, a wafer can be broken up into multiple dies but for Cerebras it isn't.

If NVIDIA were to go and build a whole-wafer die, they'd do something similar. But Cerebras did it and got it to work. NVIDIA hasn't gotten into that space yet, so there's no point in building a product that you can't sell to a consumer or even a data center that isn't built around that exact product (or to contain a Balrog).
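One way to see where a figure like 0% comes from is the textbook Poisson yield model, Y = exp(-A * D), where A is die area and D is defect density. This is a simplification and not the article's method: it assumes any single defect kills the whole die, with no redundancy at all. The defect density below is derived from the article's quoted 46 defects over 46,225 mm^2.

```python
import math

# Derived from the article's quoted numbers: 46 defects on 46,225 mm^2.
DEFECT_DENSITY_PER_MM2 = 46 / 46_225


def poisson_yield(die_area_mm2: float,
                  defect_density: float = DEFECT_DENSITY_PER_MM2) -> float:
    """Probability that a die of the given area contains zero defects."""
    return math.exp(-die_area_mm2 * defect_density)


if __name__ == "__main__":
    for name, area in (("0.05 mm^2 core", 0.05),
                       ("814 mm^2 die", 814.0),
                       ("46,225 mm^2 wafer-scale die", 46_225.0)):
        print(f"{name:>28}: {poisson_yield(area):.3g}")
```

Without any redundancy, a defect-free wafer-scale die has a probability on the order of 1e-20, which is zero for practical purposes, while a single tiny core is almost always good. That gap is what the spare-core routing exploits.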

Neywiny1y ago

I think I'll still stand by my viewpoint. They said:

> On the Cerebras side, the effective die size is a bit smaller at 46,225mm2. Applying the same defect rate, the WSE-3 would see 46 defects. Each core is 0.05mm2. This means 2.2mm2 in total would be lost to defects.

So ok they claim that they should see (46225-2.2)/46225 = 99.995%. Doing the same math for their Nvidia numbers it's 99.4%. And yet in practice neither approach got to these numbers. Nowhere near it. I just feel like the whole article talks about all this theory and numbers and math of how they're so much better but in practice it's meaningless.

So what I'm not seeing is why it'd be impossible for all the H100s on a wafer to be interconnected and call it a day. You'd presumably get 92/93 = 98.9% of the performance and, here's the kicker, no need to switch to another architecture. I didn't know where your 0% number came from. Nothing about this article says that a competitor doing the same scaling to wafer scale would get 0%, just a marginal decrease in how many cores made it through fab.

Fundamentally I am not convinced from this article that Cerebras has done something in their design that makes this possible. All I'm seeing is that it'd perform 1% faster.

Edit: thinking a bit more on it, to me it's like they said TSMC has a guy with a sledgehammer who smashes all the wafers and their architecture snaps a tiny bit cleaner. But they haven't said anything about firing the guy with the sledgehammer. Their paragraph before the final table says that this whole exercise is pretty much meaningless because their numbers are made up about competitors and they aren't even the right numbers to be using. Then the table backs up my paraphrase.
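The percentages in this comment can be reproduced as follows. The 58,608 mm^2 total for the H100 wafer is an assumption (72 dies of 814 mm^2, the figure used elsewhere in this thread); the 361 mm^2 and 2.2 mm^2 losses are the article's numbers as quoted above.

```python
def enabled_fraction(total_mm2: float, lost_mm2: float) -> float:
    """Share of silicon area that survives after subtracting defect losses."""
    return 1.0 - lost_mm2 / total_mm2


# Theoretical figures implied by the article's defect math.
wse_theoretical = enabled_fraction(46_225, 2.2)
h100_theoretical = enabled_fraction(58_608, 361)  # assumed total: 72 x 814 mm^2

# Enablement actually reported in the article's final table.
reported_ratio = 0.92 / 0.93

if __name__ == "__main__":
    print(f"WSE theoretical:  {wse_theoretical:.3%}")
    print(f"H100 theoretical: {h100_theoretical:.1%}")
    print(f"92% vs 93%:       {reported_ratio:.1%}")
```

The gap between the theoretical values (above 99%) and the reported 92-93% is the discrepancy this comment is pointing at.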

fspeech1y ago

There is nothing inherently good about wafer scale. It's actually harder to dissipate heat and enable hybrid bonding with DRAM. So the gp is entirely correct that you need to actually show higher silicon utilization to be even considered as being something worthwhile.

bigmattystyles1y ago· 3 in thread

When I was a kid, I used to get intel keychains with a die in acrylic - good job to whoever thought of that to sell the fully defective chips.

dylan6041y ago

wow, fancy with the acrylic. lots of places just place a chip (I'm more familiar with RAM sticks) on a keychain and call it a day.

kragen1y ago

Those aren't just a chip; they're an epoxy package with a leadframe and a chip inside it. To put just a chip on a keychain, you'd have to drill a hole through it, which is difficult because silicon is so brittle—almost like drilling a hole in glass. Then, when someone put it onto a keyring, the keyring would form a lever that applies a massive force to the edge of the brittle hole, shattering the brittle silicon. Potting the chip in acrylic resin is a much cheaper solution that works better.

bigmattystyles1y ago

they're all over eBay, I just checked - the one I was thinking of, that I think I had is going for $150 - the things you get rid of....

1 more reply

abrookewood1y ago· 3 in thread

Looking at the H100 on the left, why is the chip yield (72) based on a circular layout/constraint? Why do they discard all of the other chips that fall outside the circle?

donavanm1y ago

AFAIK all wafer ingots are cylinders, which means the wafers themselves are a circular cross section. So manufacturing is bin-packing rectangles into a circle. Plus there are different effects/defects in the chips based on the distance from the edge of the wafer.

So I believe it's the opposite: why are they representing the larger square and implying lower yield off the wafer in space that doesn't practically exist?

flumpcakes1y ago

Because the circle is the physical silicon. Any chips that fall outside the circle are only part of a full chip. They will be physically missing half the chip.

therealcamino1y ago

That's just the shape of the wafer. I don't know why the diagram continued the grid outside it.

jstrong1y ago· 2 in thread

I would like a workstation with 900k cores. lmk when these things are on ebay.

riskable1y ago

Just need that 20kW connection to your energy provider.

jstrong1y ago

a man can dream

ryao1y ago· 2 in thread

> Take the Nvidia H100 – a massive GPU weighing in at 814mm2. Traditionally this chip would be very difficult to yield economically. But since its cores (SMs) are fault tolerant, a manufacturing defect does not knock out the entire product. The chip physically has 144 SMs but the commercialized product only has 132 SMs active. This means the chip could suffer numerous defects across 12 SMs and still be sold as a flagship part.

Fault tolerance seems to be the wrong term to use here. If I wrote this, I would have written redundant.

jjk1661y ago

Redundant cores lead to a fault tolerant chip.

ryao1y ago

ECC memory is fault tolerant. It repairs issues on the fly without disabling hardware. This on the other hand is merely redundant to handle manufacturing defects. If they make a mistake and ship a bad core that malfunctions at runtime, it is not going to tolerate that.

1 more reply

gunalx1y ago· 1 in thread

My biggest question is who are the buyers?

asdasdsddd1y ago

mostly 1 ai company in the middle east last I heard

wendyshu1y ago· 1 in thread

What's yield?

wmf1y ago

It's the fraction of usable product from a manufacturing process.

wizzard01y ago· 1 in thread

this is an important reminder that all digital electronics is really analog but with good correction circuitry.

and run-time cpu and memory error rates are always nonzero too, though orders of magnitude lower than chip yield rates

nine_k1y ago

CPUs may be very digital inside, but DRAM and flash memory are highly analog, especially MLC flash. DDR4 even has a dedicated training mode [1], during which DRAM and the memory controller learn the quirks of particular data lines and adjust to them, in order to communicate reliably.

[1]: https://www.systemverilog.io/design/ddr4-initialization-and-...

bcatanzaro1y ago· 1 in thread

This is a strange blog post. Their tables say:

Cerebras yields 46225 * .93 = 43000 square millimeters per wafer

NVIDIA yields 58608 * .92 = 54000 square millimeters per wafer

I don't know if their numbers are correct but it is a strange thing for a startup to brag that it is worse than a big company at something important.

saulpw1y ago

Being within striking distance of SOTA while using orders of magnitude fewer resources is worth bragging about.

highfrequency1y ago

To summarize: localize defect contamination to a very small unit size, by making the cores tiny and redundant.

Analogous to a conglomerate wrapping each business vertical in a limited liability veil so that lawsuits and bankruptcy do not bring down the whole company. The smaller the subsidiaries, the less defect contamination but also the less scope for frictionless resource and information sharing.

oksurewhynot1y ago

I live in a small city/large town that has a large number of craft breweries. I always marveled at how these small operations were able to churn out so many different varieties. Turns out they are actually trying to make their few core recipes but the yield is so low they market the less consistent results as...all that variety I was so impressed with.

bee_rider1y ago

> Second, a cluster of defects could overwhelm fault tolerant areas and disable the whole chip.

That’s an interesting point. In architecture class (which was basic and abstract so I’m sure Cerebras is doing something much more clever), we learned that defects cluster, but this is a good thing. A bunch of defects clustering on one core takes out the core, a bunch of defects not clustering could take out… a bunch of cores, maybe rendering the whole chip useless.

I wonder why they don’t like clustering. I could imagine in a network of little cores, maybe enough defects clustered on the network could… sort of overwhelm it, maybe?

Also I wonder how much they benefit from being on one giant wafer. It is definitely cool as hell. But could chiplets eat away at their advantage?

ilaksh1y ago

I assume people are aware, but Cerebras has a web demo and API which is open to try and it is 2000 tokens per second for Llama 3.3 70b and 1000 tokens per second for Llama 3.1 405b.

https://cerebras.ai/inference

anonymousDan1y ago

Very interesting. Am I correct in saying that fault tolerance here is with respect to 'static' errors that occur during manufacturing and are straightforward to detect before reaching the customer? Or can these failures potentially occur later on (and be tolerated) during the normal life of the chip?

aaroninsf1y ago

The number of people ITT this thread who have absorbed the world-weary AI-is-a-bubble skepticism...

I'm just gonna say, with serene certainty,

that the economic order we inhabit is going through a phase change is certain. From certain myopic perspectives we can shoehorn that into a narrative of cyclical patterns in the tech industry or financial markets etc etc.

This is not going to be that. No more than the transformation of American retail can be shoehorned to kind of look like it used to, if you don't know anything at all about what contemporary international trade and logistics and oligopoly actually mean in terms of what is coming into your home from where and why it is or isn't cheap.

Where we'll be in 10, 20, years is literally unimaginable today; and trying to navigate that wrt traditional landmarks... oof.

larsrc1y ago

How do these much smaller cores compare in computing power to the bigger ones? They seem to implicitly claim that a core is a core is a core, but surely one gets something extra out of the much bigger one?

trhway1y ago

56K mm2 vs 46K mm2. I wonder why they wouldn’t use the smart routing/etc to use more fitting shape than square and thus use more of the wafer.

TowerTall1y ago

Ever heard the old joke about an American buyer who told a Japanese manufacturer how many incorrectly made bolts were acceptable per lot of a thousand bolts? Maybe 2 or 3 in 1,000?

So the Japanese didn't have any incorrectly made bolts in their manufacturing process so they just added two or three bad ones to every batch to please the Americans.

ashvardanian1y ago

The AMD comparison may not be accurate. The 96 core AMD CPU takes multiple such dies (eight if I remember correctly) and separate IO chiplets. The total surface area listed should be much larger.

aurareturn1y ago

Bear case on Cerebras: https://irrationalanalysis.substack.com/p/cerebras-cbrso-equ...

Note: This author is heavily invested in Nvidia.

RecycledEle1y ago

IIRC, it was Carl Bruggeman's IPSA Thesis that showed us how to laser out bad cores.

iataiatax101y ago

It is not surprising that they found a solution to the yield problem. Maybe they could elaborate more on the power distribution and dissipation problem?

Fokamul1y ago

Does anyone have a picture of what it looks like inside these servers?

hoseja1y ago

Why square chip? Make it an octagon or something.

lofaszvanitt1y ago

A well written, easy to understand article.

j / k navigate · click thread line to collapse

179 comments

96 comments · 30 top-level

ajb1y ago· 17 in thread

Not sure how that's a win.

Unless the rest of the wafer is useable for some other customer?

nine_k1y ago

If their routing around the defects is automated enough (given the highly regular structure), it may be a massive economy of efforts on testing and packaging the chip.

ungreased06751y ago

Why does it have to be a square? There’s no need to worry about interchangeable third-party heat sink compatibility. Is it possible to make it an irregular polygon instead of square?

kristjansson1y ago

Scaevolus1y ago

Why does their chip have to be rectangular, anyways? Couldn't they cut out a (blocky) circle too?

Qwertious1y ago

That suggests a rectangle is the only possible shape.

2 more replies

nine_k1y ago

guyzero1y ago

I've never cut a wafer, but I assume cutting is hard and single straight lines are the easiest.

1 more reply

yannyu1y ago

1 more reply

olejorgenb1y ago

Is the wafer itself so expensive? I assume they don't pattern the unused area, so the process should be quicker?

addaon1y ago

> I assume they don't pattern the unused area

1 more reply

yannyu1y ago

> I assume they don't pattern the unused area, so the process should be quicker?

2 more replies

pulvinar1y ago

There's also no reason they couldn't pattern that area with some other suitable commodity chips. Like how sawmills and butchers put all cuts to use.

1 more reply

ajb1y ago

kristjansson1y ago

[1] https://www.tomshardware.com/tech-industry/tsmcs-wafer-prici...

georgeburdell1y ago

They probably pattern at least next nearest neighbors for local uniformity. That’s just litho though. The rest of the process is done all at once on the wafer

sroussey1y ago

It’s a win if you can use the wafer as opposed to throwing it away.

kristjansson1y ago

A win is a manufacturing process that results in a functioning product. Wafers, etc. aren't so scarce as to demand every mm2 be used on every one every time.

ChuckMcM1y ago· 12 in thread

[1] It is physically more cores on a single die but the overall system is likely smaller, given the integration here.

[2] "Generic Logic Unit" which was kind of an extended LUT with some block RAM and register support.

dogcomplex1y ago

idiotsecant1y ago

>it's gonna rewire the way everything else operates too.

A lot of it is scams and flash in the pan. But a few of them are going to transform our lives in ways we probably don't even anticipate yet, for good and bad.

ithkuil1y ago

Dollars are not lost; they are just very indirectly invested into gpu makers (and energy providers)

girvo1y ago

> Xilinx was still aggressively suing people who put SERDES ports on FPGAs

This so isn't important to your overall point, but where would I begin to look into this? Sounds fascinating!

ChuckMcM1y ago

Well this was the patent they were threatening with as I recall (https://patents.google.com/patent/US20030023912A1/en) and there was this one too: https://patents.google.com/patent/US5576554A/en

nroize1y ago

Not OP but I was curious too. Here's all I could find that seemed related: https://www.businesswire.com/news/home/20200121005582/en/Xil...

enragedcacti1y ago

ChuckMcM1y ago

I could guess that it helps with heat dissipation/management. But I don't know. That guess is from looking at the list of patents[1] they have.

[1] https://patents.justia.com/assignee/cerebras-systems-inc

projektfu1y ago

They did mention that they stash extra cores to enable the re-routing. Those extra cores are presumably unused when not routed in.

__Joker1y ago

"While I continue to believe that many people are going to collectively lose trillions of dollars ultimately pursuing "AI" at this stage"

Can you please explain more about why you think so?

Thank you.

mschuster911y ago

ChuckMcM1y ago

I would guess you're not asking a serious question here, but if you were, feel free to contact me; it's why I put my email address in my profile.

NickHoff1y ago· 11 in thread

Neat. What about power density?

An H100 has a TDP of 700 watts (for the SXM5 version). With a die size of 814 mm^2 that's 0.86 W/mm^2. If the Cerebras chip has the same power density, that means a Cerebras TDP of 39.8 kW.

That's a lot. Let's say you cover the whole die area of the chip with water 1 cm deep. How long would it take to boil the water starting from room temperature (20 degrees C)?

amount of water = (die area of 46225 mm^2) * (1 cm deep) * (density of water) = 462 grams

energy needed = (specific heat of water) * (80 kelvin difference) * (462 grams) = 154 kJ

time = 154 kJ / 39.8 kW = 3.9 seconds
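The back-of-the-envelope numbers above can be reproduced with a short Python sketch. The H100 TDP, die sizes, water depth, and 80 K temperature rise come from the comment; the specific heat of water (4.186 J/(g·K)) and the density of 1 g/cm^3 are standard textbook values assumed here, and the function names are purely illustrative.

```python
# Rough check of the "how fast would 1 cm of water heat up" estimate.
# Assumption: the wafer runs at the same power density as an H100 (SXM5).

H100_TDP_W = 700.0               # from the comment
H100_DIE_MM2 = 814.0             # from the comment
WAFER_DIE_MM2 = 46225.0          # Cerebras die area, from the comment
WATER_DEPTH_MM = 10.0            # 1 cm of water over the whole die
WATER_DENSITY_G_PER_MM3 = 0.001  # assumed: 1 g/cm^3
SPECIFIC_HEAT_J_PER_G_K = 4.186  # assumed textbook value for liquid water
DELTA_T_K = 80.0                 # 20 C -> 100 C


def power_density_w_per_mm2() -> float:
    """Power density of an H100 die in W/mm^2."""
    return H100_TDP_W / H100_DIE_MM2


def wafer_power_w() -> float:
    """Hypothetical wafer power if it matched the H100 power density."""
    return power_density_w_per_mm2() * WAFER_DIE_MM2


def water_mass_g() -> float:
    """Mass of a 1 cm deep layer of water covering the die."""
    return WAFER_DIE_MM2 * WATER_DEPTH_MM * WATER_DENSITY_G_PER_MM3


def heating_energy_j() -> float:
    """Energy to raise the water to 100 C (no vaporization included)."""
    return SPECIFIC_HEAT_J_PER_G_K * DELTA_T_K * water_mass_g()


def seconds_to_boiling_point() -> float:
    """Time for the wafer's heat output to bring the water to 100 C."""
    return heating_energy_j() / wafer_power_w()


if __name__ == "__main__":
    print(f"power density: {power_density_w_per_mm2():.2f} W/mm^2")
    print(f"wafer power:   {wafer_power_w() / 1000:.1f} kW")
    print(f"water mass:    {water_mass_g():.0f} g")
    print(f"energy:        {heating_energy_j() / 1000:.0f} kJ")
    print(f"time:          {seconds_to_boiling_point():.1f} s")
```

Note that this only covers heating the water to its boiling point; actually boiling it away would also require the latent heat of vaporization, which the estimate leaves out.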

Paul_Clayton1y ago

[1] https://en.wikipedia.org/wiki/Enthalpy_of_vaporization#Other... [2] https://cerebras.ai/product-system/

twic1y ago

This is how heat pipes work, I believe, but heat pipes aren't pumped; they rely entirely on heat-driven flow. I would have thought there were pumped heat pipes. Are they called something else?

It's also not a refrigerator, because those use a pump to pressurise the coolant in its gas phase, whereas here you would only be pumping the water.

buildbot1y ago

A Very Fancy cooling engine: https://www.eetimes.com/powering-and-cooling-a-wafer-scale-d...

jwan584OP1y ago

A good talk on how Cerebras does power & cooling (8min) https://www.youtube.com/watch?v=wSptSOcO6Vw&ab_channel=Appli...

throwup2381y ago

The machine that actually holds one of their wafers is almost as impressive as the chip itself. Tons of water cooling channels and other interesting hardware for cooling.

flopsamjetsam1y ago

Minor correction, the keynote video says ~20 kW

lostlogin1y ago

If rack mounted, you are ending up with something like a reverse power station.

So why not use it as an energy source? Spin a turbine.

kristjansson1y ago

renhanxue1y ago

There's a bunch of places in Europe that use waste heat from datacenters in district heating systems. Same thing with waste heat from various industrial processes. It's relatively common practice.

sebzim45001y ago

If my very stale physics is accurate then even with perfect thermodynamic efficiency you would only recover about a third of the energy that you put into the chips.

bentcorner1y ago

I'm aware of the efficiency losses but I think it would be amusing to use that turbine to help power the machine generating the heat.

IshKebab1y ago· 5 in thread

bee_rider1y ago

Is this similar to a chiplet design? Chiplets have been a thing for a while, so I assume Cerebras avoided them on purpose.

IshKebab1y ago

I don't think so - chiplets are much smaller and I think the process is different.

ryao1y ago

https://www.sportskeeda.com/gaming-tech/what-nvlink72-nvidia...

wmf1y ago

That shield is just a prop that looks nothing like the real product. The NVL72 rack doesn't use any wafer-scale-like packaging.

mhh__1y ago

Amazing. I clicked a button in the azure deployment menu today...

exabrial1y ago· 4 in thread

I have a dumb question. Why isn't silicon sold in cubes instead of cylinders?

amelius1y ago

The silicon ingots have a rotating production process that results in cylinders, not bricks.

exabrial1y ago

fascinating, I figured it was something like that. maybe we should produce hexagonal, instead of square, chip designs

kryptiskt1y ago

bigmattystyles1y ago

no matter how you orient a circle on a plane, it's the same

Neywiny1y ago· 3 in thread

They also keep flipping between cores, SMs, dies, and maybe other block sizes. At the end of the day I'm not very impressed. They seemingly have marginally better yields despite all that effort.

sfink1y ago

> Despite having built the world’s largest chip, we enable 93% of our silicon area, which is higher than the leading GPU today.

The important part is building the largest chip. The icing on the top is that the enablement is not lower. Which it would be without the routing-to-spare-cores magic sauce.

Neywiny1y ago

I think I'll still stand by my viewpoint. They said:

Fundamentally I am not convinced from this article that Cerebras has done something in their design that makes this possible. All I'm seeing is that it'd perform 1% faster.

fspeech1y ago

bigmattystyles1y ago· 3 in thread

When I was a kid, I used to get intel keychains with a die in acrylic - good job to whoever thought of that to sell the fully defective chips.

dylan6041y ago

wow, fancy with the acrylic. lots of places just place a chip (I'm more familiar with RAM sticks) on a keychain and call it a day.

kragen1y ago

bigmattystyles1y ago

they're all over eBay, I just checked - the one I was thinking of, that I think I had is going for $150 - the things you get rid of....

abrookewood1y ago· 3 in thread

Looking at the H100 on the left, why is the chip yield (72) based on a circular layout/constraint? Why do they discard all of the other chips that fall outside the circle?

donavanm1y ago

So I believe it's the opposite: why are they representing the larger square and implying lower yield off the wafer in space that doesn't practically exist?

flumpcakes1y ago

Because the circle is the physical silicon. Any chips that fall outside the circle are only part of a full chip. They will be physically missing half the chip.

therealcamino1y ago

That's just the shape of the wafer. I don't know why the diagram continued the grid outside it.

jstrong1y ago· 2 in thread

I would like a workstation with 900k cores. lmk when these things are on ebay.

riskable1y ago

Just need that 20kW connection to your energy provider.

jstrong1y ago

a man can dream

ryao1y ago· 2 in thread

Fault tolerance seems to be the wrong term to use here. If I wrote this, I would have written redundant.

jjk1661y ago

Redundant cores lead to a fault tolerant chip.

ryao1y ago

gunalx1y ago· 1 in thread

My biggest question is who are the buyers?

asdasdsddd1y ago

mostly 1 ai company in the middle east last I heard

wendyshu1y ago· 1 in thread

What's yield?

wmf1y ago

It's the fraction of usable product from a manufacturing process.

wizzard01y ago· 1 in thread

this is an important reminder that all digital electronics is really analog but with good correction circuitry.

and run-time cpu and memory error rates are always nonzero too, though orders of magnitude lower than chip yield rates

nine_k1y ago

[1]: https://www.systemverilog.io/design/ddr4-initialization-and-...

bcatanzaro1y ago· 1 in thread

This is a strange blog post. Their tables say:

Cerebras yields 46225 * .93 = 43000 square millimeters per wafer

NVIDIA yields 58608 * .92 = 54000 square millimeters per wafer

I don't know if their numbers are correct but it is a strange thing for a startup to brag that it is worse than a big company at something important.
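For anyone who wants to check the arithmetic, here is a minimal Python sketch of the comparison. The areas and enabled fractions are the figures quoted in the comment (not independently verified), and the function name is made up for illustration.

```python
# Enabled silicon per wafer, using the figures quoted in the comment above.

def enabled_area_mm2(total_area_mm2: float, enabled_fraction: float) -> float:
    """Silicon area that ends up usable after disabling defective parts."""
    return total_area_mm2 * enabled_fraction


cerebras_mm2 = enabled_area_mm2(46225.0, 0.93)  # one square wafer-scale die
nvidia_mm2 = enabled_area_mm2(58608.0, 0.92)    # many dies tiled on the wafer

if __name__ == "__main__":
    print(f"Cerebras: {cerebras_mm2:.0f} mm^2 per wafer")
    print(f"NVIDIA:   {nvidia_mm2:.0f} mm^2 per wafer")
    print(f"ratio:    {cerebras_mm2 / nvidia_mm2:.2f}")
```

On these numbers the wafer-scale part ends up with roughly 80% of the enabled area per wafer, which is the gap the comment is pointing at.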

saulpw1y ago

Being within striking distance of SOTA while using orders of magnitude fewer resources is worth bragging about.

highfrequency1y ago

To summarize: localize defect contamination to a very small unit size, by making the cores tiny and redundant.

oksurewhynot1y ago

bee_rider1y ago

> Second, a cluster of defects could overwhelm fault tolerant areas and disable the whole chip.

I wonder why they don’t like clustering. I could imagine in a network of little cores, maybe enough defects clustered on the network could… sort of overwhelm it, maybe?

Also I wonder how much they benefit from being on one giant wafer. It is definitely cool as hell. But could chiplets eat away at their advantage?

ilaksh1y ago

I assume people are aware, but Cerebras has a web demo and API which is open to try and it is 2000 tokens per second for Llama 3.3 70b and 1000 tokens per second for Llama 3.1 405b.

https://cerebras.ai/inference

anonymousDan1y ago

aaroninsf1y ago

The number of people in this thread who have absorbed the world-weary AI-is-a-bubble skepticism...

I'm just gonna say, with serene certainty,

Where we'll be in 10, 20, years is literally unimaginable today; and trying to navigate that wrt traditional landmarks... oof.

larsrc1y ago

trhway1y ago

56K mm2 vs 46K mm2. I wonder why they wouldn’t use the smart routing, etc. to use a better-fitting shape than a square and thus use more of the wafer.

TowerTall1y ago

Ever heard the old joke about an American buyer who told the Japanese manufacturer how many incorrectly made bolts were acceptable per lot of a thousand bolts? Maybe 2 or 3 in 1,000?

So the Japanese didn't have any incorrectly made bolts in their manufacturing process so they just added two or three bad ones to every batch to please the Americans.

ashvardanian1y ago

The AMD comparison may not be accurate. The 96 core AMD CPU takes multiple such dies (eight if I remember correctly) and separate IO chiplets. The total surface area listed should be much larger.

aurareturn1y ago

Bear case on Cerebras: https://irrationalanalysis.substack.com/p/cerebras-cbrso-equ...

Note: This author is heavily invested in Nvidia.

RecycledEle1y ago

IIRC, it was Carl Bruggeman's IPSA Thesis that showed us how to laser out bad cores.

iataiatax101y ago

It's not surprising that they found a solution to the yield problem. Maybe they could elaborate more on the power distribution and dissipation problem?
