Not sure how that's a win.
Unless the rest of the wafer is useable for some other customer?
If their routing around the defects is automated enough (given the highly regular structure), it may save a massive amount of effort on testing and packaging the chip.
That suggests a rectangle is the only possible shape.
I’m out of date on this stuff, so it’s possible things have changed, but I wouldn’t make that assumption. It is (used to be?) standard to pattern the entire wafer, with partially-off-the-wafer dice around the edges of the circle. The reason for this is that etching behavior depends heavily on the surrounding area: the amount of silicon or copper or whatever etched in your neighborhood affects the speed of etching for you, which affects line width. For a single mask used for the whole wafer, that means you either need more margin on your parameters (equivalent to running on an old process) or accept a higher defect rate right near the edge of the die (which you get anyway, since you can only take “similar neighborhood” so far). This goes as far as, for hyper-optimized things like SRAM arrays, leaving an unused row and column at each border of the array.
The primary driver of time and cost in the fabrication process is the number of layers for the wafers, not the surface area, since all wafers going through a given process are the same size. So you generally want to maximize the number of devices per wafer, because a large part of your costs will be calculated at the per-wafer level, not a per-device level.
[1] https://www.tomshardware.com/tech-industry/tsmcs-wafer-prici...
That said, I think they do a great job of exploiting this technique to create a "larger"[1] chip. And like storage, it benefits from the fact that every core is the same and you don't need to get to every core directly (which would make it pin limited).
In the early 2000's I was looking at a wafer scale startup that had the same idea but they were applying it to an FPGA architecture rather than a set of tensor units for LLMs. Nearly the exact same pitch, "we don't have to have all of our GLUs[2] work because the built in routing only uses the ones that are qualified." Xilinx was still aggressively suing people who put SERDES ports on FPGAs so they were pin limited overall but the idea is sound.
While I continue to believe that many people are going to collectively lose trillions of dollars ultimately pursuing "AI" at this stage, I appreciate that the amount of money people are willing to put at risk here allows folks to try these "out of the box" kinds of ideas.
[1] It is physically more cores on a single die but the overall system is likely smaller, given the integration here.
[2] "Generic Logic Unit" which was kind of an extended LUT with some block RAM and register support.
Just like the dotcom bubble, AI is gonna hit, make a few companies stinking rich, and make the vast majority (of both AI-chasing and legacy) companies bankrupt. And it's gonna rewire the way everything else operates too.
This is the part that I think a lot of very tech literate people don't seem to get. I see people all the time essentially saying 'AI is just autocomplete' or pointing out that some vaporware AI company is a scam so surely all of them are.
A lot of it is scams and flash in the pan. But a few of them are going to transform our lives in ways we probably don't even anticipate yet, for good and bad.
This so isn't important to your overall point, but where would I begin to look into this? Sounds fascinating!
Basically the "secret sauce" of the startup recruiting me was that they were going to do wafer scale FPGAs that could be tiled together to build arbitrarily complex systems like military phased array radars and such. All very hush hush but apparently they had recruited some key talent from Xilinx which was annoying Xilinx.
[1] https://patents.justia.com/assignee/cerebras-systems-inc
Can you please explain more about why you think so?
Thank you.
And just like every other hype cycle, this one will crash down hard. The crypto crashes were bad enough, but at least gamers got some very cheap GPUs out of all the failed crypto farms back then. This time so much more money, particularly institutional money, is flowing around AI that we're looking at a repeat of Lehman once people wake up and realize they've been scammed.
An H100 has a TDP of 700 watts (for the SXM5 version). With a die size of 814 mm^2 that's 0.86 W/mm^2. If the Cerebras chip has the same power density, that means a Cerebras TDP of 39.8 kW.
That's a lot. Let's say you cover the whole die area of the chip with water 1 cm deep. How long would it take to boil the water starting from room temperature (20 degrees C)?
amount of water = (die area of 46225 mm^2) * (1 cm deep) * (density of water) = 462 grams
energy needed = (specific heat of water) * (80 kelvin difference) * (462 grams) = 154 kJ
time = 154 kJ / 39.8 kW = 3.9 seconds
This thing will boil (!) a centimeter of water in 4 seconds. A typical consumer water cooler radiator would reduce the temperature of the coolant water by only 10-15 C relative to ambient, and wouldn't like it (I presume) if you pass in boiling water. To use water cooling you'd need some extreme flow rate and a big rack of radiators, right? I don't really know. I'm not even sure if that would work. How do you cool a chip at this power density?
[1] https://en.wikipedia.org/wiki/Enthalpy_of_vaporization#Other... [2] https://cerebras.ai/product-system/
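The back-of-the-envelope estimate above can be re-run as a short script. This is only a sketch of the same arithmetic: the H100 and die-area figures are the ones quoted in the comment, while the density and specific heat of water (1 g/cm^3 and 4.186 J/(g*K)) are textbook values assumed here, and the Cerebras power figure is still the commenter's "same power density" assumption, not a published TDP.

```python
# Re-running the boiling-time estimate from the comment above.
H100_TDP_W = 700.0            # SXM5 version, as quoted
H100_DIE_MM2 = 814.0          # as quoted
WSE_DIE_MM2 = 46_225.0        # as quoted
WATER_DEPTH_MM = 10.0         # 1 cm of water over the whole die
DELTA_T_K = 80.0              # 20 C up to 100 C

# Assumed physical constants (not stated in the comment).
WATER_DENSITY_G_PER_CM3 = 1.0
SPECIFIC_HEAT_J_PER_G_K = 4.186

# Power density of the H100, then scaled to the wafer-scale die area.
power_density_w_per_mm2 = H100_TDP_W / H100_DIE_MM2
wse_power_w = power_density_w_per_mm2 * WSE_DIE_MM2

# Volume in mm^3 -> cm^3 (divide by 1000), then mass in grams.
water_mass_g = WSE_DIE_MM2 * WATER_DEPTH_MM / 1000.0 * WATER_DENSITY_G_PER_CM3

# Energy to heat the water to boiling (ignores the heat of vaporization).
energy_j = SPECIFIC_HEAT_J_PER_G_K * DELTA_T_K * water_mass_g
time_s = energy_j / wse_power_w

print(f"power density: {power_density_w_per_mm2:.2f} W/mm^2")
print(f"assumed wafer power: {wse_power_w / 1000:.1f} kW")
print(f"water mass: {water_mass_g:.0f} g, energy: {energy_j / 1000:.0f} kJ")
print(f"time to reach 100 C: {time_s:.1f} s")
```

Note that this only brings the water up to 100 C. Actually boiling it off would take several times more energy because of the heat of vaporization, which the estimate leaves out.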
This is how heat pipes work, I believe, but heat pipes aren't pumped; they rely entirely on heat-driven flow. I would have thought there were pumped heat pipes. Are they called something else?
It's also not a refrigerator, because those use a pump to pressurise the coolant in its gas phase, whereas here you would only be pumping the water.
So why not use it as an energy source? Spin a turbine.
https://www.sportskeeda.com/gaming-tech/what-nvlink72-nvidia...
They also keep flipping between cores, SMs, dies, and maybe other block sizes. At the end of the day I'm not very impressed. They seemingly have marginally better yields despite all that effort.
> Despite having built the world’s largest chip, we enable 93% of our silicon area, which is higher than the leading GPU today.
The important part is building the largest chip. The icing on the top is that the enablement is not lower. Which it would be without the routing-to-spare-cores magic sauce.
And the differing terminology is because they're talking about differing things? You could call an SM a core, but it kind of contains (heterogeneous) cores itself. (I've no idea whether intra-SM cores can be redundant to boost yield.) A die is the part you break off and build a computer out of, it may contain a bunch of cores, a wafer can be broken up into multiple dies but for Cerebras it isn't.
If NVIDIA were to go and build a whole-wafer die, they'd do something similar. But Cerebras did it and got it to work. NVIDIA hasn't gotten into that space yet, so there's no point in building a product that you can't sell to a consumer or even a data center that isn't built around that exact product (or to contain a Balrog).
> On the Cerebras side, the effective die size is a bit smaller at 46,225mm2. Applying the same defect rate, the WSE-3 would see 46 defects. Each core is 0.05mm2. This means 2.2mm2 in total would be lost to defects.
So ok they claim that they should see (46225-2.2)/46225 = 99.995%. Doing the same math for their Nvidia numbers it's 99.4%. And yet in practice neither approach got to these numbers. Nowhere near it. I just feel like the whole article talks about all this theory and numbers and math of how they're so much better but in practice it's meaningless.
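For reference, the gap between the theoretical figure and the claimed one can be made explicit with the numbers quoted in this thread. This is a sketch only: `enabled_fraction` is a hypothetical helper, and it uses the simplified model from the quoted article in which each defect costs exactly one core.

```python
# Figures quoted from the article in the comments above.
WSE_DIE_AREA_MM2 = 46_225.0
WSE_DEFECTS = 46
CORE_AREA_MM2 = 0.05
QUOTED_LOST_AREA_MM2 = 2.2     # the article's own rounded figure
CLAIMED_ENABLED = 0.93         # "we enable 93% of our silicon area"


def enabled_fraction(die_area_mm2: float, lost_area_mm2: float) -> float:
    """Fraction of the die still usable after losing lost_area_mm2 to defects."""
    return (die_area_mm2 - lost_area_mm2) / die_area_mm2


# Using the article's quoted 2.2 mm^2 of lost area.
theoretical = enabled_fraction(WSE_DIE_AREA_MM2, QUOTED_LOST_AREA_MM2)

# Recomputing the lost area as defects * core area (2.3 mm^2) barely changes it.
recomputed = enabled_fraction(WSE_DIE_AREA_MM2, WSE_DEFECTS * CORE_AREA_MM2)

gap = theoretical - CLAIMED_ENABLED

print(f"theoretical enabled fraction: {theoretical:.3%}")
print(f"recomputed from 46 defects:   {recomputed:.3%}")
print(f"claimed enabled fraction:     {CLAIMED_ENABLED:.0%}")
print(f"gap: {gap * 100:.1f} percentage points")
```

The defect model accounts for a loss of well under a hundredth of a percent, so nearly all of the roughly seven-point gap to the claimed 93% must come from something the quoted math does not cover.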
So what I'm not seeing is why it'd be impossible for all the H100s on a wafer to be interconnected and call it a day. You'd presumably get 92/93 = 98.9% of the performance and, here's the kicker, no need to switch to another architecture. I didn't know where your 0% number came from. Nothing about this article says that a competitor doing the same scaling to wafer scale would get 0%, just a marginal decrease in how many cores made it through fab.
Fundamentally I am not convinced from this article that Cerebras has done something in their design that makes this possible. All I'm seeing is that it'd perform 1% faster.
Edit: thinking a bit more on it, to me it's like they said TSMC has a guy with a sledgehammer who smashes all the wafers and their architecture snaps a tiny bit cleaner. But they haven't said anything about firing the guy with the sledgehammer. Their paragraph before the final table says that this whole exercise is pretty much meaningless because their numbers are made up about competitors and they aren't even the right numbers to be using. Then the table backs up my paraphrase.
So I believe it's the opposite: why are they representing the larger square and implying lower yield off the wafer in space that doesn't practically exist?
Fault tolerance seems to be the wrong term to use here. If I wrote this, I would have written redundant.
and run-time cpu and memory error rates are always nonzero too, though orders of magnitude lower than chip yield rates
[1]: https://www.systemverilog.io/design/ddr4-initialization-and-...
Cerebras yields 46225 * .93 = 43000 square millimeters per wafer
NVIDIA yields 58608 * .92 = 54000 square millimeters per wafer
I don't know if their numbers are correct but it is a strange thing for a startup to brag that it is worse than a big company at something important.
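The two products above are easy to check. A minimal sketch, taking the per-wafer areas and enablement percentages exactly as quoted in this comment (whether those inputs are right is a separate question):

```python
# Per-wafer silicon area and enabled fraction, as quoted in the comment above.
WAFER_AREA_MM2 = {"Cerebras": 46_225.0, "NVIDIA": 58_608.0}
ENABLED_FRACTION = {"Cerebras": 0.93, "NVIDIA": 0.92}

# Usable area = patterned area per wafer * fraction of silicon enabled.
usable_mm2 = {
    name: WAFER_AREA_MM2[name] * ENABLED_FRACTION[name]
    for name in WAFER_AREA_MM2
}

ratio = usable_mm2["Cerebras"] / usable_mm2["NVIDIA"]

for name, area in usable_mm2.items():
    print(f"{name}: {area:,.0f} mm^2 usable per wafer")
print(f"Cerebras / NVIDIA usable area: {ratio:.2f}")
```

On these inputs the difference comes almost entirely from the starting area (a square die cut from a round wafer versus many small dies tiled across it), not from the one-point difference in enablement.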
Analogous to a conglomerate wrapping each business vertical in a limited liability veil so that lawsuits and bankruptcy do not bring down the whole company. The smaller the subsidiaries, the less defect contamination but also the less scope for frictionless resource and information sharing.
That’s an interesting point. In architecture class (which was basic and abstract so I’m sure Cerebras is doing something much more clever), we learned that defects cluster, but this is a good thing. A bunch of defects clustering on one core takes out the core, a bunch of defects not clustering could take out… a bunch of cores, maybe rendering the whole chip useless.
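The textbook intuition about clustering can be illustrated with a toy model. Everything here is an assumption for illustration, not how Cerebras or any fab models defects: square cores of 0.05 mm^2 on a 215 mm x 215 mm die, 46 point defects, and a hypothetical cluster 0.2 mm wide.

```python
import math
import random

DIE_SIDE_MM = 215.0                 # 215 mm x 215 mm = 46,225 mm^2
CORE_SIDE_MM = math.sqrt(0.05)      # square 0.05 mm^2 core (shape is assumed)
N_DEFECTS = 46
CLUSTER_HALF_WIDTH_MM = 0.1         # hypothetical: cluster narrower than one core


def cores_killed(defects):
    """Count distinct cores containing at least one defect at (x, y) in mm."""
    return len({(int(x // CORE_SIDE_MM), int(y // CORE_SIDE_MM)) for x, y in defects})


def scattered_defects(rng):
    """Defects placed independently and uniformly over the whole die."""
    return [
        (rng.uniform(0, DIE_SIDE_MM), rng.uniform(0, DIE_SIDE_MM))
        for _ in range(N_DEFECTS)
    ]


def clustered_defects(rng):
    """All defects placed in one small square around a random center."""
    h = CLUSTER_HALF_WIDTH_MM
    cx = rng.uniform(h, DIE_SIDE_MM - h)
    cy = rng.uniform(h, DIE_SIDE_MM - h)
    return [(cx + rng.uniform(-h, h), cy + rng.uniform(-h, h)) for _ in range(N_DEFECTS)]


rng = random.Random(0)  # fixed seed so the run is repeatable
scattered_killed = cores_killed(scattered_defects(rng))
clustered_killed = cores_killed(clustered_defects(rng))

print(f"scattered: {scattered_killed} cores lost to {N_DEFECTS} defects")
print(f"clustered: {clustered_killed} cores lost to {N_DEFECTS} defects")
```

Because the assumed cluster is narrower than a core, it can straddle at most four cores, while scattered defects almost always land in separate cores. With cores this small, losing a few dozen either way is negligible, which may be part of why the two cases matter less here than for a monolithic die.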
I wonder why they don’t like clustering. I could imagine in a network of little cores, maybe enough defects clustered on the network could… sort of overwhelm it, maybe?
Also I wonder how much they benefit from being on one giant wafer. It is definitely cool as hell. But could chiplets eat away at their advantage?
I'm just gonna say, with serene certainty,
the economic order we inhabit going through phase change is certain. From certain myopic perspectives we can shoehorn that into a narrative of cyclical patterns in the tech industry or financial markets etc etc.
This is not going to be that. No more than the transformation of American retail can be shoehorned to kind of look like it used to, if you don't know anything at all about what contemporary international trade and logistics and oligopoly actually mean in terms of what is coming into your home from where and why it is or isn't cheap.
Where we'll be in 10, 20, years is literally unimaginable today; and trying to navigate that wrt traditional landmarks... oof.
So the Japanese didn't have any incorrectly made bolts in their manufacturing process so they just added two or three bad ones to every batch to please the Americans.
Note: This author is heavily invested in Nvidia.