It links to a Tom's Hardware article (https://www.tomshardware.com/news/teslas-dollar300-million-a...) from August 28 that says "Tesla is about to flip the switch on its new AI cluster, featuring 10,000 Nvidia H100 compute GPUs" and "Tesla is set to launch its highly-anticipated supercomputer on Monday..." (presumably the September 1 event).
So, like, does Tesla actually have 10k H100s? Or do they have an order for 10k H100s? Or an intention to buy 10k H100s?
Is the sole source for these articles this (https://twitter.com/SawyerMerritt/status/1696011140508045660) random Twitter post by some guy who runs an online clothing company?
I don't mean to snipe, but this article doesn't seem to rise to the extremely high editorial standards of such tech-press luminaries as "TechRadar" and "Hacker News".
If you had scrolled just a little bit on that Twitter post you linked, you would've seen these:
https://x.com/sawyermerritt/status/1696012091964915744
https://x.com/tim_zaman/status/1695488119729238147
Also, just FYI: Sawyer posts most of the Tesla and SpaceX breaking news on Twitter before major outlets even write their articles.
For example, here's one from just 12 minutes ago, as confirmed by Elon: https://x.com/sawyermerritt/status/1728092021628313777
A “random Twitter post by some guy who runs an online clothing company” is definitely a wrong assumption.
I don't see those when I scroll. I see
"Buckle up everyone, the acceleration of progress is about to get nutty!"
and this is the end of the post?
Maybe I'm misusing this thing?
> https://x.com/tim_zaman/status/1695488119729238147
So another guy who claims to be a Tesla employee says (again, strangely, in the future tense) that this is true? I mean, I am willing to believe--'cause he paid $20 for a blue check--that he probably is a Tesla employee.
But the use of future tense is a bit weird, right? And the lack of any followup?
> A “random Twitter post by some guy who runs an online clothing company” is definitely a wrong assumption.
I guess I'm old. Back in my day, "evidence" wasn't some random dude's online posts. But I know things have changed. ;)
==
More seriously:
https://www.hpcwire.com/2023/08/17/nvidia-h100-are-550000-gp... says Nvidia is producing 550k H100s in 2023. And there's obviously a significant lead-time requirement.
So, yes, I can sorta imagine Tesla pre-ordered about 2% of the global supply of H100s early in 2023 (back-of-envelope below) and was bragging about it at the end of August just 'cause.
But I can also imagine this is smoke and mirrors, and they have, like, a handful with the rest on backorder, and we haven't heard more about it 'cause Tesla doesn't have marketing people, it just has wahoos who post things on Twitter.
Either way, I guess?
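For what it's worth, here's that 2% figure spelled out, treating HPCwire's 550k as a rough production estimate rather than a confirmed total:

    # Rough inputs: 10k GPUs per the article, 550k per HPCwire's estimate.
    tesla_gpus = 10_000
    estimated_2023_h100s = 550_000
    print(f"{tesla_gpus / estimated_2023_h100s:.1%} of estimated 2023 supply")
    # -> 1.8%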
I've never worked inside one of the leading edge AI companies like OpenAI, Google, Microsoft or Meta.
Is this comparable to what they would work with?
My first guess is that it seems much smaller. And if you are running many parallel training jobs then you are getting about 1,000 chips at most to work with.
Or is this about what the leading competitors are working with?
Azure, for one, seems to have orders of magnitude more chips at their disposal.
That said, these aren't just GPUs. They're whole chassis: huge onboard storage arrays, terabytes of RAM, 800G networking (and the associated cables), racks, cooling, power distribution, backup power, etc...
None of it is easy.
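To put rough numbers on "not easy", here's a sketch assuming DGX-H100-style nodes (8 GPUs and ~10 kW per node, per Nvidia's published figures). Tesla's actual node count, fabric, and power budget aren't public, so this is illustrative only:

    # Assumed topology: DGX-H100-style nodes (8 GPUs, ~10.2 kW max each).
    # The real cluster's layout is not public.
    total_gpus = 10_000
    gpus_per_node = 8
    node_power_kw = 10.2

    nodes = total_gpus // gpus_per_node
    it_power_mw = nodes * node_power_kw / 1000
    print(f"~{nodes} nodes, ~{it_power_mw:.1f} MW of IT power before cooling")
    # -> ~1250 nodes and roughly 13 MW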
I don't know much about AV processing; it's highly customized to only a few customers, but I'd expect it to have very large computational requirements for video processing and reinforcement learning.
I can imagine they either underestimated the software effort needed to squeeze as much performance as possible out of those things, or they underestimated the pace at which Nvidia scales FLOPS/$, or both.
What would you say grants you the standing to opine here?
Previous article: https://www.tomshardware.com/news/teslas-dollar300-million-a...
This is second-hand blogspam.
I was curious why this statement led with fp64 flops (instead of fp32, perhaps), but I looked up the H100 specs, and NV's marketing page does the same thing. They're obviously talking about the H100 SXM here, whose peak theoretical fp64 tensor-core throughput matches its fp32 throughput. The cluster perf is estimated by multiplying the GPU perf by 10k.
Also, obviously, int8 tensor ops aren't 'FLOPS'. Nvidia quotes those as 'TOPS' (tera operations per second). There's a separate line for tensor-core floating point, e.g. the TF32 figures.
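To spell the multiplication out, using what I believe are the H100 SXM spec-sheet peaks (dense TFLOPS; the fp64 tensor-core figure is the one that matches fp32):

    # Per-GPU peaks as I read Nvidia's H100 SXM spec sheet (dense TFLOPS).
    h100_sxm_tflops = {"fp64": 34, "fp64_tensor": 67, "fp32": 67}

    for kind, tflops in h100_sxm_tflops.items():
        print(f"{kind}: {tflops * 10_000 / 1000:.0f} PFLOPS across 10k GPUs")
    # fp64: 340 PFLOPS is presumably where the headline number comes from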
Also, back in the day, integer ops were just called 'ops', grumble grumble. But yeah, FLOPS specifically refers to floating point. Leading with int8 throughput doesn't make much sense to me anyway, since tensor cores were meant for matrix-operation speedup, and those matrices are rarely integer.
> knowing you can drop to high precision when necessary without penalty is nice.
I guess I maybe don't know why you'd ever have 1:1 fp32 and fp64 perf. Isn't an fp64 multiplier (for example) basically 4x an fp32 multiplier? I'm under the possibly naive impression that if you have all the transistors for one fp64 core, you have all the transistors you need for two or four fp32 cores. Maybe that's not true today, but there does have to be at least 2x the transistors overall for 64-bit vs 32-bit, and lots of those should be shared or reusable, no? It doesn't seem quite right to frame naturally higher 32-bit op throughput as a "penalty" on 64-bit ops. You're asking the hardware to do more with 64 bits, and it makes complete sense that, given the exact same budget for bandwidth, energy, memory, compute, etc., 32-bit ops would go faster, no? If the op throughput of fp64 and fp32 is the same, doesn't that imply the fp32 ops are being wasted / penalized, just for the sake of having matching numbers?
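For what it's worth, that 4x rule of thumb does fall out of significand widths, under the (rough) assumption that multiplier-array area grows with the square of significand width:

    # fp32 has a 24-bit significand, fp64 a 53-bit one (implicit bit included).
    # Multiplier-array area ~ width^2 is an approximation that ignores
    # exponent logic, normalization, and wiring.
    fp32_sig, fp64_sig = 24, 53
    ratio = (fp64_sig / fp32_sig) ** 2
    print(f"fp64 multiplier ~ {ratio:.1f}x the area of an fp32 one")  # ~4.9x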
I see mention of using this supercomputer for training models. Is that the only purpose? What other types of things do orgs usually do with these supercomputers?
Are there any good boots-on-the-ground technical blogs that provide interesting detail on day-to-day experiences with these things?
In other words, they're used when you want to share some kind of state across all of the computers, without the potential overhead of communicating with some other system like a database.
Physics simulations and molecular modeling come to mind as common examples.
In the case of ML training, the shared state is the model parameters, plus the deltas calculated during training that get broadcast to every worker.
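If it helps, here's a toy single-process sketch of that pattern; real clusters do the averaging with NCCL/MPI over the network fabric, and the "gradient" here is just a stand-in:

    # Each "worker" computes a gradient on its own data shard, then an
    # all-reduce averages the gradients so every replica applies the
    # same parameter update. Toy stand-in for NCCL/MPI all-reduce.
    def local_gradient(params, shard):
        # placeholder for backprop on this worker's mini-batch
        return [p * 0.01 + x for p, x in zip(params, shard)]

    def all_reduce_mean(per_worker_grads):
        n = len(per_worker_grads)
        return [sum(g) / n for g in zip(*per_worker_grads)]

    params = [0.5, -0.2, 1.0]
    shards = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # one shard per worker

    grads = [local_gradient(params, s) for s in shards]
    avg = all_reduce_mean(grads)
    params = [p - 0.1 * g for p, g in zip(params, avg)]  # identical everywhere
    print(params)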
The cost of finding the next prime is likely into the millions now.
Weird to think that his next company's compute platform is this.
That seems like a bold claim. Google, Microsoft, and Meta make so much more money than Tesla that if making AI chips were that easy, they could clearly out-design and out-build Tesla without thinking too hard about it.
What makes you think Tesla, a company with far fewer AI workers, less AI knowledge, and far less money than the above companies, can out-design and out-build them?
Wow, that's some really early hardware access. /s