We're sharing open-source scripts and an end-to-end infrastructure setup guide that details the process of getting everything working reliably, and keeping it that way.
This is one part of a three-part toolkit on training a 70b model from scratch. The other two sections focus on evaluations and CARBS, our hyperparameter optimizer; you can find them here: https://imbue.com/research/70b-intro/
Thoughts and questions welcome! :)
Two things I'm curious about: first, what difference, if any, would you imagine in training a 400b parameter model? It seems you have plenty of VRAM across the cluster, but I'd like to hear your take.
Second, do you think this sort of architecture is the end game for model training? It seems sooo fragile. Are there better shared training mechanisms/architectures? Are there better cluster geometries?
Thanks again - great read.
Cool stuff! Does this do RLHF or just pretraining? If the latter, how did you manage to beat GPT-4?
It's an unusual enough sentence to be remarkable, and I thought, "I've read this exact same sentence before." Indeed, this and most of the writeup appeared on Twitter, LinkedIn, and Reddit, seemingly word for word. Is this just spam?
https://x.com/imbue_ai/status/1805629547473518695
https://reddit.com/r/learnmachinelearning/comments/1dobgbs/t...
https://www.linkedin.com/posts/mattboulos_training-a-70b-mod...
This is a very normal workflow: You write a full-length text detailing the project you worked on. You then trim it down to a summary which you share with a group of people X. You then trim it down into a different summary which you share with a group of people Y.
When you do this multiple times, you unsurprisingly end up with some sentences that make it into multiple summaries, because they're that important to the thesis!
(Also, the summaries on Twitter and Reddit aren't anything close to "most of the writeup"—the full text is 6000+ words!)
Am I right in understanding that's over $100 million worth of GPUs?
I wonder what part of this, if any, will ever be within the realm of an enthusiast with a gaming-PC budget, and when.
Interesting to hear about all the problems they ran into!
Ha! I guess most of the readers (who don't have that much funding) should jump to the next HN submission.
Thank you for sharing all this. One of the more directly useful posts.
That was a good episode, worth listening to for hearing justifications behind some of these decisions.
I'm not used to conducting these kinds of interviews and felt out of my depth. Please suggest questions that you felt should have been asked but weren't.
Some open questions I have:
1) Why did you choose to set up your own cluster? How was the experience with your cloud partner regarding faulty machines/switches?
2) Which choices in the cluster architecture have proven the most valuable (apart from the all-to-all comms)?
3) Can you share a bit more about your logging infra, beyond the fact that it was Loki-based?
4) What necessitated the use of a local Docker registry? Did you use other images apart from nvidia-container-runtime?
Thanks!
Edit: To be clearer: if CPU work is bottlenecking training, you want to optimize it as much as possible by preprocessing your data and tweaking your training scripts. What I'm discussing here is the gap between "fast enough" and "faster":
CPU is not fast enough for training < CPU is exactly fast enough for training < CPU is faster than needed for training
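To make the distinction concrete, here's a rough sketch (not from the article; all names and timings are made up for illustration) that simulates a CPU data pipeline feeding a GPU step through a prefetch queue. Once the CPU side is at least as fast as the GPU step, making it even faster buys you essentially nothing, because the GPU never waits:

```python
import time
from queue import Queue
from threading import Thread

def cpu_preprocess(batch_id, work_s):
    # stand-in for CPU-side work (tokenization, augmentation, etc.)
    time.sleep(work_s)
    return batch_id

def run_pipeline(n_batches, cpu_s, gpu_s, prefetch=4):
    """Return total time the 'GPU' spent waiting on the CPU pipeline."""
    q = Queue(maxsize=prefetch)

    def producer():
        for i in range(n_batches):
            q.put(cpu_preprocess(i, cpu_s))

    t = Thread(target=producer)
    t.start()
    stall = 0.0
    for _ in range(n_batches):
        start = time.monotonic()
        q.get()              # blocks only if the CPU can't keep up
        stall += time.monotonic() - start
        time.sleep(gpu_s)    # stand-in for one GPU training step
    t.join()
    return stall

# CPU faster than the GPU step: the queue stays full, near-zero stall.
fast_cpu_stall = run_pipeline(20, cpu_s=0.001, gpu_s=0.005)
# CPU slower than the GPU step: the GPU waits on nearly every batch.
slow_cpu_stall = run_pipeline(20, cpu_s=0.005, gpu_s=0.001)
```

In the first case, shaving CPU time further wouldn't speed up training at all; in the second, every CPU millisecond saved is a training millisecond gained.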
Another problem is that all the hardware, drivers, and expertise for GPU work are on PC. It would take a lot of work to get things running on ARM, since you'd be starting from scratch, and then more work to make them stable. All that to save a little on the processor.
They're working on "self-coding". Do you mean no-code or minimal-code solutions, or something else?
There are also quite a few articles people may be interested in on their website: https://imbue.com/our-work/
Those things were of course characterised by the ability to split the work into pretty self-contained work packages. Not sure if that can be done with model training.
Except Voltage Park, being smaller, is probably more willing to provide some customized setup.
Indeed, they may even see it as a learning opportunity for when they rent similar setups to other customers.
Oops, don't tell them I told you.
EDIT: Sorry Dogecoin, thanks for the tip!
I'd like to see the difference in performance on spelling and rhymes.