We're sharing open-source scripts and an end-to-end infrastructure setup guide that details the process of getting everything working reliably, and keeping it that way.
This is one part of a three-part toolkit on training a 70b model from scratch. The other two sections focus on evaluations and CARBS, our hyperparameter optimizer; you can find them here: https://imbue.com/research/70b-intro/
Thoughts and questions welcome! :)
Two things I'm curious about: first, what difference, if any, would you imagine in training a 400b parameter model? It seems you have plenty of VRAM across the cluster, but I'd like to hear your take.
Second, do you think this sort of architecture is the end game for model training? It seems sooo fragile. Are there better shared training mechanisms/architectures? Are there better cluster geometries?
Thanks again - great read.
Cool stuff! Does this do RLHF or just pretraining? If the latter, how did you manage to beat GPT-4?
It's an unusual enough sentence to be remarkable, and I thought, "I've read this exact same sentence before." Indeed, this and most of the writeup appeared on Twitter, LinkedIn, and Reddit, seemingly word for word. Is this just spam?
https://x.com/imbue_ai/status/1805629547473518695
https://reddit.com/r/learnmachinelearning/comments/1dobgbs/t...
https://www.linkedin.com/posts/mattboulos_training-a-70b-mod...
This is a very normal workflow: You write a full-length text detailing the project you worked on. You then trim it down to a summary which you share with a group of people X. You then trim it down into a different summary which you share with a group of people Y.
When you do this multiple times, you unsurprisingly end up with some sentences that make it into multiple summaries, because they're that important to the thesis!
(Also, the summaries on Twitter and Reddit aren't anything close to "most of the writeup"—the full text is 6000+ words!)
Am I right in understanding that's over $100 million worth of GPUs?
I wonder what part of this, if any, will ever be within the realm of an enthusiast with a gaming-PC budget, and when.
Interesting to hear about all the problems they ran into!
Ha! I guess most of the readers (who don't have that much funding) should jump to the next HN submission.
Thank you for sharing all this. One of the more directly useful posts.
That was a good episode, worth listening to for hearing justifications behind some of these decisions.
I'm not used to conducting these kinds of interviews and felt out of my depth. Please suggest questions that you felt should have been asked but weren't.
Some open questions I have:
1) Why did you choose to set up your own cluster? How was the experience with your cloud partner regarding faulty machines/switches?
2) Which choices in the cluster architecture have proven the most valuable (apart from the all-to-all comms)?
3) Can you share a bit more about your logging infra, beyond the fact that it was Loki-based?
4) What necessitated the use of a local Docker registry? Did you use other images apart from nvidia-container-runtime?
Thanks!
Edit: To be clearer: if CPU work is bottlenecking training, you want to optimize it as much as possible by preprocessing your data and tweaking your training scripts. What I'm discussing here is the gap between "fast enough" and "faster":
CPU is not fast enough for training < CPU is exactly fast enough for training < CPU is faster than needed for training
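To make the distinction concrete, here's a rough sketch (not from the article; all names and timings are made up for illustration) that simulates a CPU data pipeline feeding a GPU step through a prefetch queue. Once the CPU side is at least as fast as the GPU step, making it even faster buys you essentially nothing, because the GPU never waits:

```python
import time
from queue import Queue
from threading import Thread

def cpu_preprocess(batch_id, work_s):
    # stand-in for CPU-side work (tokenization, augmentation, etc.)
    time.sleep(work_s)
    return batch_id

def run_pipeline(n_batches, cpu_s, gpu_s, prefetch=4):
    """Return total time the 'GPU' spent waiting on the CPU pipeline."""
    q = Queue(maxsize=prefetch)

    def producer():
        for i in range(n_batches):
            q.put(cpu_preprocess(i, cpu_s))

    t = Thread(target=producer)
    t.start()
    stall = 0.0
    for _ in range(n_batches):
        start = time.monotonic()
        q.get()              # blocks only if the CPU can't keep up
        stall += time.monotonic() - start
        time.sleep(gpu_s)    # stand-in for one GPU training step
    t.join()
    return stall

# CPU faster than the GPU step: the queue stays full, near-zero stall.
fast_cpu_stall = run_pipeline(20, cpu_s=0.001, gpu_s=0.005)
# CPU slower than the GPU step: the GPU waits on nearly every batch.
slow_cpu_stall = run_pipeline(20, cpu_s=0.005, gpu_s=0.001)
```

In the first case, shaving CPU time further wouldn't speed up training at all; in the second, every CPU millisecond saved is a training millisecond gained.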
Another problem is that all the hardware, drivers, and expertise for GPU work are on PC. It would take a lot of work to get things running on ARM, since you'd be starting from scratch, and then more work to make them stable. All that to save a little on the processor.
They're working on "self-coding". Do you mean no-code or minimal-code solutions, or something else?
There are also quite a few articles people may be interested in on their website: https://imbue.com/our-work/
Those things were of course characterised by the ability to split the work into pretty self-contained work packages. Not sure if that can be done with model training.
Except Voltage Park, being smaller, is probably more willing to provide some customized setup.
Indeed, they may even see it as a learning opportunity for when they rent similar setups to other customers.
Oops, don't tell them I told you.
EDIT: Sorry Dogecoin, thanks for the tip!
I'd like to see the difference in performance on spelling and rhymes.