This repo[2] by Meta achieves 48% MFU, or 80k tokens/second.
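For anyone wondering how MFU and tokens/second relate: a common rule of thumb is that training takes roughly 6 FLOPs per parameter per token, so MFU is achieved FLOPs over peak hardware FLOPs. A minimal sketch, where the 1B parameter count and the 1 PFLOP/s peak figure are assumptions I picked to make the numbers line up, not values from the repo:

```python
def mfu(params: float, tokens_per_sec: float, peak_flops: float) -> float:
    """Model FLOPs utilization via the ~6*N FLOPs-per-token rule of thumb."""
    achieved_flops = 6 * params * tokens_per_sec
    return achieved_flops / peak_flops

# Assumed: a 1B-parameter model at 80k tokens/s on hardware
# with a 1e15 FLOP/s (1 PFLOP/s) peak.
print(mfu(1e9, 80_000, 1e15))  # → 0.48, i.e. 48% MFU
```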
(1T tokens / 63k tokens per second) / (60 seconds per minute * 60 minutes per hour)
is approx 4400 hours.
So I guess that’s how the calculation went.
Or did you mean a source for the number of tokens per second?
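The arithmetic above can be checked in a couple of lines (the 1T-token count and 63k tokens/s rate are taken from the comment itself):

```python
tokens = 1e12            # 1T training tokens
tokens_per_sec = 63_000  # throughput from the comment above

seconds = tokens / tokens_per_sec
hours = seconds / 3600   # 60 s/min * 60 min/h
print(round(hours))      # → 4409, i.e. approx 4400 hours
```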
Hope these AI PCs will also run something better than a 1B model.
What is it useful for? Spellcheck?
As a chip maker, they will also have some undersold, QA-rejected, or otherwise wasted parts available for these training efforts, so the capex is likely less severe for them than for a random startup betting on AMD.
AMD has great hardware, but they never could be assed to do anything about their software.
Which means you can run larger models, but they'll become ever slower.
It seems actual domain-specific usefulness (say a specific programming language, translation, etc.) starts at around 3B models.