Show HN: I made a GPU VRAM calculator for transformer-based models

(vram.asmirnov.xyz)

135 points · furiousteabag · 2y ago · 38 comments

samspenc · 2y ago · 6 in thread

Consumer-grade GPUs like Nvidia's 3090 and 4090 max out at 24 GB VRAM, and those cost $1000-2000 each. You can get more VRAM, but you need enterprise GPUs, which are in the five figures, easily starting at $30K a pop.

Per this calculator, for training, only gpt2-large and gpt2-medium would work with those two top-of-the-line GPUs.

For inference it's certainly a bit better: only Llama-2-70b-hf and Llama-2-13b-hf don't fit in that much VRAM; all the other models do.
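A minimal sketch of the kind of check being described here, for inference weights only. It assumes 2 bytes per parameter (fp16/bf16), takes parameter counts from the model names, treats a 24 GB card as 24 GiB, and ignores activations and the KV cache; the function names are hypothetical and this is not the calculator's actual method.

```python
# Rough check: do a model's weights alone fit on one card?
# Assumptions (not taken from the calculator): 2 bytes per parameter,
# nominal parameter counts, no activations or KV cache counted.
GIB = 2**30


def weights_gib(n_params: float, bytes_per_param: float = 2.0) -> float:
    """Size of the weights in GiB."""
    return n_params * bytes_per_param / GIB


def fits(n_params: float, vram_gib: float = 24.0, bytes_per_param: float = 2.0) -> bool:
    """True if the weights alone fit in the given amount of VRAM."""
    return weights_gib(n_params, bytes_per_param) <= vram_gib


if __name__ == "__main__":
    for name, n in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
        print(f"{name}: {weights_gib(n):.1f} GiB of weights, fits in 24 GiB: {fits(n)}")
```

Under these assumptions a 13B model needs slightly more than 24 GiB for weights alone, which is consistent with it not fitting on a 3090 or 4090 at 16-bit precision.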

alexhutcheson · 2y ago

Nvidia’s workstation cards are available with more RAM than the consumer cards, at a lower price than the datacenter cards. RTX 6000 Ada has 48 GB VRAM and retails for $6800, and RTX 5000 Ada has 32 GB VRAM and retails for $4000[1].

Very large models have to be distributed across multiple GPUs though, even if you’re using datacenter chips like H100s.

[1] https://store.nvidia.com/en-us/nvidia-rtx/store/

slabity · 2y ago

Other than power consumption, is there any reason to prefer a single workstation card over multiple consumer cards then?

A single $6800 RTX 6000 Ada with 48GB of VRAM vs 6x 7900XTX with a combined total of 144GB of VRAM honestly makes this seem like a no brainer to me.

mciancia · 2y ago

Used 3090s are going for ~$600 these days (at least in Europe) thanks to the crypto mining crash. Building a workstation with 2 of these is fairly easy for 48GB of VRAM; with 4 it's a bit more tricky but still doable and affordable IMO.

ngoro7bd · 2y ago

Recently bought a 24GB 3090 for $700 in the US. Used but never tweaked, and it has run stable for 6 months despite heavy workloads.

Nvidia's play seems obvious. Game graphics don’t move that fast these days. A used market flush with 3090s and below is fine by them while they focus on extracting top dollar from fast-moving AI researchers/VCs.

namibj · 2y ago

You can easily use pipeline-parallelism though. Especially if you have 8-16 lanes of PCIe4 with direct P2P access between the cards.

IIRC you want micro-batching though, to overlap pipeline phases.
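To illustrate why micro-batching matters for pipeline parallelism, here is a rough sketch using the standard idle-time ("bubble") estimate for a simple GPipe-style schedule: with p stages and m micro-batches, the idle fraction is (p - 1) / (m + p - 1). It assumes every stage takes the same time per micro-batch and ignores transfer costs over PCIe; the function name is hypothetical.

```python
def bubble_fraction(stages: int, micro_batches: int) -> float:
    """Fraction of time each GPU sits idle in a simple pipeline schedule.

    Assumes equal per-stage compute time and no communication cost.
    """
    if stages < 1 or micro_batches < 1:
        raise ValueError("stages and micro_batches must be >= 1")
    return (stages - 1) / (micro_batches + stages - 1)


if __name__ == "__main__":
    # 4 cards in a pipeline, with an increasing number of micro-batches.
    for m in (1, 4, 16, 64):
        print(f"4 stages, {m:>2} micro-batches: {bubble_fraction(4, m):.0%} idle")
```

With a single batch and 4 stages, each card is idle three quarters of the time; splitting the batch into more micro-batches lets the stages overlap and shrinks that fraction.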

nox100 · 2y ago

I haven't a clue how they compare, but a Mac Studio with an M2 Ultra can get 192GB of unified RAM for $5700 (PS: not a Mac fan, just curious).

a_wild_dandan · 2y ago · 5 in thread

Are people still rawdoggin' 16-bit models? I almost exclusively use 5-bit inference quants (or 8-bit natives like Yi-34b) on my MacBook Pro. Tiny accuracy loss, runs fast, and leaves plenty of (V)RAM on the table. Mixtral 8x7B is my new daily driver, and only takes like 40GB to run! I wonder if I could run two of them talking to each other...

rubatuga · 2y ago

Pure 16bit is horrible for training, sorry.

rdedev · 2y ago

Doesn't using bf16 alleviate the problem? At least I've had success training a BERT-like model from scratch.

bigdict · 2y ago

Hmm, what do you mean? I thought bf16 is used extensively for LLM training.

chrsig · 2y ago

How does one rawdog a 16-bit model?

kkzz99 · 2y ago

Usually, for efficiency, you use quantized models. Quantized models reduce the number of bits used for each parameter, saving disk space and reducing RAM usage.
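As a rough illustration of the saving, weight memory scales linearly with bits per parameter. The sketch below uses a hypothetical 34-billion-parameter model and ignores the small overhead real quantization formats add for scales and zero-points, so actual files are somewhat larger.

```python
def weight_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight size in GB (10**9 bytes), ignoring format overhead."""
    return n_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    n_params = 34e9  # hypothetical parameter count, for illustration only
    for bits in (16, 8, 5):
        print(f"{bits:>2}-bit: {weight_gb(n_params, bits):.2f} GB")
```

Going from 16-bit to 5-bit weights cuts the footprint to 5/16 of the original, which is why a quantized model can fit on hardware that the 16-bit version cannot.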

roseway4 · 2y ago · 3 in thread

While not as pretty (and mobile-friendly) as the original link, the calculators below support modeling LoRA-based training, alongside full finetuning.

https://huggingface.co/spaces/Vokturz/can-it-run-llm

https://rahulschand.github.io/gpu_poor/

ComputerGuru · 2y ago

They seem to be broken when I try any HF ids besides what came preconfigured. e.g. just tried brucethemoose/Yi-34B-200K-DARE-merge-v5-3.1bpw-exl2-fiction or LoneStriker/shisa-7b-v1-3.0bpw-h6-exl2

3abiton · 2y ago

Been looking for something like this for a while! I googled a lot, and this link never popped up. I feel Google search is regressing.

icelancer · 2y ago

Second link hasn't been working for a while.

ilaksh · 2y ago · 2 in thread

Does this have an option for quantization levels? Don't think I saw it.

furiousteabag (OP) · 2y ago

There is no option to select a quantized version yet. Will work on that!

ComputerGuru · 2y ago

I second the request for quantization, e.g. for exl2.

a2128 · 2y ago · 1 in thread

I noticed the default parameter count value is 1.418 billion, but if you erase it you can't actually enter it back, because you can't type a decimal point in the input area. Also, you can't enter parameter counts smaller than 1 billion.

sp332 · 2y ago

It works if you type the digits first and then insert the decimal point after.

thatguysaguy · 2y ago · 1 in thread

This only lists first moments, but Adam stores estimates of first and second moments.

furiousteabag (OP) · 2y ago

By default, SGD with momentum is selected as the optimizer. Try selecting Adam and it will list second moments as well.
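A small sketch of why the optimizer choice changes the estimate: SGD with momentum keeps one extra buffer per parameter, while Adam keeps two (first and second moment estimates). The example assumes optimizer state is stored in fp32 (4 bytes per value) and uses the 1.418 billion default mentioned elsewhere in the thread; the table and function names are hypothetical, not the calculator's code.

```python
# Number of per-parameter state buffers each optimizer keeps.
STATE_BUFFERS = {
    "sgd": 0,           # plain SGD keeps no extra state
    "sgd_momentum": 1,  # momentum buffer
    "adam": 2,          # first and second moment estimates
}


def optimizer_state_bytes(n_params: int, optimizer: str, bytes_per_value: int = 4) -> int:
    """Bytes of optimizer state, assuming fp32 state by default."""
    return n_params * STATE_BUFFERS[optimizer] * bytes_per_value


if __name__ == "__main__":
    n_params = 1_418_000_000
    for opt in STATE_BUFFERS:
        gb = optimizer_state_bytes(n_params, opt) / 1e9
        print(f"{opt:>12}: {gb:.2f} GB of optimizer state")
```

Under these assumptions Adam needs exactly twice the optimizer-state memory of SGD with momentum for the same model.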

cchance · 2y ago

Very nice. It would be cool to have a little "i" icon next to each field to explain what each thing is for newer users (batch size, etc.).

lgkk · 2y ago

On mobile (specifically Safari on iOS), your dropdowns are hard to use. I’m not able to dismiss the keyboard. Is that an issue on my end?

_giorgio_ · 2y ago

What is the usual way to do it inside the python file that defined the model?

twayt · 2y ago

This is actually pretty useful
