Really?? That runs terribly for me. I also have 64GB of RAM, but meh. It gets so bad as soon as I can no longer offload everything; the tokens literally drizzle in. With full offloading they appear faster than I can read (Llama 3 8B at 8-bit quant, on a Radeon Pro VII with 16GB of HBM2!)
Oh man, I hate to say it, but it's likely your AMD card. Yes, they can run LLMs and SD, just badly. Larger models are usable for me with partial offloading, but you're right that fully loading the model into VRAM is really preferable.
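If it helps, here's roughly what the two setups look like with llama-cpp-python (a minimal sketch, assuming a ROCm/HIP build of llama.cpp for the AMD card; the GGUF filename and layer count are placeholders you'd tune for your hardware):

```python
from llama_cpp import Llama

# Full offload: n_gpu_layers=-1 pushes every layer to the GPU.
# Fast, but the whole model plus KV cache has to fit in VRAM
# (roughly 9-10 GB for an 8B model at 8-bit quant).
llm_full = Llama(
    model_path="llama-3-8b-instruct.Q8_0.gguf",  # hypothetical filename
    n_gpu_layers=-1,
)

# Partial offload: only the first N layers live in VRAM; the rest
# run on the CPU out of system RAM. Generation speed drops roughly
# in proportion to how many layers stay on the CPU, which is the
# "tokens drizzle in" effect.
llm_partial = Llama(
    model_path="llama-3-8b-instruct.Q8_0.gguf",
    n_gpu_layers=20,  # tune down until VRAM stops overflowing
)

out = llm_full("Q: Why is full offload faster? A:", max_tokens=64)
print(out["choices"][0]["text"])
```

Same idea with the llama.cpp CLI is just the `-ngl` / `--n-gpu-layers` flag. Either way, the cliff between "everything in VRAM" and "anything spilling to RAM" is exactly what you're both describing.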