Deploying Llama3 70B on AWS – GPU Requirement, Cost and Step-by-Step Guide

(slashml.com)

3 points · JJneid · 2y ago · 5 comments

5 comments · 1 top-level

rini17 · 2y ago · 4 in thread

Note that quantized versions of llama3 70B can be run on CPU on a much cheaper server. I am personally using it via llama.cpp on a bare-metal 6-core Xeon CPU with 128 GB of RAM for ~50 euro monthly.
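
A rough back-of-the-envelope check of why a 128 GB machine is enough: the sketch below estimates the size of the weights alone at a few bit widths. The 70e9 parameter count is the nominal "70B" figure, the function name is hypothetical, and the bit widths are nominal values chosen for illustration. Actual llama.cpp quantization formats store extra scale data, so real files are somewhat larger, and the KV cache needs memory on top of the weights.

```python
def quantized_weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GiB at a given bit width."""
    bytes_total = n_params * bits_per_weight / 8
    return bytes_total / 2**30


N_PARAMS = 70e9   # nominal "70B" parameter count (assumption, not an exact figure)
RAM_GIB = 128     # RAM of the server described in the comment, treated as GiB

# Nominal bit widths for illustration; real quantization formats add overhead.
for label, bits in [("16-bit", 16.0), ("8-bit", 8.0), ("4-bit", 4.0)]:
    size = quantized_weight_gib(N_PARAMS, bits)
    verdict = "fits" if size < RAM_GIB else "does not fit"
    print(f"{label}: ~{size:.0f} GiB of weights, {verdict} in {RAM_GIB} GiB")
```

Under these assumptions the unquantized 16-bit weights slightly exceed 128 GiB, while 8-bit and 4-bit versions leave room for the KV cache and the operating system.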

JJneid (OP) · 2y ago

Is inference speed an issue for you?

rini17 · 2y ago

Sufficient for fluent conversation.

JJneid (OP) · 2y ago

Usually performance takes a hit with quantization. Are you getting quality responses?

rini17 · 2y ago

Since llama3, yes, quite satisfying.
