Well, one other massive disclaimer is that the author is backed by OpenAI's Startup Fund, which they failed to disclose in the post.
So of course they would speculate that. This post is essentially a paid marketing piece for OpenAI, the lead investor in Anysphere (creators of Cursor).
Well there is your problem.
Llama 70B quantized to 4 bits fits in 40GB. And it gets similar throughput split between dual consumer GPUs, which likely means much better throughput on a single 40GB A100 (or a cheaper 48GB pro GPU).
https://github.com/turboderp/exllama#dual-gpu-results
And this is without any consideration of batching (which I am not familiar with TBH).
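For what it's worth, the 40GB figure roughly checks out on the back of an envelope. A quick sketch (the 2GB overhead for activations/KV cache is my guess, not a measured number):

```python
def quantized_model_gb(n_params_billion: float, bits_per_weight: float,
                       overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate: weights stored at the quantized bit width,
    plus an assumed fixed overhead for activations / KV cache."""
    weight_gb = n_params_billion * bits_per_weight / 8  # billions of params * bytes per weight
    return weight_gb + overhead_gb

# Llama 70B at 4 bits: ~35 GB of weights, so it squeezes into a 40 GB card
print(quantized_model_gb(70, 4))  # 37.0
```

The same arithmetic shows why bf16 (2 bytes per weight, ~140GB) needs multiple 80GB cards.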
Also, I'm not sure which model was tested, but Llama 70B Chat should perform better than the base model if the prompting syntax is right; the chat format was only recently reverse engineered from Meta's demo implementation.
There are other "perks" to running Llama yourself too, like manually adjusting various generation parameters, constraining output with a custom grammar during generation, and extended context.
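To illustrate what "adjusting generation parameters" buys you: here is a toy, pure-Python sketch of two of the usual knobs, temperature and top-p (nucleus) sampling. The function name and structure are mine; local runners expose the same knobs natively:

```python
import math
import random

def sample_token(logits, temperature=0.8, top_p=0.95, rng=random):
    """Toy sampler: temperature rescales the logits, then top-p keeps only
    the smallest set of tokens whose cumulative probability reaches top_p."""
    scaled = [l / temperature for l in logits]
    # numerically stable softmax
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # sort token indices by probability and keep the nucleus
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # sample within the kept set, renormalized
    kept_mass = sum(probs[i] for i in kept)
    r = rng.random() * kept_mass
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

Low temperature plus a small top-p collapses this to near-greedy decoding; cranking temperature up flattens the distribution. Hosted APIs expose some of this, but local inference lets you touch everything, including grammar-constrained decoding, which APIs at the time didn't offer.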
I believe GPTQ is not much faster than bf16, from skimming the AWQ paper - https://arxiv.org/pdf/2306.00978.pdf
It's 3x faster at a batch size of 1, but that's still over 10x more expensive than gpt-3.5.
For larger batch sizes, bf16 costs dip below those of 3-bit quantization.
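As a sanity check on these cost comparisons, here's the basic per-token arithmetic. The GPU price and throughput numbers below are placeholders I made up for illustration, not benchmarks:

```python
def cost_per_million_tokens(gpu_dollars_per_hour: float,
                            tokens_per_second: float) -> float:
    """Self-hosting cost per 1M generated tokens, from an hourly GPU rate
    and a measured (or assumed) aggregate throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1e6

# Illustrative only: a ~$2/hr 40GB GPU at 30 tok/s (batch size 1)
solo = cost_per_million_tokens(2.0, 30)
# Batching multiplies aggregate throughput for roughly the same hourly cost,
# e.g. 8 concurrent streams at ~25 tok/s each = 200 tok/s aggregate
batched = cost_per_million_tokens(2.0, 200)
print(round(solo, 2), round(batched, 2))
```

This is why batch size dominates the comparison: the hourly GPU cost is fixed, so aggregate tokens/second is the whole game.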
And as said below, whatever throughput you lose is going to be massively offset by the ability to use smaller single GPUs.
And if you're running it on your own machine, then the cost of using Llama is just your electricity bill - you can theoretically run it on 2x 3090s, which are now quite cheap to buy, or on a CPU with enough RAM (but that will be very, very slow).
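The electricity bill is easy to estimate, too. A sketch with placeholder numbers (the 800W draw for a dual-3090 box under load and the $0.15/kWh rate are assumptions, not measurements):

```python
def electricity_cost_per_hour(watts: float, dollars_per_kwh: float = 0.15) -> float:
    """Running cost of a local rig: power draw converted to kWh times rate."""
    return watts / 1000 * dollars_per_kwh

# Hypothetical dual-3090 rig at ~800 W under sustained load
print(round(electricity_cost_per_hour(800), 2))  # 0.12
```

So roughly a dime or two per hour of sustained generation at typical residential rates - the hardware purchase, not the power, is the real cost.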