undefined | Better HN

0 pointsbrucethemoose22y ago0 comments

A single GPU with a batch size of 1 can serve many users, higher batch sizes can serve many dozens, pool a few and you can serve a sizable userbase.

It may not be super profitable, but its not untenable either.

0 comments

minimaxir2y ago

LLMs are GPU compute-bound. If you infer at batch_size = 1 on a model like Llama 2 7B on a "cheap" GPU like a T4 or an L4 it'll use about 100% of the compute, which means you get no benefit from batching.

The exception is the A100 GPU which does not use 100% of GPU compute and therefore you get benefit from batching, but is hella expensive.

The economics are not simple, and in most cases "just use the ChatGPT API" is also the most cost-effective option anyways. A smaller 1.1B model (which would likely not be compute-bound) with similar performance to a 7B model may tip the scales.

brucethemoose2OP2y ago

> LLMs are GPU compute-bound.

From what I understand, they are severely bandwidth bound at a GPU batch size of 1. Even llama.cpp is fairly RAM speed bound on a CPU with much less compute than a GPU.

It's just that batching is quite inefficient without an implementation like this: https://www.anyscale.com/blog/continuous-batching-llm-infere...

two_in_one2y ago

> "cheap" GPU like a T4 or an L4 it'll use about 100% of the compute,

LLM with batch_size=1 technically cannot use '100%' of GPU. Because it has to move a lot of data around and use different blocks of GPU. So, when tensor cores are used cuda cores are idle. Tensor cores are used for matrix multiplication, cuda cores for activation functions (I'm simplifying). Model has to use both at different times moving data between them. Meanwhile GPU monitor may report 100%. But it's still possible to insert another process. I think I've seen this idea in Pytorch docs.

As for 1.1B LLM, it would be nice. Interesting experiment anyway. I'm only afraid that with big and diverse dataset model will focus more on memorization and generic logic may not emerge. They aren't doing anything new in terms of architecture and training methods.

j / k navigate · click thread line to collapse