If you're using an efficient inference engine like vLLM, you're also adding compilation into the mix, and not all of that is fully cached yet.
If that kind of latency isn't acceptable to you, you have to keep the models loaded.
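Here's a minimal sketch of what "keep the model loaded" looks like in practice with vLLM's offline API: pay the load and compile cost once at process start, then reuse the resident engine for every request. The model name and sampling settings are placeholders, not a recommendation.

```python
from vllm import LLM, SamplingParams

# Load (and compile) once, at process startup -- not per request.
# Model name is a placeholder; swap in whatever you're actually serving.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=256)

def handle(prompt: str) -> str:
    # Every call reuses the already-loaded weights and any cached
    # compilation artifacts, so there's no cold-start penalty.
    outputs = llm.generate([prompt], params)
    return outputs[0].outputs[0].text
```

The catch, of course, is that "keep it loaded" means the GPU memory is spoken for around the clock, whether or not anyone is sending requests.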
This (along with batching, which only pays off when there are enough concurrent requests to amortize the cost over) is why large local models are a dumb and wasteful idea if you're not serving them at enterprise scale.