
0 points · 0cf8612b2e1e · 10mo ago · 0 comments

During times of high utilization, how do they handle more requests than they have hardware? Is the software granular enough that they can round robin the hardware per token generated? UserA token, then UserB, then UserC, back to UserA? Or is it more likely that everyone goes into a big FIFO processing the entire request before switching to the next user?

I assume the former has massive overhead, but maybe it is worthwhile to keep responsiveness up for everyone.
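To make the two options concrete, here is a toy sketch of both orderings. It is not how any provider actually schedules work: the function names, the "one token per step" model, and the token counts are all assumptions for illustration.

```python
from collections import deque


def fifo_schedule(requests):
    """Finish each user's whole request before starting the next one."""
    order = []
    for user, n_tokens in requests.items():
        order.extend([user] * n_tokens)
    return order


def round_robin_schedule(requests):
    """Generate one token per user per turn, cycling until everyone is done."""
    queue = deque((user, n) for user, n in requests.items() if n > 0)
    order = []
    while queue:
        user, remaining = queue.popleft()
        order.append(user)  # one token generated for this user
        if remaining > 1:
            queue.append((user, remaining - 1))  # back of the line
    return order


def first_token_step(order):
    """Step index at which each user receives their first token."""
    first = {}
    for step, user in enumerate(order):
        first.setdefault(user, step)
    return first


# Hypothetical workload: tokens each user still needs.
requests = {"UserA": 3, "UserB": 2, "UserC": 1}

fifo_order = fifo_schedule(requests)
rr_order = round_robin_schedule(requests)
fifo_first = first_token_step(fifo_order)
rr_first = first_token_step(rr_order)
```

Both orderings do the same total work in this toy model. The difference is in `first_token_step`: under FIFO the last user waits for everyone ahead of them to finish, while under round robin every user sees a first token within one pass. The sketch ignores any cost of switching between users, which is the overhead the question is about.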


9 comments · 5 top-level

cornholio · 10mo ago · 2 in thread

Inference is essentially a very complex matrix algorithm run repeatedly on its own output: each time, the input matrix (context window) is shifted and the newly generated tokens are appended to the end. So it's easy to multiplex all active sessions over limited hardware. A typical server can hold hundreds of thousands of active contexts in main system RAM, each less than 500KB, and ferry them to the GPU nearly instantaneously as required.

apitman · 10mo ago

I was under the impression that context takes up a lot more VRAM than this.

cornholio · 10mo ago

The context after application of the algorithm is just text, something like 256k input tokens, each token representing a group of roughly 2-5 characters, encoded into 18-20 bits.

The active context during inference, inside the GPUs, explodes each token into a 12288-dimensional vector, so 4 orders of magnitude more VRAM, and combines it with the model weights, gigabytes in size, across multiple parallel attention heads. The final result is just more textual tokens, which you can easily ferry around main system RAM and send to the remote user.
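A quick back-of-the-envelope check of those figures, as a sketch only. The token count, bits per token, and vector width come from the comment above; reading "256k" as 256,000, taking the upper 20-bit figure, and storing each vector element in 2 bytes (fp16) are assumptions added here.

```python
TOKENS = 256_000          # "256k input tokens" (assumed to mean 256,000)
BITS_PER_TOKEN_ID = 20    # upper end of the 18-20 bit range
HIDDEN_DIM = 12_288       # dimensions per token vector
BYTES_PER_ELEMENT = 2     # assumption: fp16 storage


def token_id_bytes(tokens, bits_per_token):
    """Size of the context when stored as packed token IDs."""
    return tokens * bits_per_token / 8


def expanded_bytes(tokens, dim, bytes_per_element):
    """Size of one dim-wide vector per token."""
    return tokens * dim * bytes_per_element


compact = token_id_bytes(TOKENS, BITS_PER_TOKEN_ID)
expanded = expanded_bytes(TOKENS, HIDDEN_DIM, BYTES_PER_ELEMENT)
ratio = expanded / compact

print(f"token IDs: {compact / 1e3:.0f} KB")
print(f"expanded:  {expanded / 1e9:.2f} GB")
print(f"ratio:     {ratio:.0f}x")
```

Under these assumptions a completely full window is about 640 KB as token IDs and about 6.3 GB as one vector per token, a ratio of roughly 9,800x, which lines up with the "4 orders of magnitude" estimate. This counts only a single vector per token. A real key/value cache stores keys and values per layer, so the true VRAM footprint depends on the model architecture.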

computomatic · 10mo ago · 2 in thread

This is great product design at its finest.

First of all, they never “handle more requests than they have hardware.” That’s impossible (at least as I’m reading it).

The vast majority of usage is via their web app (and free accounts, at that). The web app defaults to “auto” selecting a model. The algorithm for that selection is hidden information.

As load peaks, they can divert requests to different levels of hardware and less resource hungry models.

Only a very small minority of requests actually specify the model to use.

There are a hundred similar product design hacks they can use to mitigate load. But this seems like the easiest one to implement.
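A minimal sketch of what load-based "auto" selection could look like. The real selection algorithm is hidden, so everything here is hypothetical: the `pick_model` function, the tier names, and the load thresholds are invented for illustration.

```python
# Hypothetical tiers: (maximum load fraction, model to use up to that load).
TIERS = [
    (0.60, "large"),
    (0.85, "medium"),
    (1.00, "small"),
]


def pick_model(requested_model, load, tiers=TIERS):
    """Return the model that should serve a request.

    requested_model: an explicit model name, or None for "auto".
    load: current utilization as a fraction between 0.0 and 1.0.
    """
    if requested_model is not None:
        # The small minority that names a model gets what it asked for.
        return requested_model
    for max_load, model in tiers:
        if load <= max_load:
            return model
    # Above every threshold: fall back to the cheapest tier.
    return tiers[-1][1]


quiet = pick_model(None, 0.30)
busy = pick_model(None, 0.80)
peak = pick_model(None, 0.97)
explicit = pick_model("large", 0.97)
```

The point of the sketch is that "auto" requests absorb the load shedding: as utilization rises they slide down the tiers, while requests that explicitly name a model are untouched.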

addaon · 10mo ago

> But this seems like the easiest one to implement.

Even easier: Just fail. In my experience the ChatGPT web page fails to display (request? generate?) a response between 5% and 10% of the time, depending on time of day. Too busy? Just ignore your customers. They’ll probably come back and try again, and if not, well, you’re billing them monthly regardless.

nocturnes · 10mo ago

Is this a common experience for others? In several years of reasonable ChatGPT use I have only experienced that kind of failure a couple of times.


parentheses · 10mo ago

They probably do lots of tricks like using quantized or distilled models during times of high load. They also have a sizeable number of free users, who will be the first to get rate limited.

the8472 · 10mo ago

During peaks they can kick out background jobs like model training or API users doing batch jobs.
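As a rough sketch of that idea, assume a fixed number of hardware slots and a simple priority order. The job kinds, the `PRIORITY` table, and the `admit` function are all hypothetical, not a description of any provider's scheduler.

```python
# Hypothetical priorities: a lower number means more important.
PRIORITY = {"interactive": 0, "api_batch": 1, "training": 2}


def admit(running, job, capacity):
    """Try to place `job`, given as (kind, name), on the hardware.

    Returns (new_running, loser), where `loser` is the job that was
    evicted or turned away, or None if there was a free slot.
    """
    if len(running) < capacity:
        return running + [job], None
    # Full: find the least important job currently running.
    victim = max(running, key=lambda j: PRIORITY[j[0]])
    if PRIORITY[victim[0]] > PRIORITY[job[0]]:
        kept = list(running)
        kept.remove(victim)  # kick out the background job
        return kept + [job], victim
    # Nothing less important is running, so the newcomer waits or fails.
    return list(running), job


running = [("training", "t1"), ("api_batch", "b1")]
running, lost1 = admit(running, ("interactive", "chat1"), capacity=2)
running, lost2 = admit(running, ("interactive", "chat2"), capacity=2)
running, lost3 = admit(running, ("interactive", "chat3"), capacity=2)
```

In this run the training job is evicted first, then the batch job, and only once the hardware is full of interactive work does a new interactive request get turned away.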

vikramkr · 10mo ago

In addition to stuff like that, they also handle it with rate limits, with that message Claude would throw almost all the time (something like "demand is high so you have automatically switched to concise mode"), and by making batch inference cheaper for API customers to convince them to use that instead of real-time replies. The site erroring out during a period of high demand also works, as does prioritizing business customers during a rollout or just letting the service degrade. It's not like any provider has a track record of effortlessly keeping responsiveness super high. Usually it's more the opposite.
