undefined | Better HN

0 pointsfrabcus9mo ago0 comments

This seems a bit complicated to me. They don't serve very many models. My assumption is they just dedicate GPUs to specific models, so the model is always in VRAM. No loading per request - it takes a while to load a model in anyway.

The limiting factor compared to local is dedicated VRAM - if you dedicate 80GB of VRAM locally 24 hours/day so response times are fast, you're wasting most of the time when you're not querying.

0 comments

nodja9mo ago

Loading here refers to loading from VRAM to the GPUs core cache, loading from VRAM is extremely slow in terms of GPU time that GPU cores end up idle most of the time just waiting for more data to come in.

frabcusOP9mo ago

Thanks, got it! Think I need a deeper article on this - as comment below says you'd then need to load the request specific state in instead.

j / k navigate · click thread line to collapse

0 comments

nodja9mo ago

frabcusOP9mo ago

Thanks, got it! Think I need a deeper article on this - as comment below says you'd then need to load the request specific state in instead.

j / k navigate · click thread line to collapse