Estimating required GPU memory for serving LLMs

(substratus.ai)

2 points · samosx · 2y ago · 2 comments



samosx (OP) · 2y ago

Having a hard time estimating how much GPU memory an LLM needs for serving? Not sure which GPUs to use, or how many?

I wrote a blog post to demystify how to estimate GPU memory usage.
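For readers who want the gist without clicking through, here is a rough sketch of this kind of back-of-the-envelope estimate. It assumes a common rule of thumb rather than quoting the post: weight memory = parameter count × bytes per parameter, times a ~1.2 overhead factor for activations, CUDA context and framework buffers. The 1.2 factor and the function names are assumptions for illustration, not measured values.

```python
import math


def estimate_serving_memory_gb(num_params_billion: float,
                               bits_per_param: int = 16,
                               overhead: float = 1.2) -> float:
    """Rough GPU memory (GB) needed to serve a model.

    Assumption: weights dominate; `overhead` is a rule-of-thumb multiplier
    covering activations, CUDA context and framework buffers.
    """
    bytes_per_param = bits_per_param / 8          # e.g. 16-bit -> 2 bytes
    weight_bytes = num_params_billion * 1e9 * bytes_per_param
    return weight_bytes * overhead / 1e9          # convert bytes to GB


def gpus_needed(required_gb: float, gpu_memory_gb: float) -> int:
    """Minimum number of identical GPUs whose combined memory covers the estimate."""
    return math.ceil(required_gb / gpu_memory_gb)


if __name__ == "__main__":
    # Example: a 70B-parameter model served in 16-bit precision on 80 GB GPUs
    need = estimate_serving_memory_gb(70, 16)
    print(f"{need:.1f} GB -> {gpus_needed(need, 80)} GPU(s)")
```

Quantizing to 4 bits shrinks the same 70B estimate by 4x, which is often the difference between needing several GPUs and fitting on one. Note that splitting a model across GPUs adds its own overhead, so the GPU count is a lower bound.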

brianjking · 2y ago

My issue is figuring out how to identify how many concurrent users you can support on average on a given GPU.

Understanding the VRAM needed just to load the weights is easy enough. But when you allow something like content generation with varying input/output token lengths, how do you even begin to figure out which GPUs you need?
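One rough way to approach the concurrency question, assuming the KV cache is the main per-request memory cost: every token in flight stores one key and one value vector per layer per KV head, so bytes per token = 2 × layers × KV heads × head dim × bytes per element. Dividing the memory left after loading the weights by (bytes per token × tokens per request) gives an upper bound on concurrent requests. The model shape in the example (32 layers, 32 KV heads, head dim 128, fp16) is a hypothetical Llama-2-7B-like configuration, and the 10% reserve for activations and fragmentation is an assumed figure.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_elem: int = 2) -> int:
    # Factor 2: one key vector and one value vector per layer per KV head
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_concurrent_requests(gpu_memory_gb: float, weights_gb: float,
                            tokens_per_request: int, num_layers: int,
                            num_kv_heads: int, head_dim: int,
                            bytes_per_elem: int = 2,
                            reserve_fraction: float = 0.1) -> int:
    """Upper bound on concurrent requests, limited only by KV-cache memory.

    tokens_per_request: prompt tokens + generated tokens (worst case per request).
    reserve_fraction: hypothetical headroom for activations and fragmentation.
    """
    free_bytes = (gpu_memory_gb * (1 - reserve_fraction) - weights_gb) * 1e9
    if free_bytes <= 0:
        return 0  # weights alone do not fit in the usable memory
    per_request = kv_cache_bytes_per_token(
        num_layers, num_kv_heads, head_dim, bytes_per_elem) * tokens_per_request
    return int(free_bytes // per_request)


if __name__ == "__main__":
    # Hypothetical 7B-style model in fp16 (~14 GB of weights) on an 80 GB GPU,
    # with each request budgeted at 4096 tokens (prompt + output)
    print(max_concurrent_requests(80, 14, 4096, 32, 32, 128))
```

Budgeting every request at its worst-case length is pessimistic. If the serving engine allocates KV cache on demand (as vLLM's PagedAttention does), the average tokens per request is closer to what matters, so you can plug in typical input/output lengths from your traffic to get a less conservative figure.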
