My current rule of thumb is that 1GB of memory gets you 1B parameters with a big context (e.g. Qwen 32B fits in 32GB with a 200K+ context).
That’s with heavy compression of the weights and the context, of course.
I haven’t gone through model evaluation + shoehorning at 128GiB yet.
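
For anyone who wants to sanity-check the 1GB-per-1B rule, here is a rough back-of-envelope sketch of how the 32GB figure could add up. Everything specific in it is an assumption for illustration, not something I've confirmed for any particular build: 4-bit weights, a 4-bit KV cache, and a 64-layer model with 8 KV heads of dimension 128. The function name `estimate_memory_gb` is made up for this example, and the estimate ignores activations, quantization metadata, and runtime overhead.

```python
def estimate_memory_gb(params_billion, weight_bits, n_layers, n_kv_heads,
                       head_dim, context_tokens, kv_bits):
    """Return (weights_gb, kv_cache_gb), using 1 GB = 1e9 bytes."""
    weight_bytes = params_billion * 1e9 * weight_bits / 8
    # K and V each store n_kv_heads * head_dim values per layer per token.
    kv_values_per_token = 2 * n_layers * n_kv_heads * head_dim
    kv_bytes = kv_values_per_token * context_tokens * kv_bits / 8
    return weight_bytes / 1e9, kv_bytes / 1e9


# Assumed values for illustration only (not taken from a specific model card).
weights_gb, kv_gb = estimate_memory_gb(
    params_billion=32, weight_bits=4,
    n_layers=64, n_kv_heads=8, head_dim=128,
    context_tokens=200_000, kv_bits=4,
)
print(f"weights ~{weights_gb:.1f} GB, KV cache ~{kv_gb:.1f} GB, "
      f"total ~{weights_gb + kv_gb:.1f} GB")
```

Under those assumptions the weights and the KV cache land in the same ballpark, which is why compressing the context matters about as much as compressing the weights at this context length.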