If you are doing ~4-bit quantization, a good rule of thumb is just under 1 GB of VRAM per 1B parameters, plus a little room for the operating system.
Longer contexts require a bit more VRAM, since the KV cache grows with context length.
For reference, 4-bit LlamaV1 33B fits snugly on a 24GB GPU with 2K context with the ExLlama backend, but it won't handle really long inputs.
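
To sanity-check the 33B example above, here is a rough back-of-the-envelope sketch in Python. It only counts the quantized weights plus an fp16 KV cache, and it ignores quantization scales, activations, and backend overhead, so treat the result as a lower bound. The function name `estimate_vram_gb` is made up for illustration; the layer count (60) and hidden size (6656) are the published LlamaV1 33B dimensions, and 32.5B is its actual parameter count.

```python
def estimate_vram_gb(n_params_billion, bits_per_weight, n_layers,
                     hidden_size, context_len, kv_bytes=2):
    """Rough lower bound on VRAM (in GB) for a quantized decoder-only model.

    Assumptions: weights stored at exactly `bits_per_weight` bits each,
    KV cache kept in fp16 (2 bytes per value), no other overhead counted.
    """
    weight_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    # One key and one value vector of size `hidden_size` per layer per token.
    kv_cache_bytes = 2 * n_layers * hidden_size * context_len * kv_bytes
    return (weight_bytes + kv_cache_bytes) / 1e9


# LlamaV1 33B at 4 bits with 2K context vs. 8K context.
short_ctx = estimate_vram_gb(32.5, 4, n_layers=60, hidden_size=6656, context_len=2048)
long_ctx = estimate_vram_gb(32.5, 4, n_layers=60, hidden_size=6656, context_len=8192)
print(f"2K context: {short_ctx:.1f} GB, 8K context: {long_ctx:.1f} GB")
```

With these numbers the 2K case comes out at roughly 19.5 GB before overhead, which is why it is a snug fit on 24GB, and the 8K case goes well past 24GB from the KV cache alone.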
Llama.cpp is pretty much the only backend that can offload to CPU efficiently. It's still quite fast and offers very flexible 3-5 bit quantization, with the leanest 3-bit quant just barely fitting LlamaV1 33B on my laptop (6GB VRAM + 16GB RAM).
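
Llama.cpp expresses the GPU/CPU split as a number of layers to put on the GPU (the `-ngl` / `--n-gpu-layers` option). A minimal sketch of how you might pick that number, assuming all layers are about the same size and that you want to keep some VRAM in reserve for the KV cache and scratch buffers. The helper `layers_that_fit`, the 13.5 GB file size for a 3-bit 33B quant, and the 1 GB reserve are all illustrative guesses, not measured values.

```python
def layers_that_fit(model_size_gb, n_layers, vram_gb, reserve_gb=1.0):
    """Estimate how many layers of a quantized model fit in VRAM.

    Assumes every layer takes an equal share of the model file and keeps
    `reserve_gb` of VRAM free for the KV cache and scratch buffers.
    """
    per_layer_gb = model_size_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# Hypothetical 3-bit 33B quant (~13.5 GB, 60 layers) on a 6GB GPU.
gpu_layers = layers_that_fit(model_size_gb=13.5, n_layers=60, vram_gb=6)
print(f"Try offloading about {gpu_layers} layers to the GPU")
```

The remaining layers stay in system RAM and run on the CPU, so in practice you would start from this estimate and lower it if you run out of VRAM.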