Model: Qwen3 32B
GPU: RTX 5090 (no ROPs missing), 32 GB VRAM
Quants: Unsloth Dynamic 2.0, 4-6 bits depending on the layer.
RAM: 96 GB. More RAM makes a difference even if the model fits entirely in the GPU: the filesystem pages holding the model file on disk are cached entirely in RAM, so when you switch models (we use other models as well) the overhead of unloading/loading is only 3-5 seconds.
The KV cache is also quantized to 8 bits (going lower degrades quality considerably).
This gives you 1 generation with 64k context, or 2 concurrent generations with 32k each. Everything takes 30 GB of VRAM, which leaves some space for a Whisper speech-to-text model (turbo, quantized) running in parallel.
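
To see why 1 x 64k and 2 x 32k cost the same, here is a rough back-of-the-envelope sketch of the KV cache size. The layer count, KV head count and head dimension are my assumptions for Qwen3 32B (they are not stated above, so check the model's config.json), "64k" is taken as 65536 tokens, and the 8-bit figure ignores the small per-block scale overhead that real quantized cache formats add.

```python
# Assumed Qwen3 32B attention shape (hypothetical values for this sketch;
# verify against the model's config.json before relying on the numbers).
N_LAYERS = 64
N_KV_HEADS = 8
HEAD_DIM = 128

GIB = 1024 ** 3


def kv_cache_bytes(context_tokens: int,
                   bits_per_element: float = 8.0,
                   sequences: int = 1) -> float:
    """Approximate KV cache size in bytes.

    Each token stores one key and one value vector per layer and KV head.
    """
    elements_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM  # 2 = key + value
    return sequences * context_tokens * elements_per_token * bits_per_element / 8


one_long = kv_cache_bytes(64 * 1024)                 # 1 generation, 64k context
two_short = kv_cache_bytes(32 * 1024, sequences=2)   # 2 generations, 32k each
unquantized = kv_cache_bytes(64 * 1024, bits_per_element=16)

print(f"1 x 64k @ 8-bit:  {one_long / GIB:.1f} GiB")
print(f"2 x 32k @ 8-bit:  {two_short / GIB:.1f} GiB")
print(f"1 x 64k @ 16-bit: {unquantized / GIB:.1f} GiB")
```

Under these assumptions the cache is about 8 GiB either way, since only the total number of cached tokens matters, and a 16-bit cache would need roughly twice that, which would not fit next to the weights in 32 GB.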