But yeah, there's a bit of a dearth of models that could fully utilize memory in the 128-256GB bracket at the moment. Things move so fast in this space, though, that I wouldn't base my decision on a generation of models that's just a few months old.
It depends on what's meant by "fully utilized", but fp8 quants of Nemotron 3 Super, the latest Minimax, Cohere A+, and the Mistral small and (especially) medium variants all sit in that 128-256GB category, particularly with full context or even moderate concurrency. In fact, in a 192GB environment I work with (Hopper GPUs, fwiw), I was pushed into using 4-bit quants for a couple of those to get the model working with a reasonable context window (but 256GB would have rocked).
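For anyone who wants to sanity-check that kind of sizing, here's a rough back-of-envelope sketch. The model shape below (120B parameters, 80 layers, 8 KV heads, head dim 128) is made up for illustration and is not the spec of any model named above. It also assumes a dense transformer with a 16-bit KV cache, and ignores activations, runtime overhead, and fragmentation, so real usage will be higher.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, concurrent_seqs: int,
                bytes_per_elem: int = 2) -> float:
    """Approximate KV-cache memory in GB.

    The leading 2 accounts for one K and one V tensor per layer.
    bytes_per_elem=2 assumes a 16-bit cache.
    """
    total_bytes = (2 * layers * kv_heads * head_dim
                   * context_tokens * concurrent_seqs * bytes_per_elem)
    return total_bytes / 1e9


def total_gb(params_billion: float, bits_per_weight: float, **kv_shape) -> float:
    """Weights plus KV cache; everything else is deliberately ignored."""
    return weight_gb(params_billion, bits_per_weight) + kv_cache_gb(**kv_shape)


# Hypothetical model shape and workload, for illustration only.
shape = dict(layers=80, kv_heads=8, head_dim=128,
             context_tokens=131072, concurrent_seqs=2)

for bits in (8, 4):
    need = total_gb(120, bits, **shape)
    for budget in (192, 256):
        verdict = "fits" if need <= budget else "does not fit"
        print(f"{bits}-bit weights: ~{need:.0f} GB, {verdict} in {budget} GB")
```

With those made-up numbers, the KV cache for two full-length sequences comes to roughly 86 GB, so the fp8 variant lands around 206 GB (over a 192GB budget, under 256GB) while the 4-bit variant lands around 146 GB. That's the same shape of trade-off as described above: the weights alone fit, and it's context and concurrency that push you down a quant level.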