Nice post! You piqued my curiosity, and after a bit of research it looks like techniques such as MTP, MLA, and CSA probably make these models much more efficient (and maybe bigger, though 400B sounds about right) than a simple RAM breakdown would suggest.
MTP - https://blog.google/innovation-and-ai/technology/developers-...
MLA - https://machinelearningmastery.com/a-gentle-introduction-to-...
CSA - https://deepseek.ai/blog/deepseek-v4-compressed-attention
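
To put a rough number on the MLA part, here's a back-of-the-envelope sketch of per-token KV cache size for plain multi-head attention versus an MLA-style compressed cache. The layer count, head count, and latent/RoPE dimensions below are illustrative assumptions I picked for the arithmetic, not known specs of the models in the post, and the function names are made up.

```python
def kv_bytes_per_token_mha(n_layers, n_heads, head_dim, bytes_per_elem=2):
    # Plain multi-head attention caches a full key AND a full value
    # vector for every head in every layer (hence the factor of 2).
    return 2 * n_layers * n_heads * head_dim * bytes_per_elem


def kv_bytes_per_token_mla(n_layers, latent_dim, rope_dim, bytes_per_elem=2):
    # MLA-style caching stores one shared compressed latent per layer,
    # plus a small positional (RoPE) key component.
    return n_layers * (latent_dim + rope_dim) * bytes_per_elem


# Hypothetical config, chosen only to illustrate the arithmetic (16-bit cache).
N_LAYERS, N_HEADS, HEAD_DIM = 60, 128, 128
LATENT_DIM, ROPE_DIM = 512, 64

mha = kv_bytes_per_token_mha(N_LAYERS, N_HEADS, HEAD_DIM)
mla = kv_bytes_per_token_mla(N_LAYERS, LATENT_DIM, ROPE_DIM)

print(f"MHA cache: {mha:,} bytes/token")   # 3,932,160
print(f"MLA cache: {mla:,} bytes/token")   # 69,120
print(f"reduction: {mha / mla:.1f}x")      # 56.9x
```

With those assumed numbers the cache shrinks by roughly 57x per token, which is the kind of gap a naive "weights + KV cache" RAM estimate would miss.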