As to comparing parameter counts, I disagree with you. I think it's perfectly OK to compare parameter counts for different kinds of models. It would also be perfectly OK to compare, say, computational efficiency per parameter in each forward pass (which for this model is impressive), but that wasn't the focus of my comment above.
Finally, you're right that I didn't mention all the interim parameter counts that we have seen below 600B in all transformer variants. The list would have been way too long had I tried to include every figure!