> and blowing up parameter count to make up for it
Based on an (admittedly rapid and indulgent) reading of the paper, it seems like they're not increasing the parameter count. Do you mind pointing out where the blowup is occurring?
They're saying that models of comparable size will likely perform worse (the paper claims they perform as well).
But since they are more memory efficient (up to 8-10x if the ternary weights are packed tighter than 2 bits each; in practice it seems more like 3-5x once the other, larger structures needed in memory are accounted for), the largest models can be that much larger.
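For context on where sub-2-bit packing comes from: a naive 2-bit encoding wastes one of its four codes on ternary weights {-1, 0, +1}, but grouping trits recovers most of that, e.g. five trits per byte gives 8/5 = 1.6 bits per weight (the theoretical floor is log2(3) ≈ 1.585). A minimal sketch of that scheme; the function names are my own illustration, not from the paper:

```python
# Sketch: pack ternary weights {-1, 0, +1} at 1.6 bits each by
# encoding 5 trits per byte (3^5 = 243 <= 256), vs. 2 bits naively.

def pack_trits(weights):
    """Pack a list of ternary weights into bytes, 5 trits per byte."""
    out = bytearray()
    for i in range(0, len(weights), 5):
        group = weights[i:i + 5]
        val = 0
        for w in reversed(group):
            val = val * 3 + (w + 1)  # map {-1,0,1} -> base-3 digit {0,1,2}
        out.append(val)
    return bytes(out)

def unpack_trits(data, n):
    """Recover the first n ternary weights from packed bytes."""
    weights = []
    for byte in data:
        for _ in range(5):
            weights.append(byte % 3 - 1)  # digit {0,1,2} -> weight {-1,0,1}
            byte //= 3
    return weights[:n]
```

So 7 weights fit in 2 bytes instead of the 14 bytes an fp16 layout would take, which is where the roughly 8-10x headline figure comes from before activations, KV cache, and other runtime structures eat into it.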