The smaller the model, the less data has to be read from RAM for every single token: at batch size 1, generating a token means streaming essentially all of the model's weights through the processor, so memory bandwidth sets the speed limit.
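To make that concrete, here is a rough back-of-envelope sketch in Python. The helper `decode_tokens_per_second` and all hardware numbers below are illustrative assumptions, not measurements from any particular device:

```python
# At batch size 1, decode is memory-bandwidth-bound: tokens/s is roughly
# (memory bandwidth) / (bytes of weights streamed per token).

def decode_tokens_per_second(params_billions: float,
                             bytes_per_param: float,
                             bandwidth_gb_per_s: float) -> float:
    """Rough upper bound on single-stream decode speed for a dense model."""
    model_bytes = params_billions * 1e9 * bytes_per_param
    return bandwidth_gb_per_s * 1e9 / model_bytes

# Assumed examples: fp16 weights (2 bytes/param) on ~1 TB/s of bandwidth.
print(decode_tokens_per_second(7, 2.0, 1000))    # 7B model:  ~71 tokens/s
print(decode_tokens_per_second(70, 2.0, 1000))   # 70B model: ~7 tokens/s
```

Halve the bytes per parameter (say, with 8-bit quantization) and this ceiling roughly doubles.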
Batching mixes up this calculus a bit: a single pass over the weights can serve a whole batch of sequences at once, so the read cost is amortized across the batch and the bottleneck gradually shifts from memory bandwidth to compute.
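A minimal extension of the sketch above, assuming each decode step costs one full read of the weights plus roughly 2 FLOPs per parameter per token; KV-cache traffic is ignored for simplicity, and every number is again an assumption:

```python
def batched_tokens_per_second(batch_size: int,
                              model_bytes: float,
                              bandwidth_bytes_per_s: float,
                              peak_flops: float,
                              flops_per_token: float) -> float:
    """Tokens/s limited by whichever binds first: weight reads or compute."""
    time_memory = model_bytes / bandwidth_bytes_per_s          # weights read once per step
    time_compute = batch_size * flops_per_token / peak_flops   # work grows with the batch
    return batch_size / max(time_memory, time_compute)

# Assumed 7B fp16 model (~14 GB), ~1 TB/s bandwidth, ~300 TFLOP/s peak,
# ~14 GFLOPs per token (2 * params for a dense model).
for b in (1, 8, 64, 512):
    print(b, round(batched_tokens_per_second(b, 14e9, 1e12, 3e14, 14e9)))
```

Under these assumptions, throughput scales almost linearly with batch size until around batch ~300, where the step flips from memory-bound to compute-bound and the curve flattens out.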