In a hypothetical where I can run my inputs directly on good hardware how should I be thinking of performance of these models in relation to my input / output size?
And what's happening under the hood in the model's architecture? It seems to return 1 token at a time? Does that mean it predicts one token at a time, adds that to the context window, and then does another forward inference pass?
If I tune a model to respond with simple `0` or `1` answers to yes and no questions will that be faster than letting it answer "yes {reason}" or "no {reason}"? If it were inferring single tokens it seems like that would always be faster, but if it were inferring a block of tokens maybe not...?
I'll summarize everything shared here into a blog post for the next person who asks :) Thanks.