Alright, I don't understand everything yet, but you said ~5 seconds per token, so for prompts with hundreds to a thousand tokens we are on the order of tens of minutes to over an hour (1,000 tokens × 5 s ≈ 5,000 s, about 83 minutes). I would be targeting coding prompts.
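Just to sanity-check that arithmetic, here is a quick back-of-envelope script; the 5 s/token figure is the one from this conversation, and the token counts are hypothetical:

```python
# Back-of-envelope latency: seconds per token times token count.
SECONDS_PER_TOKEN = 5.0  # figure quoted above; actual speed varies by hardware/model

for n_tokens in (100, 500, 1000):
    total_s = n_tokens * SECONDS_PER_TOKEN
    print(f"{n_tokens:5d} tokens -> {total_s / 60:6.1f} minutes")

# Output:
#   100 tokens ->    8.3 minutes
#   500 tokens ->   41.7 minutes
#  1000 tokens ->   83.3 minutes
```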
Well, it means that one day I'll have to get into the real thing: the actual inference code, actually running inference on a small model.
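For what that first step might look like, here is a minimal sketch, assuming the Hugging Face transformers library and a small model like "gpt2" (~124M parameters); neither choice is from the conversation, they're just common starting points:

```python
# Minimal local inference on a small causal language model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any small causal LM would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "def fibonacci(n):"  # a coding-style prompt, per the goal above
inputs = tokenizer(prompt, return_tensors="pt")

# generate() produces tokens one at a time under the hood; this loop
# is exactly where the per-token latency discussed above comes from.
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Timing that `generate()` call would give a real seconds-per-token number for the machine at hand, instead of the ~5 s estimate.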