I think it has to do with text needing to be much more precise. Your stably diffused cartoon avatar having six fingers is not nearly as noticeable as a language model's chat misspelling every second word. So for images you need fewer resources to get to a human-acceptable result.