There are plenty of usecases for generating art/images where a latency of days or weeks would be competitive with the current state of the art.
For example, corporate graphics design, logos, brand photography, etc.
I really do think inference time is a red herring for the first generation of these models.
Sure, the more transformative use-cases like real-time content generation to replace movies/games, but there is a lot of value to be created prior to that point.