1. Commodity hardware can run the inference on a single instance (this must be true if a user device can do it).
2. It’s apparently possible to run a video game streaming service for $10/month/user.
3. So users should be able to generate unlimited images (one at a time) for $10/month?
Maybe the answer is that the DALL-E/Midjourney models running in the cloud are super inefficient and Stable Diffusion is better. If so, the services will need to care about optimizing to get that kind of performance. But it's not inherently expensive just because they run it in the cloud.
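To make the reasoning above concrete, here is a back-of-envelope sketch of how far a $10/month budget would stretch. Only the $10 figure comes from the argument above. The GPU hourly price and the seconds per image are placeholder assumptions, not measured or quoted numbers, and the function names are made up for illustration.

```python
def gpu_hours_per_month(budget_usd: float, gpu_usd_per_hour: float) -> float:
    """Hours of dedicated GPU time a monthly budget buys."""
    return budget_usd / gpu_usd_per_hour


def images_per_month(budget_usd: float, gpu_usd_per_hour: float,
                     seconds_per_image: float) -> float:
    """Images one user can generate per month, one at a time, within budget."""
    gpu_seconds = gpu_hours_per_month(budget_usd, gpu_usd_per_hour) * 3600
    return gpu_seconds / seconds_per_image


# $10/month is from the argument above; the other two inputs are
# hypothetical placeholders, so swap in real numbers before drawing conclusions.
BUDGET_USD = 10.0
ASSUMED_GPU_USD_PER_HOUR = 1.0
ASSUMED_SECONDS_PER_IMAGE = 5.0

hours = gpu_hours_per_month(BUDGET_USD, ASSUMED_GPU_USD_PER_HOUR)
images = images_per_month(BUDGET_USD, ASSUMED_GPU_USD_PER_HOUR,
                          ASSUMED_SECONDS_PER_IMAGE)
print(f"{hours:.1f} GPU-hours/month, about {images:.0f} images/month")
```

Under these placeholder inputs the budget covers 10 GPU-hours, or about 7200 images a month. So "unlimited" would only hold up if a typical user generates fewer images than that, and the result scales directly with how efficient the model is per image.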