If the process of image generation / description were fully reversible, we could store image descriptions instead of a list of pixels...
But if one feeds an image description from ChatGPT to DALL-E and back in a loop, how many steps does it take to degrade into pure noise? (Surely this has been tried? But I couldn't find it.)
I mean, there are billions of perceptually distinct images that map to the same “text description”. So text would generally be both lossy and inefficient.
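A rough count shows why. Assume, purely for illustration, 512x512 8-bit RGB images and descriptions of exactly 100 tokens drawn from a 50,000-token vocabulary; none of these numbers are tied to a particular model. Counting raw pixel values overstates how many images are *perceptually* distinct, but the gap is so large that the pigeonhole conclusion survives any reasonable discount:

```python
import math

def image_bits(width: int, height: int, channels: int = 3, bits_per_channel: int = 8) -> int:
    """log2 of the number of distinct raw images with these dimensions."""
    return width * height * channels * bits_per_channel

def text_bits(vocab_size: int, num_tokens: int) -> float:
    """log2 of the number of token sequences of exactly num_tokens tokens."""
    return num_tokens * math.log2(vocab_size)

# Illustrative sizes only (assumptions, not any specific model's limits).
img = image_bits(512, 512)
txt = text_bits(50_000, 100)

print(f"image space: 2^{img} possibilities")
print(f"text space:  about 2^{txt:.0f} possibilities")
print(f"images per description, on average: about 2^{img - txt:.0f}")
```

Since the text space is astronomically smaller, most descriptions must stand for enormous numbers of different images, which is exactly why a description-to-image round trip cannot be lossless.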
> instead of a list of pixels
We don’t store lists of pixels. Not even lossless formats like PNG do that. Good ole JPEG gets a 1:10 to 1:20 compression ratio, ballpark.
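PNG illustrates the point nicely: it runs each scanline through a prediction filter and then compresses the filtered bytes with DEFLATE (zlib). The sketch below uses only Python's standard `zlib` module to apply PNG's "Sub" filter to a synthetic gradient and compare sizes. The gradient and its dimensions are made-up stand-ins for a real image, and the output is only the compressed scanline data, not a valid .png file (no signature, chunks or CRCs). Real photos compress far less well than a smooth gradient.

```python
import zlib

WIDTH, HEIGHT, BPP = 256, 256, 3  # 8-bit RGB: 3 bytes per pixel (illustrative sizes)

def make_gradient(width: int, height: int) -> list[bytes]:
    """Rows of raw RGB pixels for a smooth synthetic gradient (stand-in for a real image)."""
    rows = []
    for y in range(height):
        row = bytearray()
        for x in range(width):
            row += bytes((x * 255 // (width - 1), y * 255 // (height - 1), 128))
        rows.append(bytes(row))
    return rows

def sub_filter(row: bytes, bpp: int) -> bytes:
    """PNG 'Sub' filter: each byte minus the byte one pixel to its left, mod 256."""
    out = bytearray(len(row))
    for i, value in enumerate(row):
        left = row[i - bpp] if i >= bpp else 0
        out[i] = (value - left) % 256
    return bytes(out)

rows = make_gradient(WIDTH, HEIGHT)
raw = b"".join(rows)
# PNG prefixes every scanline with a filter-type byte; 1 means "Sub".
filtered = b"".join(b"\x01" + sub_filter(r, BPP) for r in rows)

plain = zlib.compress(raw, 9)             # DEFLATE on the raw pixel list
with_filter = zlib.compress(filtered, 9)  # roughly what a PNG IDAT payload holds

print(f"raw pixels:           {len(raw)} bytes")
print(f"DEFLATE only:         {len(plain)} bytes")
print(f"Sub filter + DEFLATE: {len(with_filter)} bytes")
```

The filter turns the smoothly varying pixels into long runs of identical small differences, which DEFLATE then squeezes down; decompressing and undoing the filter recovers every byte, so nothing is lost.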