There is a catch, though. It's only several bytes "on average". We can't tell whether some images contribute practically 0 bits to the final result while others contribute more.
(I know the word "contribute" is a bit nonsensical in the context of ML. But existing lossy compression algorithms are not that different in this sense: if you compress 1M frames produced by a 3D renderer into an .mpeg video, each frame doesn't contribute the same number of bytes to the final result.)
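
To make the video analogy concrete, here is a rough sketch of my own. It uses Python's `zlib` as a stand-in, which is lossless and not an MPEG codec, so take it only as an illustration of the "average hides the distribution" point. The 4096-byte random "frames" and the helper name `per_frame_bytes` are made up for the example.

```python
import os
import zlib


def per_frame_bytes(frames):
    """Return how many compressed bytes each frame adds to one shared zlib stream."""
    comp = zlib.compressobj(level=9)
    sizes = []
    for frame in frames:
        # Z_SYNC_FLUSH forces out everything produced so far, so the length of
        # `out` is the number of bytes this frame added to the stream.
        out = comp.compress(frame) + comp.flush(zlib.Z_SYNC_FLUSH)
        sizes.append(len(out))
    return sizes


# Hypothetical "video": nine identical frames followed by one brand-new frame.
static_frame = os.urandom(4096)
frames = [static_frame] * 9 + [os.urandom(4096)]

sizes = per_frame_bytes(frames)
average = sum(sizes) / len(sizes)

print("bytes added per frame:", sizes)
print("average bytes per frame:", round(average, 1))
```

The first and last frames each add roughly their full size, while the repeated frames in between add only a handful of bytes. The average sits somewhere in the middle and describes none of the individual frames well.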