Learn More: https://www.dkriesel.com/start?do=search&id=en%3Aperson&q=Xe...
Brief: Xerox machines used template matching to recycle the scanned images of individual digits that recur in the document. In 2013, Kriesel discovered this procedure was faulty.
Rationale: This method can produce smaller PDFs, which is advantageous for customers who scan and archive numerical documents.
Prior art: https://link.springer.com/chapter/10.1007/3-540-19036-8_22
Tech Problem: Xerox's template matching procedure was not reliable, sometimes "papering over" a digit with the wrong digit!
PR Problem: Xerox press releases initially claimed this issue did not happen in the factory default mode. Kriesel demonstrated this was not true, by replicating the issue in all of the factory default compression modes including the "normal" mode. He gave a 2015 FrOSCon talk, "Lies, damned lies and scans".
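To make the failure mode concrete, here is a toy sketch of thresholded symbol matching (my own illustration, not Xerox's actual JBIG2 implementation, which is far more elaborate): if the mismatch threshold is too loose, a scanned 6 can be "close enough" to a stored 8 template and gets silently replaced by it.

```python
import numpy as np

def match_or_store(glyph, dictionary, max_mismatch=0.12):
    """Toy JBIG2-style symbol matcher (illustrative only).

    Compares a binarized glyph bitmap against stored templates and
    recycles the first template whose pixel mismatch rate is below
    the threshold. A threshold that is too loose substitutes a
    visually similar but WRONG symbol -- the Xerox failure mode.
    """
    for template in dictionary:
        if template.shape == glyph.shape:
            mismatch = np.mean(template != glyph)  # fraction of differing pixels
            if mismatch <= max_mismatch:
                return template        # reuse stored image (lossy!)
    dictionary.append(glyph)           # no close match: store the new glyph
    return glyph
```

And because the recycled glyph is a clean bitmap, the decoded page looks perfectly crisp; nothing visually signals that a substitution happened.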
Interesting work!
The contextual information surrounding intentional data loss needs to be preserved. Without that context, we become ignorant of the missing data. Worst case, you get replaced numbers. Average case, you get lossy->lossy transcodes, which is why we end up with degraded content.
There are only two places to put that contextual information: metadata and watermarks. Metadata can be written to a file, but there is no guarantee it will be copied with that data. Watermarks fundamentally degrade the content once, and may not be preserved in derivative works.
I wish that the generative model explosion would result in a better culture of metadata preservation. Unfortunately, it looks like the focus is on watermarking instead.
One key issue with emerging 'AI compression' techniques is that the information loss is not deterministic, which somewhat complicates assessing suitability.
It is technically possible to make it deterministic.
The main reason you don't get deterministic outputs today is that CUDA/GPU optimizations make the calculations run much faster if you let them be non-deterministic.
The internal GPU scheduler will then process things in the order it thinks is fastest.
Since floating point is not associative, you can get different results for (a + (b + c)) and ((a + b) + c).
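A quick demonstration (plain Python/NumPy on the CPU; the GPU case is the same effect, just with the parenthesization chosen by the scheduler):

```python
import numpy as np

# Floating-point addition is not associative:
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0 -- the 1.0 is absorbed when added to -1e16 first

# Summation order changes the result of a large reduction too:
rng = np.random.default_rng(0)
x = rng.standard_normal(10**6).astype(np.float32)
print(x.sum() == np.sort(x).sum())  # usually False: same values, different order
```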
Many mainstream codecs are pretty good at adhering to their reference implementations, but they are still open to similar issues, so outputs may not be bit-exact.
With a DCT or wavelet transform, quantisation, chroma subsampling, entropy coding, motion prediction, and the suite of other techniques that go into modern media squishing, it’s possible to mostly reason about what type of error will come out the other end of the system for a yet-to-be-seen input.
When that system is replaced by a non-linear box of mystery, this ability is lost.
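As a toy illustration of that kind of reasoning (a minimal sketch, not any particular codec): with an orthonormal DCT and a uniform quantiser of step q, every coefficient error is at most q/2, so Parseval's theorem bounds the pixel-domain error energy for any input you'll ever see.

```python
import numpy as np
from scipy.fft import dctn, idctn

def toy_codec(block, q=16.0):
    """Uniformly quantize the block's 2-D DCT coefficients with step q."""
    coeffs = dctn(block, norm="ortho")
    coeffs_hat = np.round(coeffs / q) * q  # per-coefficient error <= q/2
    return idctn(coeffs_hat, norm="ortho")

rng = np.random.default_rng(0)
block = rng.uniform(0, 255, size=(8, 8))
err = toy_codec(block) - block
# The transform is orthonormal, so the pixel-domain error energy equals
# the coefficient-domain error energy, bounded by 64 * (q/2)**2 here.
print((err ** 2).sum() <= 64 * (16.0 / 2) ** 2)  # True, for ANY input block
```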
2. "Lower DPI" is extremely common if your definition for that is 300dpi. At my company, all the text document are scanned at 200dpi by default. And 150dpi or even lower is perfectly readable if you don't use ridiculous compression ratios.
> Other artifacts at lower resolution can exhibit similar mangling as well (specifics would of course vary)
The majority of traditional compression methods would make text unreadable when compression is too high or the source material is too low-resolution. They don't substitute one number for another in an "unambiguous" way (i.e. clearly showing a wrong number instead of just a blurry blob that could be either).
The "specifics" here is exactly what the whole topic is focus on, so you can't really gloss over it.
It is relevant only if you assume that lossy compression has no way to control or even detect such critical changes. In reality, most lossy compression algorithms use rate-distortion optimization, which is only possible when you have some notion of "distortion" in the first place. Given that the error rarely occurred at higher DPIs, its cause must have been either a miscalculation of distortion or a misconfiguration of the distortion threshold for patching.
In any case, a correct implementation should be able to do the correct thing. It would have been much more problematic if similar cases had recurred, since that would mean writing a correct implementation is much harder than expected, but that didn't happen.
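For reference, a minimal sketch of what rate-distortion optimization looks like (illustrative Python, not any shipping codec; `rate_of` is a hypothetical bit-cost estimator I made up for the example):

```python
import numpy as np

def rd_choose(block, candidates, rate_of, lam=0.1):
    """Pick the encoding minimizing J = D + lambda * R.

    `candidates` are possible reconstructions (e.g. different quantizers,
    or a patch recycled from a pattern dictionary); `rate_of` estimates
    the bits each one costs. The distortion term D is exactly where a
    codec can notice it is about to paper over a '6' with an '8'.
    """
    def cost(cand):
        distortion = np.mean((block - cand) ** 2)  # D: mean squared error
        return distortion + lam * rate_of(cand)    # J = D + lambda * R
    return min(candidates, key=cost)
```

If the distortion metric or its threshold is misconfigured, the optimizer happily picks the wrong candidate, which is consistent with the explanation above.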
> The majority of traditional compression methods would make text unreadable when compression is too high or the source material is too low-resolution. They don't substitute one number for another in an "unambiguous" way (i.e. clearly showing a wrong number instead of just a blurry blob that could be either).
Traditional codecs simply didn't have the computational power to do so. The "blurry blob" is by definition something with only lower-frequency components, and you have only a small number of them, so they were easier to preserve even with limited resources. But if you have and can recognize a similar enough pattern, it should be exploited for further compression. Motion compensation in video codecs was already doing a similar thing, and either filtering or intelligent quantization that preserves higher-frequency components would be able to do so too.
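Motion compensation in that sense is just the same "reuse a pattern you already have" trick, applied against the previous frame. A hedged sketch of the classic sum-of-absolute-differences block search:

```python
import numpy as np

def best_match(prev, block, top, left, radius=4):
    """Exhaustive SAD block search around (top, left) in the previous frame.

    Returns the motion vector minimizing the sum of absolute differences;
    the encoder then codes only the vector plus a cheap residual instead
    of the block itself -- pattern reuse, much like symbol matching.
    """
    h, w = block.shape
    best, best_mv = np.inf, (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            if 0 <= y and y + h <= prev.shape[0] and 0 <= x and x + w <= prev.shape[1]:
                sad = np.abs(prev[y:y+h, x:x+w].astype(int) - block.astype(int)).sum()
                if sad < best:
                    best, best_mv = sad, (dy, dx)
    return best_mv, best
```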
----
> 2. "Lower DPI" is extremely common if your definition for that is 300dpi. At my company, all the text document are scanned at 200dpi by default. And 150dpi or even lower is perfectly readable if you don't use ridiculous compression ratios.
I admit I have generalized too much, but the choice of scan resolution is highly specific to contents, font sizes and even writing systems. If you and your company can cope with lower DPIs, that's good for you, but I believe 300 dpi is indeed the safe minimum.
https://pub.towardsai.net/stable-diffusion-based-image-compr...
HN discussion: https://news.ycombinator.com/item?id=32907494
Even when going down to 4-6 bits per latent-space pixel, the results are surprisingly good.
It's also interesting what happens if you ablate individual channels; ablating channel 0 results in faithful color but shitty edges, ablating channel 2 results in shitty color but good edges, etc.
The one thing it fails catastrophically on though is small text in images. The Stable Diffusion VAE is not designed to represent text faithfully. (It's possible to train a VAE that does slightly better at this, though.)
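If you want to poke at this yourself, here is a rough sketch of the latent-quantization round trip using the diffusers `AutoencoderKL` API (the checkpoint name, the latent range, and the naive uniform quantizer are my assumptions, not the article's scheme):

```python
import torch
from diffusers import AutoencoderKL

# Assumed checkpoint -- the linked article may use a different VAE.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

@torch.no_grad()
def roundtrip(pixels, bits=6):
    """Encode to the 4-channel latent, uniformly quantize, decode.

    pixels: float tensor of shape (1, 3, H, W) scaled to [-1, 1],
    H and W divisible by 8. Latents are quantized to 2**bits levels
    over an assumed fixed range -- a crude stand-in for the real scheme.
    """
    z = vae.encode(pixels).latent_dist.mean  # (1, 4, H/8, W/8)
    lo, hi = -4.0, 4.0                       # assumed latent range
    levels = 2 ** bits - 1
    q = torch.round((z.clamp(lo, hi) - lo) / (hi - lo) * levels)
    z_hat = q / levels * (hi - lo) + lo      # dequantize
    return vae.decode(z_hat).sample.clamp(-1, 1)
```

Reproducing the channel-ablation experiment is then just a matter of zeroing one latent channel, e.g. `z_hat[:, 0] = 0`, before decoding.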
This is intuitive, as the competition organisers say: compression is prediction.
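That slogan has a precise reading: an ideal arithmetic coder spends about -log2 p bits on a symbol the model predicted with probability p, so a better predictor directly means a smaller file. A minimal sketch of the bookkeeping (the `predict` interface is my own illustration):

```python
import math

def ideal_code_length(symbols, predict):
    """Total bits an ideal arithmetic coder needs given a predictive model.

    `predict(prefix)` returns a dict mapping each possible next symbol to
    its probability; the coder charges -log2 p for the symbol that occurs.
    """
    bits = 0.0
    for i, s in enumerate(symbols):
        p = predict(symbols[:i])[s]
        bits += -math.log2(p)
    return bits

# A uniform model over bytes costs 8 bits/symbol, i.e. no compression;
# any model that predicts the data better costs strictly fewer bits.
uniform = lambda prefix: {b: 1 / 256 for b in range(256)}
print(ideal_code_length(b"aaaa", uniform))  # 32.0
```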
The best overview you can probably get is from the “JPEG AI Overview Slides”.
However, this weekend someone released an open-source version with similar output. (https://replicate.com/philipp1337x/clarity-upscaler)
I'd recommend trying it. It takes a few tries to get the correct input parameters, and I've noticed anything approaching 4× scale tends to add unwanted hallucinations.
For example, I had a picture of a bear I made with Midjourney. At a scale of 2×, it looked great. At a scale of 4×, it adds bear faces into the fur. It also tends to turn human faces into completely different people if they start too small.
When it works, though, it really works. The detail it adds can be incredibly realistic.
Example bear images:
1. The original from Midjourney: https://i.imgur.com/HNlofCw.jpeg
2. Upscaled 2×: https://i.imgur.com/wvcG6j3.jpeg
3. Upscaled 4×: https://i.imgur.com/Et9Gfgj.jpeg
----------
The same person also released a lower-level version with more parameters to tinker with. (https://replicate.com/philipp1337x/multidiffusion-upscaler)
Their example with the cake is the most obvious. To me, the original image shows a delicious cake, and the modified one shows a cake that I would rather not eat...
Make sure to test the models before you deploy. Nothing doing super-resolution will be lossless, but normalizing flows can get you lossless compression.
The interesting diagram to me is the last one, for computational cost, which shows the 10× penalty of the ML-based codecs.
One problem is that without broad adoption, support even in niche cases is precarious; the ecosystem is smaller. That makes the codec not safe for archiving, only for distribution.
The strongest use case I see for this is streaming video, where the demand for compression is highest.
Plus the entire model, which comes with incorrect cache headers and must be redownloaded all the time.
I looked at the bear example above, and I can see it either way: the AI thought there was an animal face embedded in the fur, or we just see a face in the fur. After all, we see all kinds of faces on toast, even though neither the bread slicers nor the toasters intend to create them.