I've tried something somewhat similar in the past. I was looking at implementing an extremely fast decompressor with a ratio similar to LZ4's. I was able to get 2x the decompression speed of LZ4, but struggled with compression ratio. The idea was to have 16-byte matches, and to allow each match to apply a 16-bit mask telling whether each byte is part of the match or a literal. Then I restricted the compressor to only 16 distinct masks.
This was extremely fast to decompress, because decoding each 16-byte match is just: load the 16-byte match into an AVX2 register, load 16 bytes of literals, load the mask you're using, shuffle the literals, then blend the literals and the match. And because the matches are fixed size, you can start the fetches for multiple matches in parallel.
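To make that concrete, here is a minimal sketch of that decode step in C with SSE intrinsics (a 16-byte match fits in one XMM register); the mask_entry table, its field layout, and decode_token are my own invention for illustration, not the original code:

    #include <immintrin.h>
    #include <stdint.h>

    typedef struct {
        __m128i shuffle;   /* expands packed literal bytes to their hole positions */
        __m128i blend;     /* 0xFF in each byte that comes from the literal stream */
        int     lit_count; /* popcount of the 16-bit mask */
    } mask_entry;

    static mask_entry mask_table[16]; /* the 16 allowed masks, built offline */

    /* Decode one token: 16 bytes from the match source, patched with literals. */
    static const uint8_t *decode_token(uint8_t *dst, const uint8_t *match_src,
                                       const uint8_t *literals, int mask_id)
    {
        const mask_entry *m = &mask_table[mask_id];
        __m128i match  = _mm_loadu_si128((const __m128i *)match_src);
        __m128i lits   = _mm_loadu_si128((const __m128i *)literals);
        /* Scatter the packed literals into this mask's hole positions. */
        __m128i spread = _mm_shuffle_epi8(lits, m->shuffle);
        /* Take literal bytes where blend is 0xFF, match bytes elsewhere. */
        __m128i out    = _mm_blendv_epi8(match, spread, m->blend);
        _mm_storeu_si128((__m128i *)dst, out);
        return literals + m->lit_count; /* advance the literal stream */
    }

With only 16 allowed masks, the whole table stays hot in L1, which is presumably where the speed comes from.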
However, the problem I ran into, and would love to solve, is that I also wanted fast-ish compression speed, and it is very hard to search for good matches quickly when the matches have holes in them.
I guess the author is looking at GPU compression, so they are taking a somewhat brute-force approach. But I'd be interested to see how they're doing the match finding, and what kind of speed they're getting.
The author mentions in a tweet going from minutes to seconds for compression when switching from CPU to GPU[2]. From memory, he has made other references to a few seconds of compression time being entirely reasonable for such tasks, but I can't find a direct reference.
[1]: http://www.binomial.info/
[2]: https://twitter.com/richgel999/status/1476325003662667777
The whole:
LIT 13
LIT 24
LIT 65
LIT 32
...
could have been written as a single instruction: LIT [13, 24, 65, 32, ...]
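A rough C sketch of the difference (opcode values and framing are made up for illustration): per-literal instructions cost an opcode byte per literal, while a run costs one header for the whole batch:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    enum { OP_LIT = 0x00, OP_LIT_RUN = 0x01 }; /* hypothetical opcodes */

    /* One instruction per literal: 2 output bytes per literal byte. */
    static size_t emit_lits_single(uint8_t *out, const uint8_t *lits, size_t n)
    {
        size_t o = 0;
        for (size_t i = 0; i < n; i++) {
            out[o++] = OP_LIT;
            out[o++] = lits[i];
        }
        return o;
    }

    /* One run instruction: a 2-byte header, then the literals verbatim. */
    static size_t emit_lits_run(uint8_t *out, const uint8_t *lits, size_t n)
    {
        out[0] = OP_LIT_RUN;
        out[1] = (uint8_t)n; /* assumes n <= 255 for this sketch */
        memcpy(out + 2, lits, n);
        return 2 + n;
    }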
It's almost as if the author tries too hard to support their point that their variant looks better: "Notice how much faster it makes progress through the file vs. LZSS"
Yeah, because you encode every literal separately? It's all LIT instructions?
Way back in the day, when I upgraded Subversion's delta format, I did a lot of work testing out various mechanisms. In practice, target-side copies were much worse, and much more expensive to process, than doing source-only copies and then zlib'ing the delta instructions + new data ;)

(IE with only original-source copies, and a->delta1->delta2->delta3->b, composing delta1/2/3 prior to applying them is easy and simple to reason about. If you allow target-side copies, it is a lot messier.)
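To illustrate the point (hypothetical types, not actual svndiff code): with source-only copies, every instruction reads either from the original source or from the delta's own new-data section, never from the output being produced, so composing delta2 over delta1 reduces to a range lookup over delta1's instruction list:

    #include <stddef.h>

    typedef enum { COPY_SOURCE, NEW_DATA } op_kind;

    typedef struct { op_kind kind; size_t offset, len; } delta_op;

    /* Rewrite one COPY op of delta2 (whose "source" is delta1's output)
     * into the delta1 ops that produced that output range. Simplified:
     * assumes delta1's ops tile its output exactly. */
    static size_t resolve_range(const delta_op *d1, size_t n1,
                                size_t off, size_t len, delta_op *out)
    {
        size_t pos = 0, n_out = 0;
        for (size_t i = 0; i < n1 && len > 0; i++) {
            size_t end = pos + d1[i].len;
            if (off < end) { /* this op overlaps the wanted range */
                size_t skip = off - pos;
                size_t take = (end - off < len) ? end - off : len;
                out[n_out++] = (delta_op){ d1[i].kind, d1[i].offset + skip, take };
                off += take;
                len -= take;
            }
            pos = end;
        }
        return n_out;
    }

With target-side copies, the bytes a copy references can depend on ops emitted earlier in the same delta, so you would have to materialize the intermediate output instead of just slicing instruction lists.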