It's not entirely clear to me what the selling point is. "Better than bzip2" isn't exactly a convincing sales pitch given bzip2 is mostly of historic interest these days.
Right now the modern compression field is basically covered by xz (if you mostly care about best compression ratio) and zstd (if you want decent compression and very good speed), so when someone pitches a new compressor, they should tell me where it stands compared to those.
wget http://corpus.canterbury.ac.nz/resources/calgary.tar.gz
zcat calgary.tar.gz|time zstd -19|wc -c
902963 (=902.9KB, vs. 807.9KB for bzip3)
> "Better than bzip2" isn't exactly a convincing sales pitch

Sure, but nobody is pitching that. TFA does compare to lzma (~xz), and claims bzip3 outperforms it quite handsomely in speed, while being competitive in compression ratio.
Original file: 129MB xz, 1.2G uncompressed.
"zstd -T0": 1.34 seconds, 189M
"xz -T0": 63 seconds, 131M
"xz -T0 -9": 183 seconds, 125M
"bzip3 -e -j 6": 21 seconds, 129M (edited, was SIGSEGV)
"bzip3 -e": 84 seconds, 129M
I used the Linux source because the project website uses it and recommends bzip3 for compressing source code and text. Results were on Ubuntu 22.04, Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz.
For compressing logs, for instance, a decompression speed of 23 MB/s per core is simply too slow when you need to grep through gigabytes of data. The same goes for data analysis: you don't want your input speed to be this limited when analysing gigabytes of data.
I am not sure how I feel about you "stealing" the bzip name. While the author of bzip2 doesn't seem to plan to release a follow-up, I feel it is bad manners to take over a name like this.
I think it boils down to the feelings of the author (of the previous format).
I don't think PKWARE feels bad because ZSTD is a homage to ZIP. Similarly if someone created a follow-up file format to something I've designed, I'd just want credit or a link to my version as a homage and pointer for history continuity, nothing else.
Open source software is designed to be mangled, modified, shared and leapfrogged. If a completely different implementation advertises itself as a newer iteration of a format because it's built on the same theory, I think it's ethical as long as the developer isn't trying to capitalize on the name. In either case, if the original developer returns to the game, they can create a BZIP4, point to the diversion as "hey, somebody liked BZIP2 too much and created this, give him/her kudos", and continue.
The original author replaced their own bzip release with bzip2 to avoid a patent issue with arithmetic coding, so this is the first time a third party has done so: https://web.archive.org/web/19980704181204/http://www.muraro...
So if the release of bzip3 is approved by the current maintainer, then I guess it’s fine, but otherwise it makes me uncomfortable to consider using under this name.
I agree in spirit, but I can also see why someone might want their source to be free and mangleable, but still care about trademarks.
(Just imagine Linus Torvalds getting lots of emails with support requests for a hypothetical Linux2 operating system that I wrote, and that he has no relation with. That could become pretty annoying; even if he doesn't mind me taking his source code.)
Is zstd actually an homage to zip?
I'm not saying that it definitely isn't, but the only connection I know of myself is that they both begin with the letter Z, and the letter Z has a long association with data compression that goes back before the zip format / pkzip program.
The LZ77, LZ78[1], and LZW[2] algorithms all predate the zip format. As do two very old, obsolete Unix compression programs: "pack"[3], which uses a ".z" suffix for compressed files, and "compress"[4], which uses a ".Z" suffix for compressed files.
In those algorithm names, the L and Z stand for Lempel and Ziv, respectively. But interestingly, the Unix "pack" program uses a ".z" suffix even though its algorithm is just Huffman (not one of the Lempel-Ziv family of algorithms), so the letter Z somehow came to signify data compression more generally.
Rough timeline of letter Z in data compression:
1977: LZ77
1978: LZ78
1982 or earlier: Unix "pack" (.z)
1984: LZW
1985: Unix "compress" (.Z)
1989: PKZIP
---
[1] https://en.wikipedia.org/wiki/LZ77_and_LZ78
[2] https://en.wikipedia.org/wiki/Lempel-Ziv-Welch
$ sudo journalctl -r | pv -a > /dev/null
[22.8MiB/s]
$ dd if=/dev/urandom of=test bs=1G count=1 iflag=fullblock
$ gzip -k test
$ zcat test.gz | pv -a >/dev/null
[ 228MiB/s]
$ sudo journalctl -r | pv -a >/dev/null
[13.1MiB/s]
UPDATE: Gzip with more real-world data[1]:
$ gzip -k adventures-of-huckleberry-finn.txt
$ zcat adventures-of-huckleberry-finn.txt.gz | pv -a >/dev/null
[ 151MiB/s]
[1]: <https://gutenberg.org/files/76/76-0.txt>

So lz4 has some base advantages in terms of speed, which zstd is unlikely to match. But as you point out it is only relevant for very high-speed operations.
It hasn't been used lately because of the computational overhead, but it's interesting and I'm glad that there's still work in this area. For anyone interested in algorithms it's a great one to wrap your head around.
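For anyone wanting to wrap their head around it, the core transform fits in a few lines. A toy sketch using the naive sorted-rotations construction (real implementations use suffix-array algorithms like SA-IS instead; the `\x00` sentinel is an assumption that the input doesn't contain that byte):

```python
# Toy Burrows-Wheeler transform via sorted rotations -- O(n^2 log n),
# fine for illustration; real tools build it from a suffix array (SA-IS).
def bwt(s: bytes, sentinel: bytes = b"\x00") -> bytes:
    assert sentinel not in s
    s += sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return bytes(rot[-1] for rot in rotations)  # last column of sorted rotations

def inverse_bwt(last: bytes, sentinel: bytes = b"\x00") -> bytes:
    # Repeatedly prepend the last column and re-sort; after n rounds each
    # row is a full rotation, and the one ending in the sentinel is the input.
    table = [b""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i:i + 1] + table[i] for i in range(len(last)))
    row = next(r for r in table if r.endswith(sentinel))
    return row[:-1]

data = b"banana"
transformed = bwt(data)          # groups similar contexts together
assert inverse_bwt(transformed) == data
```

The point of the transform is visible even at this size: the output clusters identical characters, which is what makes the later move-to-front and entropy-coding stages effective.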
And here a BWT library with benchmarks: https://github.com/IlyaGrebnov/libsais#benchmarks
edit: ah, bzip3 is parallelizable, while bzip2 isn't. That alone is enough for me to be able to claim 'faster'.
Right now I'm almost exclusively using zstd (general stuff) or lzma2/xz (high compression where read speed doesn't matter). And of course gz and zip for data interchange where compatibility is key. From the information presented bzip3 won't replace any of those use cases for me, but that's fine. Maybe it fits somebody else's use case, or maybe it's the foundation for the next great algorithm that we all end up using.
% wc -c linux.tar.zst linux.bz3
134980904 linux.tar.zst
129255792 linux.bz3
2. Bzip2 is somewhat of a standard
3. Zstandard is not a substitute for Bzip2
When I evaluated various compression algorithms a few years ago zstd came ahead of bzip2 in every metric.
[1]: https://github.com/dsnet/compress/blob/master/doc/bzip2-form...
> Every compression of a file implies an assumption that the compressed file can be decompressed to reproduce the original. Great efforts in design, coding and testing have been made to ensure that this program works correctly.
> However, the complexity of the algorithms, and, in particular, the presence of various special cases in the code which occur with very low but non-zero probability make it impossible to rule out the possibility of bugs remaining in the program.
That got me thinking: I've always implicitly assumed that authors of lossless compression algorithms write mathematical proofs that D o C = id[1]. However, now that I've started looking, I can't seem to find that even for Deflate. What is the norm?
[1]: C being the compression function, D being the decompression function, and o being function composition.
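From what I can find, the norm is exactly what that quote implies: extensive round-trip testing and fuzzing rather than a formal proof that D ∘ C = id. A minimal sketch of such a round-trip check, with zlib standing in for whatever codec is under test (real projects drive this with fuzzers like AFL rather than a fixed random sample):

```python
# Round-trip property check: decompress(compress(x)) == x over many inputs.
import random
import zlib

def round_trip_ok(data: bytes) -> bool:
    # level=9 is arbitrary; a thorough harness would sweep all levels.
    return zlib.decompress(zlib.compress(data, level=9)) == data

rng = random.Random(42)  # seeded so failures are reproducible
for _ in range(1000):
    n = rng.randrange(0, 4096)
    sample = bytes(rng.randrange(256) for _ in range(n))
    assert round_trip_ok(sample)
print("all round trips OK")
```

This only ever demonstrates absence of observed failures, which is precisely the gap the bzip2 disclaimer is pointing at.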
I was also confused with faster speed claims than bzip2, and then saw the discussion in the issue: https://github.com/kspalaiologos/bzip3/issues/2
I don't understand how bzip3 gets to claim "A better, faster and stronger spiritual successor to BZip2." when even all its own benchmarks show it's slower than bzip2?
What bzip3 aims to be is a replacement for bzip2 on modern hardware. What used to not be viable decades ago (arithmetic coding, context mixing, SAIS algorithms for BWT construction) became viable nowadays, as CPU frequencies don't tend to change, while cache and RAM keep getting bigger and faster.
It should be noted that bzip3 uses 16 times larger block sizes than bzip2 while providing compression ratios up to 10%-50% better, at a cost of (as empirically shown) 17 seconds per 1.3 GB of data. That's a pretty good trade-off, and if bzip2 wanted to get anywhere close (e.g. by using the C API to tweak the block size), it'd have to sacrifice a lot of its performance.
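The block-size point is easy to see with Python's stdlib bz2 bindings, where `compresslevel` is exactly bzip2's block-size multiplier (level × 100 kB): redundancy that spans block boundaries is invisible to the per-block BWT. The exact byte counts below are illustrative, not a claim about bzip3:

```python
# bzip2's "compression level" is really its block size (level x 100 kB).
# A 200 kB random chunk repeated twice only compresses if both copies
# land in the same block, so small blocks miss long-range redundancy.
import bz2
import random

rng = random.Random(0)
chunk = bytes(rng.randrange(256) for _ in range(200_000))
data = chunk * 2  # 400 kB containing one long-range repeat

small = len(bz2.compress(data, compresslevel=1))  # 100 kB blocks
large = len(bz2.compress(data, compresslevel=9))  # 900 kB blocks
print(f"100 kB blocks: {small} bytes; 900 kB blocks: {large} bytes")
```

With 100 kB blocks each block is effectively random and incompressible; with 900 kB blocks the whole repeat fits in one BWT block and the output shrinks substantially. Larger blocks, as in bzip3, extend the same effect further.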
If I'm reading the benchmarks correctly, it gets higher compression but is slower and has higher memory usage. Thus I cannot call it better.
>spiritual successor to BZip2
What does that mean? If it isn't related to bzip2, why choose this name?
Has anyone tried doing zstd at the end instead of LZ77 and entropy coding?
Does the idea even make sense? (I'm a layman)
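The idea makes sense in principle: the stages are modular, BWT and move-to-front are reversible transforms, and any general-purpose compressor can serve as the back end. A toy sketch of the front end feeding a generic back end (zlib standing in for zstd, since zstd isn't in the Python stdlib; the naive BWT and the `\x00` sentinel are illustration-only assumptions):

```python
# Sketch: BWT + move-to-front front end, then a general-purpose back end.
import zlib

def bwt(s: bytes) -> bytes:
    s += b"\x00"  # sentinel, assumed absent from the input
    return bytes(row[-1] for row in sorted(s[i:] + s[:i] for i in range(len(s))))

def mtf(s: bytes) -> bytes:
    # Move-to-front: recently seen bytes get small indices, so BWT's
    # clustered output becomes runs of small values.
    alphabet = list(range(256))
    out = bytearray()
    for b in s:
        i = alphabet.index(b)
        out.append(i)
        alphabet.insert(0, alphabet.pop(i))
    return bytes(out)

data = b"the quick brown fox " * 200
plain = len(zlib.compress(data, 9))            # back end alone
fronted = len(zlib.compress(mtf(bwt(data)), 9))  # BWT+MTF, then back end
print(f"zlib alone: {plain} bytes; BWT+MTF then zlib: {fronted} bytes")
```

Whether a zstd back end would beat a dedicated entropy coder on BWT output is an empirical question; zstd's LZ77 stage is largely wasted there, since the BWT has already removed most of the long-range structure.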
On my own benchmarks, it's basically comparable size (about 0.1% smaller than bsc), comparable encode speeds, and about half the decode speed. Plus bsc has better multi-threading capability when dealing with large blocks.
Also see https://quixdb.github.io/squash-benchmark/unstable/ (and without /unstable for more system types) for various charts. No bzip3 there yet though.
time ./bsc e ../linux.tar linux.bsc -e2 -b16 -T
68.69s user 1.14s system 99% cpu 117M memory 1:09.84 total
While bzip3 uses 98M, takes 1min 17s to produce a 129023171-byte file, compared to 127747834 B from BSC. They're very similar, except bzip3 tends to use less memory and decompresses a little slower. BSC is much more mature than bzip3, though, and the benchmarks might be subject to change some time in the future. Surprisingly, BSC's code isn't really that robust (I reported a UB bug to libsais and had to pretty much rework the LZP code because it couldn't stand fuzzing).

I could have done more, but it somewhat vindicated what I was saying, really. It has a very similar core to bsc (based on the same code) and gives very similar file sizes, as expected. Note you may wish to use bsc -tT to disable both forms of threading. I don't know if that changes memory usage any.
Have you tried making PRs back to the libbsc GitHub to fix the UB and fuzzing issues? I'm sure the author would welcome fixes given you've already done the legwork.
Anyway, please do consider benchmarking against libbsc. It's conspicuously absent given the shared ancestry.
https://quixdb.github.io/squash-benchmark/
I note that the "Calgary Corpus" that bzip3 prominently advertises is obsolete, dating back to the late 80s:
Anyone know what the current SOTA GPU-based algorithms are, and why they haven't taken off?
Brotli has gotten browser support, so it seems to my naive self that a GPU-based algorithm is just waiting to take over.
...so long as this lives in NSA/Microsoft Github, it's not a 'spiritual successor' to anything.