My impression is that this article has a lot of technical insight into how bzip compares to gzip, but it fails to actually account for the real cause of bzip's diminished popularity in favor of the non-gzip alternatives that it admits are the more popular choices in recent years.
https://insanity.industries/post/pareto-optimal-compression/
(Omissions theirs.)
Wasn't that zstandard's stated goal? It's not very surprising that it has this property, especially considering it's much newer (2015) than the established tools like gzip (1992), bzip2 (1996), and LZMA as used by xz utils (1999).
Edit: the initial commit (https://github.com/facebook/zstd/blob/4856a00164c1d7b947bd38...) indeed states it's meant to have good ratios and good (de)compression speeds compared to other tools, without sacrificing one for the other (»"Standard" translates into everyday situations which neither look for highest possible ratio (which LZMA and ZPAQ cover) nor extreme speeds (which LZ4 covers).«). So it was Pareto by design, just not under that name.
uncompressed: 327005
(gzip) zopfli --i100: 75882
zstd -22 --long --ultra: 69018
xz -9: 67940
brotli -Z: 67859
lzip -9: 67651
bzip2 -9: 63727
bzip3: 61067
> bzip might be suboptimal as a general-purpose compression format, but it’s great for text and code. One might even say the b in bzip stands for “best”.
I've just checked again with a 1GB SQL file. `bzip2 -9` shrinks it to 83MB. `zstd -19 --long` to 52MB.
Others have compressed the Linux kernel and found that bzip2's is about 15% larger than zstd's.
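If you want to sanity-check this kind of comparison yourself, here's a minimal sketch using only Python's standard library (bz2, and lzma for the xz format); zstd isn't in the stdlib, so that column would need the third-party zstandard package or the CLI. The file name is just a placeholder.

    import bz2
    import lzma

    path = "dump.sql"  # placeholder: any large-ish text file you have around
    data = open(path, "rb").read()

    print("original :", len(data))
    print("bzip2 -9 :", len(bz2.compress(data, compresslevel=9)))
    print("xz -9    :", len(lzma.compress(data, preset=9)))

For truly large files you'd want to stream in chunks rather than read everything into memory, but for a rough ratio comparison this is enough.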
It wouldn't surprise me at all that "more modern" compression techniques work better on larger files. It also wouldn't surprise me too much if there was no such thing as a 1GB file when bzip was originally written; according to Wikipedia, bzip2 is almost 30 years old ("Initial release: 18 July 1996"), and there are mentions of the preceding bzip (without the 2), which must have been even earlier than that. In the mid/late 90s I was flying round-the-world trips with a dozen or so 380 or 500 MB hard drives in my luggage to screw into our colo boxen in Singapore, London, and San Francisco (because our office only had 56k ADSL internet).
For instance, "lrzip -b", which uses bzip2 for compression, typically achieves much higher compression ratios on big files than using either xz or zstd alone. Of course, you can also use lrzip with xz or zstd, with various parameters, but among the many existing possibilities you must find an optimum compromise between compression ratio and compression/decompression times.
I compressed kernel 6.19.8 with zstd -19 --long and bzip3 (default settings). The latter compressed better and was about 8x faster.
It was long surpassed by lzma and zstd.
But back in roughly the 00s, it was the best standard for compression, because the competition was DEFLATE/gzip.
Consider "bananarama":
"abananaram"
"amabananar"
"ananaramab"
"anaramaban"
"aramabanan"
"bananarama"
"mabananara"
"nanaramaba"
"naramabana"
"ramabanana"
The last symbol on each line gets context from the first symbols of the same line; that's what the rotation buys you. But, due to the sorting, the contexts are not contiguous for the (last) character being predicted, and long-range dependencies are broken. Because those long dependencies are broken, MTF, which implicitly transforms raw symbol statistics into something like Zipfian [1] statistics, encodes BWT's output well.
[1] https://en.wikipedia.org/wiki/Zipf%27s_law
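For anyone who wants to see it on the same string, here's a rough sketch of the naive BWT (last column of the sorted rotations, exactly as in the table above) followed by MTF. A real implementation would also need a sentinel or the index of the original rotation to make the transform invertible; this is just for illustration.

    def bwt_last_column(s: str) -> str:
        # Sort all rotations and read off the last column, as in the
        # bananarama table above (no sentinel, so not invertible as written).
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(r[-1] for r in rotations)

    def mtf(s: str) -> list[int]:
        # Move-to-front: runs of repeated symbols become zeros and small
        # indices, i.e. roughly Zipfian output for the entropy coder.
        alphabet = sorted(set(s))
        out = []
        for ch in s:
            i = alphabet.index(ch)
            out.append(i)
            alphabet.insert(0, alphabet.pop(i))
        return out

    print(bwt_last_column("bananarama"))        # 'mrbnnaaaaa' -- note the run of a's
    print(mtf(bwt_last_column("bananarama")))   # [2, 4, 3, 4, 0, 4, 0, 0, 0, 0]

The run of a's in the transformed string becomes a tail of zeros after MTF, which is what makes it cheap to entropy-code.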
Given that, the author may find PPM*-based compressors to perform better compression-wise. The Large Text Compression Benchmark [2] tells us exactly that: some "durilka-bububu" compressor that uses PPM fares better than the BWT-based ones, by almost a third.
Also making good progress on getting a slimmer version of zstd into the stdlib and improving the stdlib deflate.
Awesome! Please let me know if there is anything I can do to help
I tried looking it up myself, but it isn't stated in the README or the doc/ folder, there is no mention of any of the bzip2 authors, and there is no website listed, so I presume this GitHub page is canonical.
Also, the name is the algorithm. Bzip2 has versions and bzip3 is something else which has its own updated versions. Programs that implement a single algorithm often follow this pattern.
https://man.archlinux.org/man/pbzip2.1.en
And zstd has been multi-threaded from the beginning.
This depends on the setting. At setting -19 (not even using --long or other tuning), Zstd is 10x slower to compress than bzip2, and 20x slower than xz, and it still gets a worse compression ratio for anything that vaguely looks like text!
But I agree if you look at the decompression side of things: bzip2 and xz are just no competition for zstd or the gzip family (but then gzip and friends have worse ratios again, so we're left with zstd). Overall I agree with your point ("just use zstd"), but not because of fast compression speed, at least if you care somewhat about ratios.
In my own testing of compressing internal generic json blobs, I found brotli a clear winner when comparing space and time.
If I want higher compatibility and fast speeds, I'd probably just reach for gzip.
zstd is good for many use cases, too, perhaps even most...but I think just telling everyone to always use it isn't necessarily the best advice.
It’s slower and compresses less than zstd. gzip should only be reached for as a compatibility option; that’s the only place it wins: it’s everywhere.
EDIT: If you must use it, use the modern implementation, https://www.zlib.net/pigz/
Size and decompression speed are the main limitations.
Consider that you could hand-code an algorithm to recognize cats in images, but we would rather let the machine just figure it out for itself. We're kind of averse to manual work and complexity where we can brute-force or heuristic our way out of the problem. For the 80% of situations where piping it into zstd keeps you within budget (bandwidth, storage, CPU time, whatever your constraint is), it's not really worth putting in about 5000% more effort to squeeze out three times the speed and a third less size.
It really is considerably better, but I wonder how many people will do it, which means less implicit marketing by seeing it everywhere like we do the other tools, which means even fewer people will know to do it, etc.
Does Gmail use a special codec for storing emails?
Yes, there are better compression options today.
I suggest implementing Scott's Bijective Burrows-Wheeler variant on bits rather than bytes, and do bijective run-length encoding of the resulting string. It's not exactly on the "pareto frontier", but it's fun!
I ran a bunch of benchmarks, and found that the only thing that mattered was if a particular tool or format supported parallel compression and/or parallel decompression. Nothing else was even close as a relevant factor.
If you're developing software for processing even potentially large files and you're using a format that is inherently serial, you've made a mistake. You're wasting 99.5% of a modern server's capacity, and soon that'll be 99.9%.
It really, really doesn't matter if one format is 5% faster or 5% bigger or whatever if you're throwing away a factor of 200 to 1,000 speedup that could be achieved through parallelism! Or conversely, the ability to throw up to 1,000x the compute at improving the compression ratio in the available wall clock time.
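To make the point concrete, here's a rough sketch of block-parallel compression in Python, which is essentially the idea behind tools like pbzip2 and pigz: split the input into independent chunks and compress them on every core. The chunk size and the choice of bz2 are arbitrary here, and splitting into blocks costs some ratio because cross-block redundancy is lost; a real tool would also frame the blocks so they can be decompressed as one stream.

    import bz2
    from concurrent.futures import ProcessPoolExecutor

    CHUNK = 8 * 1024 * 1024  # 8 MiB blocks, each compressed independently

    def compress_block(block: bytes) -> bytes:
        return bz2.compress(block, 9)

    def compress_parallel(data: bytes) -> list[bytes]:
        blocks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
        with ProcessPoolExecutor() as pool:  # one worker process per core
            return list(pool.map(compress_block, blocks))

    if __name__ == "__main__":
        payload = b"some repetitive payload " * 4_000_000  # ~96 MB of test data
        compressed = compress_parallel(payload)
        print(len(payload), "->", sum(len(b) for b in compressed))

The same chunk-and-fan-out structure works for decompression, checksumming, or encryption, which is why block-oriented formats parallelize so well.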
Checksumming, compression, decompression, encryption, and decryption over bulk data must all switch to fully parallel codecs, now. Not next year, next decade, or the year after we have 1,000+ core virtual machines in the cloud available on a whim. Oh wait, we do already: https://learn.microsoft.com/en-us/azure/virtual-machines/siz...
Not to mention: 200-400 Gbps NICs are standard now in all HPC VMs and starting to become commonplace in ordinary "database optimised" cloud virtual machines. Similarly, local and remote SSD-backed volumes that can read and write at speeds north of 10 GB/s.
There are very few (if any) compression algorithms that can keep up with a single TCP/IP stream on a data centre NIC, or a single file read from a modern SSD, without using parallelism. Similarly, most CPUs struggle to perform even just SHA or AES at those speeds on a single core.