Why should it work different? Text files are just regular files where the bytes only use a specific subset of the possible values they could have.
If you do have some knowledge about the data you are encoding, you can be a little bit smarter about it: e.g. for the text section of an executable, you might work on individual instructions instead of bytes, maybe use common, prepared Huffman Trees, so you don't have to encode the tree itself.
On a side note: IIRC the Intel Management Engine does that using a proprietary Huffman Tree, backed into the hardware itself, as an obfuscation technique[1].
To circle back to your question: PNG simply feeds the pixel data into the zlib deflate() function as-is.
[1] https://en.wikipedia.org/wiki/Intel_Management_Engine#Design
And it actually does often work differently for PNG! PNG has a handful of preprocessing options for the pixels. So in filter mode 2, for example, deflate is encoding the difference between each pixel and the pixel above it. More or less.
Some encoders just use a precomputed Huffman table, but you can make an optimized one as discussed here[3].
Huffman is not the most optimal compression that can be used, for example Dropbox's Lepton[4] saves an additional 20% by replacing the Huffman stage with something better.
However, since it's a lossless stage this can be done transparently, which is nice.
[1]: http://www.robertstocker.co.uk/jpeg/jpeg_new_11.htm
[2]: https://www.impulseadventure.com/photo/jpeg-huffman-coding.h...
[3]: https://www.impulseadventure.com/photo/optimized-jpeg.html
[4]: https://dropbox.tech/infrastructure/lepton-image-compression...
Anything to save a few bits...
It is generally used as the final stage of some other compression algorithm, and operates on symbols generated by that algorithm. Often, this is some variation on LZ77, and the symbols are something like "bytes 0-255" in addition to various symbols that denote a match in previous data of some length and at some offset.
> The advantage of a canonical Huffman tree is that it can be encoded in fewer bits than an arbitrary tree.
To take the example from the Wikipedia article you linked, canonicalizing
B = 0
A = 11
C = 101
D = 100
into B = 0
A = 10
C = 110
D = 111
is conceptually the same as canonicalizing the tree (treating "0" as "left" and "1" as "right"): ┌------┐
┌-----┤(root)├-----┐
│ └------┘ │
┌-┴-┐ ┌-┴-┐
│ B │ ┌-┤ ├-┐
└---┘ │ └---┘ │
┌-┴-┐ ┌-┴-┐
┌-┤ ├-┐ │ A │
│ └---┘ │ └---┘
┌-┴-┐ ┌-┴-┐
│ D │ │ C │
└---┘ └---┘
into ┌------┐
┌-----┤(root)├-----┐
│ └------┘ │
┌-┴-┐ ┌-┴-┐
│ B │ ┌-┤ ├-┐
└---┘ │ └---┘ │
┌-┴-┐ ┌-┴-┐
│ A │ ┌-┤ ├-┐
└---┘ │ └---┘ │
┌-┴-┐ ┌-┴-┐
│ C │ │ D │
└---┘ └---┘
So the trees aren't merely a "distraction" IMO: apart from being useful conceptually (e.g. in proving the optimality of the Huffman coding—this is how Knuth does it in TAOCP Vol 1, 2.3.4.5), certain applications of Huffman's algorithm (other than compression) also have the tree structure naturally arise (Knuth gives the example of choosing the optimal order for pairwise merging of N sorted lists of different lengths).Sure, after using trees for understanding, you don't need to actually represent a tree in your data structures / source code / encoding, but that's another matter.
[1] https://cbloomrants.blogspot.com/2010/08/08-12-10-lost-huffm...
The former was confusing at first but mind-blowing when it clicked for me, and the latter looks awesome but the implementation is incomprehensible to me —I guess I'm stuck allocating trees until I can figure it out!
If we have three symbols, A, B, and C, and let's say we assign bit lengths of A=1, B=2, C=2, meaning A is the most probable then we count:
A = 0
B = 10
C = 11
For code lengths A=1, B=2, C=4, D=4, E=4, then we have: A = 0
B = 10
C = 1100
D = 1101
E = 1110
Note that all we need to send is the bit lengths (1, 2, 4, 4, 4) to the other side, and we have an algorithm to assign the bits. Though, even (1, 2, 4, 4, 4) is actually too much information, we just need to send the number of symbols for each given length: (1, 1, 0, 3)Much faster, smaller, and generally better than sending over a whole tree.
...but I believe DEFLATE goes the exact opposite way, for reasons unknown.
As always, ryg has great posts on this stuff: https://fgiesen.wordpress.com/2018/02/19/reading-bits-in-far...
I think it still suffices as an example because it would be easy to infer how this scales up to an entire book. Finer details are left out, but is that the sort of detail that should be present in a very short introductory article?
My intention was to pick an example that produced a small tree with only a few leaf nodes (so that the diagram was easy to follow), but still contained some duplication.
My hope was that somebody new to the concept could then infer the results for larger inputs.
I did not intend to imply that this would be a valid use case for building a Huffman tree in practice.
Spent several hours talking various topics, but one of his favorite areas of exploration when I was around campus was paper folding.
A couple of links with examples:
- https://www.cise.ufl.edu/~manuel/huffman/index.html
- https://collections.mitmuseum.org/collection/david-a-huffman...
For example, a "canonical Huffman code" can save you space encoding the table through some clever trickery. In short, you can encode the table by counting the number of bytes that use each bit-count, and the bytes used in the file. You don't need to store the patterns at all, since in the canonical algorithm the patterns are regenerated from the bit-lengths. [1]
Right now I'm trying to implement the package-merge algorithm,[2] which will allow me to create the table without building an inefficient fully-allocated tree, and more importantly will allow me to limit the lengths of my codes (the maximum code length is n-1, where `n` is the length of your alphabet. Working with bytes and using 255 bit codes is obnoxious). Unfortunately all explanations of the algorithm I've found are very academic and mathematical, so I'm having trouble working it out.
Some of you might be interested in the short video Tom Scott made about Huffman coding.[3]
1. https://en.wikipedia.org/wiki/Canonical_Huffman_code#Encodin...
It sort of goes to show that while Huffman codes have been around for ages, implementations can still improve, especially as the hardware we use changes.
What's the reasoning of leaving out the tree structure needed for decoding in this argument?
It captivated my interest immediately - it was such a simple and effective approach, and it demystified how compression algorithms worked.
I really like this overview. It's not meant to be a full discourse, more just an intro for newcomers. And I think it gets the basic idea across very effectively.
Probably the first non-trivial data structure/algorithm I had implemented (previously I was making games, where you need no algorithm more complex than rectangle intersection)