Building a data compression utility in Haskell using Huffman codes (opens in new tab)

(lazamar.github.io)

232 pointslazamar1y ago88 comments

88 comments

65 comments · 15 top-level

ykonstant1y ago· 14 in thread

Hey, since this is likely to attract Haskell programmers: how fast is Haskell these days for a programmer intent on writing optimized code? I am particularly interested in its performance for numerical crunching like matrix operations and other stuff that benefit from SIMD.

lazamarOP1y ago

Haskell's speed can be competitive with systems languages but keep in mind that its killer feature is ease of abstraction.

The idea is that it is simple to assemble multiple parts into a coherent, well organised program. Which is important for the entirety of the program, no just the tight loop.

So, with the nice FFI Haskell has, you can always drop down to languages without a GC for inherently imperative optimisations. Then you wrap that into a library with nice types and you can now leverage that raw power anywhere in your Haskell code where the types will match.

I worked at Meta in a high performance Haskell application and that's what we did. Wrote beautiful, large, fast Haskell programs which in some specialised parts had C++ building blocks. 99% of the time was spent on Haskell land composing things into more and more useful applications.

mrkeen1y ago

I like Haskell performance for every-day backend/web and CLI stuff. But I drop down into Rust when I'm writing something performance-focused.

That said, Haskell's no slouch. Here's a small program to count the 1-bits in a file.

  main :: IO ()
  main = do
    content <- getArgs >>= \[a] -> unsafeMMapVector a Nothing
    print (vectorPopCount content)

  vectorPopCount :: V.Vector Word64 -> Int
  vectorPopCount = V.foldl' (+) 0 . V.map popCount

When you compile with -msse4.2, it will correctly use the hardware popcount instruction, and crunches through a 1GB input file in 0m0,090s. Rounding to the nearest MB, it uses 0 heap. (For the curious, if I compile without -msse4.2 it runs in 0m0,293s).

I haven't tried crunching matrices, but I would start by checking out repa, accelerate, or massiv.

  https://hackage.haskell.org/package/repa
  https://hackage.haskell.org/package/accelerate
  https://hackage.haskell.org/package/massiv

ykonstant1y ago

The lack of heap allocations is great! Thanks for the pointers.

Iceland_jack1y ago

I met Sam Derbyshire at ZuriHac who told me all the difficult architectural work had been done for SIMD support.

+ https://gitlab.haskell.org/ghc/ghc/-/issues/7741

It might make it for GHC 9.12 (for 128 bit vectors only, and mostly floating-point operations unless other people come in and contribute).

The patch is at:

+ https://gitlab.haskell.org/ghc/ghc/-/merge_requests/12860

ykonstant1y ago

Thanks for the info!

wyager1y ago

The reality is that for any language, including C, compiler optimized code will never be as fast as hand optimized code in libraries like BLAS. So at some level, the choice of host language doesn't matter very much, because you're going to be outsourcing all of the computation anyway if you're really serious about speed. This is the same reason all the AI stuff, possibly the single largest consumer of compute in the world, gets away with being written in python except for the low level compute libraries.

To answer your question directly, The GHC compiler is very good. High level code will perform very well, and for most realistic applications, performance bottlenecks are architectural, not e.g. the use of single width versus SIMD, and the "architectural asymptotics" of haskell are very favorable. I think GHC has/is getting SIMD support but that's not what I would focus on when evaluating perf.

I wouldn't write a matrix multiplication algorithm in Haskell, but I also wouldn't write one in rust or C if I was serious about speed.

Many focus on number crunching as a performance metric, but almost no one is ever actually bottlenecked on that, and if they are it doesn't really matter what high level language they're using.

GrantMoyer1y ago

> The reality is that for any language, including C, compiler optimized code will never be as fast as hand optimized code

That's not strictly true; sometimes a C compiler can optimize away my whole program: https://godbolt.org/z/oG5nfGE6z

1 more reply

tkz13121y ago

Haskell really shines when you want to write high level, declarative code. Performance when using this style is generally fine for CLI / web backend style stuff. It has the tools to write pretty fast low level code, but they’re fairly clunky and if that’s all you want to write it’s probably not going to be the best tool for the job. They’re pretty nice if you have a few focused hotspots you need to optimize.

It has pretty nice CPU profiling tools, so finding and optimizing CPU hotspots is fairly pleasant. Tracking down rouge memory leaks (which lazy evaluation makes more likely) on the other hand can be extremely frustrating.

If you look at the benchmarks game results [1], the fastest haskell implementations are generally between 2 and 5 times slower than the fastest c versions, and will be written in a highly imperative style.

[1]: https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

ReleaseCandidat1y ago

It's doable but harder than in imperative languages and the resulting code is _really_ ugly. See the thread of the 1BRC https://discourse.haskell.org/t/one-billion-row-challenge-in... with example code at (I hope that's the right Gist) https://gist.github.com/AndrasKovacs/e156ae66b8c28b1b84abe6b...

CyberDildonics1y ago

If you want to write fast stuff that takes advantage of SIMD, you want to use ISPC, which is made for high performance SIMD. Using haskell for this would be like doing surgery with a dull rock, you will never get what you want.

epgui1y ago

I’m not the best person to answer this question, but AFAIK it’s very very fast (in the rough vicinity of C). But also memory-hungry.

freilanzer1y ago

I'm pretty sure highly optimised code won't be elegant Haskell code though.

2 more replies

wredue1y ago

If you’re writing idiomatic Haskell. Its performance is terrible.

If you’re choosing to fight with Haskell, why? Just use something else.

To understand why people claim Haskell is “fast”, you need to understand what they mean. What they mean is “if you opted to write C in such a way as you were performing similar amounts of copious and useless data copying, pointer following and stack blowing madness, Haskell will perform that fast”. They are not saying “Haskell is as fast as the fastest idiomatic C implementation”.

Another thing you’re going to see a lot of is extremely simple anecdotes, such as counting in a loop, or favourable measure points (they will measure the whole c program, but just the point in Haskell after they’ve flattened data, for example, stating “we just want to compare those parts”).

epgui1y ago

I think if your runtime performance is terrible with Haskell, either the explanation is that you might be doing something a bit crazy, or that your application is memory-constrained.

1 more reply

londons_explore1y ago· 10 in thread

For all readers, arithmetic codes are better in nearly all ways. They can be implemented in less RAM and code, they compress and decompress to a better ratio, and the probabilities of different symbols appearing can be dynamically updated during the stream far more easily.

The only reason Huffman codes are used is they were invented first and arithmetic codes were patented. That patent has now expired, so we should use the better design.

lifthrasiir1y ago

If you do have an option to switch from Huffman, rANS is now the way to go, not a clasical arithmetic coding.

londons_explore1y ago

Two slight benefits of Huffman codes over arithmetic:

* They usually self synchronize when some data is corrupted (but not guaranteed, does not apply where the Huffman table is dynamic)

* Neither Huffman nor arithmetic codes are easy to parallelize the decoding of, but Huffman is slightly easier.

kqr1y ago

I was under the impression that arithmetic codes are guaranteed to be at least one bit less efficient than Huffman codes per input block. What makes you say they have better compression ratio?

Are you thinking of pre-defined Huffman tables that aren't adapted to the input? Because the latter ought to be as good as it gets.

(I agree with the other benefits. Since arithmetic coding tables are built in a streaming fashion rather than constructing the codebook up front, they are more memory-efficient while working.)

lifthrasiir1y ago

Huffman codes are conceptually isomorphic to arithmetic codes where all probabilities are 2^-k with k integer, so they have an obvious disadvantage due to more inaccurate symbol distribution.

1 more reply

hcs1y ago

Huffman codes are less efficient per symbol since each symbol is a bit string, arithmetic coding effectively smears symbols across bits more finely. Whether you use a dynamic or static probability model is a different issue applying to either coding method. (Emotionally though I prefer Huffman codes, they're just so neat)

lazamarOP1y ago

There is one way in which Huffman codes are better: they are easier to explain and simpler to implement.

I went for simplicity of exposition in the post, but arithmetic coders can indeed get arbitrarily close to the entropy, which is not quite the case with Huffman.

nottorp1y ago

> easier to explain

I think Huffman is the one compression algorithm that compresses stuff significantly that can fit on the proverbial napkin, so it's a good start.

The others require the whole napkin stack at the table.

userbinator1y ago

LZ is even better. Neither arithmetic nor Huffman will compress when the probability of all symbols is the same, but LZ will find repetitions easily. LZ also decompresses extremely quickly --- faster than memcpy is often mentioned.

d_burfoot1y ago

> LZ is even better. Neither arithmetic nor Huffman will compress when the probability of all symbols is the same

Comparing LZ to arithmetic encoding is a category error. LZ and Huffman are combined modeling+encoding methods, while arithmetic is just an encoding method, and it can be combined with any modeling technique. Arithmetic plus a suitable modeling technique will achieve compression as good as LZ, Huffman, or any other scheme. The PAQ8 compressors, and I believe its successors in the Hutter Prize ranking, use arithmetic plus a very advanced modeling scheme.

http://prize.hutter1.net/hfaq.htm#paq8

lazamarOP1y ago

Indeed, but worth noting that LZ is a modelling scheme, whilst Huffman is a coding technique.

That is, LZ determines, dynamically as it goes, what are all the elements we want to encode and their probabilities. Then you need a coder, like Huffman, to actually encode it.

In the post I used a semi-static zero-order byte-based model. Which means I counted the byte occurrences first and just used that count for the probabilities throughout all of the encoding. Then I used Huffman codes to translate those probabilities into bits.

But I'm considering writing a follow-up changing this static model for an LZ77 one as I think that would be fun.

mrkeen1y ago· 6 in thread

There exists an array-based, in-place algorithm for this, reducing the need to allocate trees and chase pointers.

I mention this only because, when I learned the tree-based approach at uni, I simply wasn't aware that there was another way to do it, and I'm wondering how many of you that's true for as well.

While the tree approach is intuitive and illuminating, it probably makes more sense to work with in-place arrays, since the situations when you care most about compression are probably the situations when you have a lot of data and want to run fast.

  In-Place Calculation of Minimum-Redundancy Codes
  Moffat, Katajainen.  1995.
  http://hjemmesider.diku.dk/~jyrki/Paper/WADS95.pdf

lifthrasiir1y ago

> In-Place Calculation of Minimum-Redundancy Codes

Or in general, refer to "On the Implementation of Minimum Redundancy Prefix Codes" by Moffat and Turpin (1997), as strongly recommended and later explained by Charles Bloom [1].

[1] https://cbloomrants.blogspot.com/2010/08/08-12-10-lost-huffm...

lazamarOP1y ago

Thanks for the link. I was motivated to write the post after reading Moffat’s book ‘Managing Gigabytes’. A pearl from the 90’s.

The authors mention this technique in the second edition.

userbinator1y ago

The JPEG standard ITU T.81 (1992) has a description of the algorithm in flowcharts, so the knowledge of array-based Huffman was probably already somewhat common in the 80s.

agentultra1y ago

It’s mentioned at the end and left as an exercise to the reader.

mjan226401y ago

> and I'm wondering how many of you that's true for as well

the phrasing sounds like a list comprehension

agumonkey1y ago

true, tickles my brain in all kinds of funny ways

tromp1y ago· 6 in thread

> To make it unambiguous we must make sure that no code word is a prefix of another code word.

Technically, this is not quite correct. The class of so-called uniquely decodable codes is unambigous, and a superset of the prefix codes. One simple example of a uniquely decodable code is the reverse of a prefix code. For the example in the article that would be

    a 1
    b 00
    c 10

While the code for a is a prefix of the code of c, one can still unambiguously decode any code sequence by processing it in reverse order. It would be interesting to see a uniquely decodable code that is neither a prefix code nor one in reverse.

a13692099931y ago

> It would be interesting to see a [not gratuitously inefficient] uniquely decodable code that is neither a prefix code nor one in reverse.

This can be done by composing a prefix code with a suffix code:

    A   0
    B  01
    C  11
  a A  0
  b BA 010
  c BB 0101
  d BC 0111
  e C  11
  {a=0,b=010,c=0101,d=0111,e=11}

This is trivially uniquely decodable by uniquely decoding 0->A/etc backward, then uniquely decoding A->a/etc foreward. It's equivalent in lengths to the optimal prefix code {a=0,b=110,c=1110,d=1111,e=10} so it's a (one of several) optimal code for the same probability distributions.

And it's neither prefix nor suffix itself, since a=0 and b=010. In fact, it can't in general be decoded incrementally at all, in either direction, since "cee...ee?" vs "bee...ee?" and "?cc...cca" vs "?cc...ccb" both depend on unbounded lookahead to distinguish a single symbol.

I'm not sure the optimality holds for any composition of a in-isolation-optimal prefix code with a in-isolation-optimal suffix code, but it did work for the most trivial cases (other than the degenerate 1-to-1 code) I could come up with.

imurray1y ago

Nicely done; thanks.

imurray1y ago

> It would be interesting to see a uniquely decodable code that is neither a prefix code nor one in reverse.

More interesting than I thought. First the adversarial answer; sure (edit: ah, I see someone else posted exactly the same!):

    a 101
    b 1

But it's a bad code, because we'd always be better with a=1 and b=0.

The Kraft inequality gives the sets of code lengths that can be made uniquely decodable, and we can achieve any of those with Huffman coding. So there's never a reason to use a non-prefix code (assuming we are doing symbol coding, and not swapping to something else like ANS or arithmetic coding).

But hmmmm, I don't know if there exists a uniquely-decodable code with the same set of lengths as an optimal Huffman code that is neither a prefix code nor one in reverse (a suffix code).

If I was going to spend time on it, I'd look at https://en.wikipedia.org/wiki/Sardinas-Patterson_algorithm -- either to brute force a counter-example, or to see if a proof is inspired by how it works.

n4r91y ago

It's a weird example, but what about

  a 1
  b 101

It is neither prefix-free nor suffix-free. Yet every occurrence of 0 corresponds to an occurrence of b.

However, this is obviously inefficient. So I guess the question is whether there's an optimal code which is neither prefix-free nor suffix-free.

--------------

EDIT

I did some googling and found this webpage https://blog.plover.com/CS/udcodes.html where the author gives the following example of a uniquely decodable code:

I guess this is "almost" prefix-free since the only prefix is c of d. If a message starts wiht 1, you could find the first 0 and then look at whether there's an odd or even number of 1's. So I think I can see how it's uniquely decodable. However, my crypto knowledge is too rusty to remember how to show whether this is an optimal code for some probability distribution.

imurray1y ago

That code in the EDIT is suboptimal. It doesn't saturate the Kraft inequality. You could make every codeword two bits and still encode 4 symbols, so that would be strictly better.

1 more reply

lazamarOP1y ago

That’s interesting. I guess this is not usually used because you may have a long string of bits that is ambiguous till you get to a disambiguating bit.

Something like

`100000000000000001`

In this case, where to know whether the first code was an `a` or a `c` you have to read all the way to where the zeroes end.

benreesman1y ago· 5 in thread

Haskell is a really nice language. In general I don’t identify as an X programmer for any value of X: I tend to write in a half dozen languages daily and they all suck in their own special way.

But on two separate occasions I made important career decisions with opportunity cost to work with highly lethal GHC contributors: those people are just really good.

If Haskell sucks like all languages it’s because Haskell excels at using computers to compute something: Haskell considers data shuffling a strictly secondary concern compared to doing actual computations.

random31y ago

How do you distinguish data shuffling from computation? What’s actual computation from this perspective?

mrkeen1y ago

Before I was good at Haskell, I would approach a data-processing job sequentially based on the next thing that needs to be done.

I want to open a file, and I can't read it all at once, so I'll use a FileReader and it should be buffered, so I'll wrap it with a BufferedReader. I'll use try-with-resources and click into the classes because I can't remember if the contract of the outermost reader is that it will close the inner readers too.

Right, now I'll grab the next n bytes from the stream, and start thinking about the algorithm. Swear a bit when I think about crossing the buffer boundaries, and on-and-on...

The IO concerns are very much interwoven with the algorithm.

In Haskell I just start by writing one function from bytes to bytes. That's the computation. Then when that's done I expose that function as bytes to bytes.

Others can hook it up to files, webservers, pipe it through gzip, whatever!

lucianbr1y ago

Reading a row from a database and putting it on the screen, and reading some numbers from the keyboard and putting them in the database. These things I would not call computation. I mean sure, displaying needs to compute coordinates for where to light up pixels, but that's all already written. I just call it. Same with updating btrees when writing to the db.

I'm guessing if all you do is this kind of db - screen - keyboard and back stuff, haskell is not very useful, if not actively a hindrance.

1 more reply

tossandthrow1y ago

Philosophically speaking there is no difference.

What parent commenter probably refers to is that you think in terms of computations and not in terms of data units.

And that is just tremendously elegant.

1 more reply

blandblender1y ago

> I tend to write in a half dozen languages daily

6 languages per day? Are they the same six the next day?

> they all suck in their own special way.

Not surprising if you're writing 6 different languages per day.

> Haskell excels at using computers to compute something

Can you please explain how Haskell computes medians more elegantly than say C?

bdahz1y ago· 3 in thread

How is the performance when compared to similar implementations in C/C++ or Rust?

lazamarOP1y ago

I’d say unbeatable!

The goal was simplicity of implementation and code clarity. For this kind of thing I say Haskell performs the best.

bdahz1y ago

For the simplicity of implementation and code clarity, I need to know how much I need to pay for it.

If the Haskell implementation is 3x slower than C/C++/Rust implementation, it would be acceptable.

If it's 30x slower, I would rather choose C/C++/Rust even the implementation won't be simple.

If it is even possible to be 3x faster than C/C++/Rust, then why not the mainstream programmers adopt Haskell everywhere?

2 more replies

mrkeen1y ago

That wasn't really the spirit of the question as I read it. 'Performance' has a narrower definition than that.

1 more reply

atlintots1y ago· 2 in thread

This is great! Are there any other similar tutorials going through writing a Haskell program, but with some more advanced features (monad transformers, lenses, etc)

trealira1y ago

I would recommend the book Haskell in Depth, which covers both of those topics (monad transformers by chapter 6, lenses in chapter 3 and chapter 14). It also covers some other advanced features, like Template Haskell and concurrency, and has a chapter dedicated to working with SQL databases in Haskell.

mirpa1y ago

You might try: https://github.com/turion/rhine-koans it is tutorial for FRP library Rhine, well commented with tests

chvrchbvrner1y ago· 2 in thread

I think there is a typo in the table of the "Creating prefix-free codes" section. D should be '0010' (not '0110').

Otherwise a great read, thanks!

polytely1y ago

Aha, that makes sense, I was wracking my brain as to how 0110 was unambiguous.

lazamarOP1y ago

Fixed it. Well spotted!

lo_zamoyski1y ago· 1 in thread

Btw, what is that on the girl's shirt in the image?

Direct link: https://lazamar.github.io/images/data-compressor.svg

iso8859-11y ago

She has made Jesus illegal.

lynx231y ago· 1 in thread

Very nice read, thanks for sharing!

revskill1y ago

What's nice ?

goldfishgold1y ago

Coursera’s functional programming course (in Scala) includes a pretty similar Huffman coding assignment with autograder if anybody wants to take a stab at it themselves.

https://www.coursera.org/learn/scala-functional-programming?...

banish-m41y ago

Last time I used Huffman codes, it was to run a MICMAC processor macroprogram (assembly text) in the fewest number of microcycles and to use the fewest microinstructions in the microprogram (microcode). So starting with a histogram of the macroinstructions executed (IIRC, I first wrote an interpreter in C to count how many of each were executed), I crafted a progressive decoding microcode program to implement all of the required ISA macro-operations. IIRC, the macro instruction ISA I created was bit-granular instead of byte-oriented. In the real world, it would've been slow and inconvenient. What's nice about Huffman codes is that you can vary the prefix depth based on the distribution of values, so you don't have to have lopsided codes based on 1 bit prefixes.

Also, the microprogram had to deal with branch prediction because it was a non-superscalar pipelined processor model. Guess the wrong branch, and enjoy wasting cycles on a pipeline stall while the correct branch filters forward.

tankfeeder1y ago

https://rosettacode.org/wiki/Huffman_coding

alwinaugustin1y ago

Thanks for sharing. Very nice and insightful.

2-3-7-43-18071y ago

and yet another episode in the series of "look, what I did with Haskell"

j / k navigate · click thread line to collapse

88 comments

65 comments · 15 top-level

ykonstant1y ago· 14 in thread

lazamarOP1y ago

Haskell's speed can be competitive with systems languages but keep in mind that its killer feature is ease of abstraction.

The idea is that it is simple to assemble multiple parts into a coherent, well organised program. Which is important for the entirety of the program, no just the tight loop.

mrkeen1y ago

I like Haskell performance for every-day backend/web and CLI stuff. But I drop down into Rust when I'm writing something performance-focused.

That said, Haskell's no slouch. Here's a small program to count the 1-bits in a file.

  main :: IO ()
  main = do
    content <- getArgs >>= \[a] -> unsafeMMapVector a Nothing
    print (vectorPopCount content)

  vectorPopCount :: V.Vector Word64 -> Int
  vectorPopCount = V.foldl' (+) 0 . V.map popCount

I haven't tried crunching matrices, but I would start by checking out repa, accelerate, or massiv.

  https://hackage.haskell.org/package/repa
  https://hackage.haskell.org/package/accelerate
  https://hackage.haskell.org/package/massiv

ykonstant1y ago

The lack of heap allocations is great! Thanks for the pointers.

Iceland_jack1y ago

I met Sam Derbyshire at ZuriHac who told me all the difficult architectural work had been done for SIMD support.

+ https://gitlab.haskell.org/ghc/ghc/-/issues/7741

It might make it for GHC 9.12 (for 128 bit vectors only, and mostly floating-point operations unless other people come in and contribute).

The patch is at:

+ https://gitlab.haskell.org/ghc/ghc/-/merge_requests/12860

ykonstant1y ago

Thanks for the info!

wyager1y ago

I wouldn't write a matrix multiplication algorithm in Haskell, but I also wouldn't write one in rust or C if I was serious about speed.

Many focus on number crunching as a performance metric, but almost no one is ever actually bottlenecked on that, and if they are it doesn't really matter what high level language they're using.

GrantMoyer1y ago

> The reality is that for any language, including C, compiler optimized code will never be as fast as hand optimized code

That's not strictly true; sometimes a C compiler can optimize away my whole program: https://godbolt.org/z/oG5nfGE6z

1 more reply

tkz13121y ago

[1]: https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

ReleaseCandidat1y ago

CyberDildonics1y ago

epgui1y ago

I’m not the best person to answer this question, but AFAIK it’s very very fast (in the rough vicinity of C). But also memory-hungry.

freilanzer1y ago

I'm pretty sure highly optimised code won't be elegant Haskell code though.

2 more replies

wredue1y ago

If you’re writing idiomatic Haskell. Its performance is terrible.

If you’re choosing to fight with Haskell, why? Just use something else.

epgui1y ago

I think if your runtime performance is terrible with Haskell, either the explanation is that you might be doing something a bit crazy, or that your application is memory-constrained.

1 more reply

londons_explore1y ago· 10 in thread

The only reason Huffman codes are used is they were invented first and arithmetic codes were patented. That patent has now expired, so we should use the better design.

lifthrasiir1y ago

If you do have an option to switch from Huffman, rANS is now the way to go, not a clasical arithmetic coding.

londons_explore1y ago

Two slight benefits of Huffman codes over arithmetic:

* They usually self synchronize when some data is corrupted (but not guaranteed, does not apply where the Huffman table is dynamic)

* Neither Huffman nor arithmetic codes are easy to parallelize the decoding of, but Huffman is slightly easier.

kqr1y ago

I was under the impression that arithmetic codes are guaranteed to be at least one bit less efficient than Huffman codes per input block. What makes you say they have better compression ratio?

Are you thinking of pre-defined Huffman tables that aren't adapted to the input? Because the latter ought to be as good as it gets.

(I agree with the other benefits. Since arithmetic coding tables are built in a streaming fashion rather than constructing the codebook up front, they are more memory-efficient while working.)

lifthrasiir1y ago

Huffman codes are conceptually isomorphic to arithmetic codes where all probabilities are 2^-k with k integer, so they have an obvious disadvantage due to more inaccurate symbol distribution.

1 more reply

hcs1y ago

lazamarOP1y ago

There is one way in which Huffman codes are better: they are easier to explain and simpler to implement.

I went for simplicity of exposition in the post, but arithmetic coders can indeed get arbitrarily close to the entropy, which is not quite the case with Huffman.

nottorp1y ago

> easier to explain

I think Huffman is the one compression algorithm that compresses stuff significantly that can fit on the proverbial napkin, so it's a good start.

The others require the whole napkin stack at the table.

userbinator1y ago

d_burfoot1y ago

> LZ is even better. Neither arithmetic nor Huffman will compress when the probability of all symbols is the same

http://prize.hutter1.net/hfaq.htm#paq8

lazamarOP1y ago

Indeed, but worth noting that LZ is a modelling scheme, whilst Huffman is a coding technique.

That is, LZ determines, dynamically as it goes, what are all the elements we want to encode and their probabilities. Then you need a coder, like Huffman, to actually encode it.

But I'm considering writing a follow-up changing this static model for an LZ77 one as I think that would be fun.

mrkeen1y ago· 6 in thread

There exists an array-based, in-place algorithm for this, reducing the need to allocate trees and chase pointers.

I mention this only because, when I learned the tree-based approach at uni, I simply wasn't aware that there was another way to do it, and I'm wondering how many of you that's true for as well.

  In-Place Calculation of Minimum-Redundancy Codes
  Moffat, Katajainen.  1995.
  http://hjemmesider.diku.dk/~jyrki/Paper/WADS95.pdf

lifthrasiir1y ago

> In-Place Calculation of Minimum-Redundancy Codes

Or in general, refer to "On the Implementation of Minimum Redundancy Prefix Codes" by Moffat and Turpin (1997), as strongly recommended and later explained by Charles Bloom [1].

[1] https://cbloomrants.blogspot.com/2010/08/08-12-10-lost-huffm...

lazamarOP1y ago

Thanks for the link. I was motivated to write the post after reading Moffat’s book ‘Managing Gigabytes’. A pearl from the 90’s.

The authors mention this technique in the second edition.

userbinator1y ago

The JPEG standard ITU T.81 (1992) has a description of the algorithm in flowcharts, so the knowledge of array-based Huffman was probably already somewhat common in the 80s.

agentultra1y ago

It’s mentioned at the end and left as an exercise to the reader.

mjan226401y ago

> and I'm wondering how many of you that's true for as well

the phrasing sounds like a list comprehension

agumonkey1y ago

true, tickles my brain in all kinds of funny ways

tromp1y ago· 6 in thread

> To make it unambiguous we must make sure that no code word is a prefix of another code word.

    a 1
    b 00
    c 10

a13692099931y ago

> It would be interesting to see a [not gratuitously inefficient] uniquely decodable code that is neither a prefix code nor one in reverse.

This can be done by composing a prefix code with a suffix code:

    A   0
    B  01
    C  11
  a A  0
  b BA 010
  c BB 0101
  d BC 0111
  e C  11
  {a=0,b=010,c=0101,d=0111,e=11}

imurray1y ago

Nicely done; thanks.

imurray1y ago

> It would be interesting to see a uniquely decodable code that is neither a prefix code nor one in reverse.

More interesting than I thought. First the adversarial answer; sure (edit: ah, I see someone else posted exactly the same!):

    a 101
    b 1

But it's a bad code, because we'd always be better with a=1 and b=0.

But hmmmm, I don't know if there exists a uniquely-decodable code with the same set of lengths as an optimal Huffman code that is neither a prefix code nor one in reverse (a suffix code).

n4r91y ago

It's a weird example, but what about

  a 1
  b 101

It is neither prefix-free nor suffix-free. Yet every occurrence of 0 corresponds to an occurrence of b.

However, this is obviously inefficient. So I guess the question is whether there's an optimal code which is neither prefix-free nor suffix-free.

--------------

EDIT

I did some googling and found this webpage https://blog.plover.com/CS/udcodes.html where the author gives the following example of a uniquely decodable code:

imurray1y ago

That code in the EDIT is suboptimal. It doesn't saturate the Kraft inequality. You could make every codeword two bits and still encode 4 symbols, so that would be strictly better.

1 more reply

lazamarOP1y ago

That’s interesting. I guess this is not usually used because you may have a long string of bits that is ambiguous till you get to a disambiguating bit.

Something like

`100000000000000001`

In this case, where to know whether the first code was an `a` or a `c` you have to read all the way to where the zeroes end.

benreesman1y ago· 5 in thread

Haskell is a really nice language. In general I don’t identify as an X programmer for any value of X: I tend to write in a half dozen languages daily and they all suck in their own special way.

But on two separate occasions I made important career decisions with opportunity cost to work with highly lethal GHC contributors: those people are just really good.

random31y ago

How do you distinguish data shuffling from computation? What’s actual computation from this perspective?

mrkeen1y ago

Before I was good at Haskell, I would approach a data-processing job sequentially based on the next thing that needs to be done.

Right, now I'll grab the next n bytes from the stream, and start thinking about the algorithm. Swear a bit when I think about crossing the buffer boundaries, and on-and-on...

The IO concerns are very much interwoven with the algorithm.

In Haskell I just start by writing one function from bytes to bytes. That's the computation. Then when that's done I expose that function as bytes to bytes.

Others can hook it up to files, webservers, pipe it through gzip, whatever!

lucianbr1y ago

I'm guessing if all you do is this kind of db - screen - keyboard and back stuff, haskell is not very useful, if not actively a hindrance.

1 more reply

tossandthrow1y ago

Philosophically speaking there is no difference.

What parent commenter probably refers to is that you think in terms of computations and not in terms of data units.

And that is just tremendously elegant.

1 more reply

blandblender1y ago

> I tend to write in a half dozen languages daily

6 languages per day? Are they the same six the next day?

> they all suck in their own special way.

Not surprising if you're writing 6 different languages per day.

> Haskell excels at using computers to compute something

Can you please explain how Haskell computes medians more elegantly than say C?

bdahz1y ago· 3 in thread

How is the performance when compared to similar implementations in C/C++ or Rust?

lazamarOP1y ago

I’d say unbeatable!

The goal was simplicity of implementation and code clarity. For this kind of thing I say Haskell performs the best.

bdahz1y ago

For the simplicity of implementation and code clarity, I need to know how much I need to pay for it.

If the Haskell implementation is 3x slower than C/C++/Rust implementation, it would be acceptable.

If it's 30x slower, I would rather choose C/C++/Rust even the implementation won't be simple.

If it is even possible to be 3x faster than C/C++/Rust, then why not the mainstream programmers adopt Haskell everywhere?

2 more replies

mrkeen1y ago

That wasn't really the spirit of the question as I read it. 'Performance' has a narrower definition than that.

1 more reply

atlintots1y ago· 2 in thread

This is great! Are there any other similar tutorials going through writing a Haskell program, but with some more advanced features (monad transformers, lenses, etc)

trealira1y ago

mirpa1y ago

You might try: https://github.com/turion/rhine-koans it is tutorial for FRP library Rhine, well commented with tests

chvrchbvrner1y ago· 2 in thread

I think there is a typo in the table of the "Creating prefix-free codes" section. D should be '0010' (not '0110').

Otherwise a great read, thanks!

polytely1y ago

Aha, that makes sense, I was wracking my brain as to how 0110 was unambiguous.

lazamarOP1y ago

Fixed it. Well spotted!

lo_zamoyski1y ago· 1 in thread

Btw, what is that on the girl's shirt in the image?

Direct link: https://lazamar.github.io/images/data-compressor.svg

iso8859-11y ago

She has made Jesus illegal.

lynx231y ago· 1 in thread

Very nice read, thanks for sharing!

revskill1y ago

What's nice ?

goldfishgold1y ago

Coursera’s functional programming course (in Scala) includes a pretty similar Huffman coding assignment with autograder if anybody wants to take a stab at it themselves.

https://www.coursera.org/learn/scala-functional-programming?...

banish-m41y ago

tankfeeder1y ago

https://rosettacode.org/wiki/Huffman_coding

alwinaugustin1y ago

Thanks for sharing. Very nice and insightful.

2-3-7-43-18071y ago

and yet another episode in the series of "look, what I did with Haskell"

j / k navigate · click thread line to collapse