It reminds me of the 100 prisoners problem [0]. Yes, I understand why it works. I can see how it works mathematically. But it still feels like it shouldn't work, like we're somehow getting something for free.
https://bugfix-66.com/7f0f425d3eee8def16e1102d054fd45394d027...
Wikipedia gives an impractical, inefficient account of the algorithm. It's almost comically bad.
But the BWT has an efficient inverse transform with very short, simple code, as seen above. This is the classic linked list algorithm.
The forward transform is also simple if done right. See the Go code here:
https://bugfix-66.com/72949624ebbb28b3d0ce5d700970a8857d354b...
Together, that's full code for industrial-strength forward and inverse BWT. All that's missing is some method of avoiding degeneracy in the forward transform's string sort (e.g., noising long runs in the input, or using a suffix array).
I understand the point of the debugging game, but in this context, a link to the solutions page would be more helpful...
I encourage you to edit and fix it.
It's a fantastic example of minimal yet fully functional code!
The splay tree also falls squarely into this category for me. The core bit:
> [...] splay trees can take better than logarithmic time, without requiring advance knowledge of the pattern. According to the unproven dynamic optimality conjecture, their performance on all access patterns is within a constant factor of the best possible performance that could be achieved by any other self-adjusting binary search tree, even one selected to fit that pattern.
I've been iterating a kind of append-only log + database system using this as a technique to optimize I/O patterns. Many forms of storage are very happy when you feed them sequential bytes and prefer reads from more-recently-written physical blocks.
The typical data compression algorithm starts with a combinatorial transformation of the data. Then it creates a statistical model of the transformed data and encodes it according to the model. There are two successful families of combinatorial transformations: Lempel–Ziv and Burrows–Wheeler. Neither of them is superior to the other, but the compression performance depends on the properties of the data.
Lempel–Ziv parsing replaces copies of repeated substrings with references to an earlier copy. It works best when there are long repeated substrings in the data. The modeling/encoding parts of an LZ compressor are usually pretty simple.
The Burrows–Wheeler transform reorders the symbols according to the context in which they occur. It's often, but not always, followed by run-length encoding. Because the symbols are reordered by context, BWT-based compressors work best when there are many copies of each repeated substring.
Unlike LZ, the BWT does not compress the data on its own; it relies entirely on statistical compression. Because the symbols are ordered by context, the BWT acts as an order-n model for every n simultaneously. Hence you get something similar to PPM simply by partitioning the BWT output into blocks and encoding each block independently with an order-0 encoder.
Btw: what's the best source for these kinds of one-of-a-kind, counterintuitive algorithms and data structures? Be it online, be it books...
BWT can't be the only one, right?
For the 100 prisoners problem, the intuition corresponds to n-nested verb tenses, which don’t exist in any language AFAIK.
As an example, let's say prisoner #10 opens his box and finds #21, then finds #5 in that box, then #84, then #51, then finally succeeds and finds his number 10 in box #51. These boxes form a cycle 10-21-5-84-51 (and then back to 10). Anyone who opens any of these boxes will eventually see the same set of numbers that #10 did, so we know prisoners #5, #10, #21, #51 and #84 will all succeed in finding their number by starting at their own box.
Compare that to the situation where they just look in boxes at random and each independently has only a 50% chance to succeed - then the odds that those 5 prisoners all succeed would be (1/2)^5 = 1/32, or only about 3%. With the cycle strategy, in every cycle containing n prisoners, instead of a (1/2)^n joint success rate, they all succeed together or all fail together.
Now that the success of every prisoner in the same cycle is perfectly correlated, the prisoners' overall chance of success depends only on whether the random permutation created by the warden has any cycle longer than 50. If it does, everyone in that cycle will fail. If not, all the cycles across the 100 boxes are short enough that all 100 prisoners will succeed.
Rather than a normal distribution (the sum of independent outcomes) with a big peak around 50 prisoners finding their number and a vanishingly small chance of all (or none) finding the right number, you end up with a much stranger pattern - an approximately uniform distribution over the 0-50 successes range carrying ~70% of the probability, and a huge spike at exactly 100 successes with the remaining ~30%.
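Those numbers are easy to check with a quick Monte Carlo run (Go sketch; the trial count and seed are arbitrary):

```go
package main

import (
	"fmt"
	"math/rand"
)

// cycleStrategy returns the fraction of trials in which ALL n
// prisoners find their own number using the cycle-following strategy:
// each starts at their own box and follows the numbers found inside,
// opening at most n/2 boxes.
func cycleStrategy(n, trials int) float64 {
	rng := rand.New(rand.NewSource(1))
	wins := 0
	for t := 0; t < trials; t++ {
		boxes := rng.Perm(n) // boxes[i] = number hidden in box i
		ok := true
		for p := 0; p < n && ok; p++ {
			found := false
			cur := p // start at your own box
			for step := 0; step < n/2; step++ {
				if boxes[cur] == p {
					found = true
					break
				}
				cur = boxes[cur] // follow the number you just found
			}
			ok = found
		}
		if ok {
			wins++
		}
	}
	return float64(wins) / float64(trials)
}

func main() {
	// Theory: 1 - (1/51 + 1/52 + ... + 1/100) ≈ 0.312
	fmt.Printf("all succeed in ≈ %.3f of trials\n", cycleStrategy(100, 20000))
}
```

The simulated all-succeed rate lands near 31%, matching the probability that the warden's permutation happens to have no cycle longer than 50.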
https://www.r-bloggers.com/2014/08/update-100-prisoners-100-... has a bunch of distribution plots showing this.
Think of it like having to search for someone in a forest with only enough time to search half the forest.
Would you want to search random spots, or be dropped at a random point on the trail leading to the person?
You might yet run out of time, but on average it'll be much more efficient. Surprisingly so in the 100 boxes context.
Veritasium has an excellent video on it but I'm on mobile and can't dig out a link. I'd recommend it though.
https://en.wikipedia.org/wiki/FM-index
The BWT is one of those almost magical tools for compression. But using it for speedy string search is a whole other amazing invention too.
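The search side can be sketched in a few lines too. Here's a toy FM-index count query in Go (my own sketch, assuming a '$' sentinel smaller than every text byte, and using a naive O(n) rank scan where a real index keeps rank/occurrence structures):

```go
package main

import (
	"fmt"
	"sort"
)

// fmCount answers "how many times does pat occur in text?" via
// backward search over the BWT, one pattern character at a time,
// without ever decompressing or scanning the text.
func fmCount(text, pat string) int {
	t := text + "$" // sentinel: unique, smaller than any text byte
	n := len(t)
	// With the sentinel, sorting suffixes equals sorting rotations.
	sa := make([]int, n)
	for i := range sa {
		sa[i] = i
	}
	sort.Slice(sa, func(a, b int) bool { return t[sa[a]:] < t[sa[b]:] })
	last := make([]byte, n) // the BWT: char preceding each suffix
	for r, p := range sa {
		last[r] = t[(p+n-1)%n]
	}
	// C[c] = number of symbols in t strictly smaller than c
	var C [256]int
	for _, c := range last {
		C[c]++
	}
	sum := 0
	for c := 0; c < 256; c++ {
		sum, C[c] = sum+C[c], sum
	}
	// occ(c, i) = occurrences of c in last[:i]; naive scan here,
	// a real FM-index answers this in O(1) with rank structures.
	occ := func(c byte, i int) int {
		k := 0
		for j := 0; j < i; j++ {
			if last[j] == c {
				k++
			}
		}
		return k
	}
	lo, hi := 0, n // half-open row interval of suffixes matching so far
	for i := len(pat) - 1; i >= 0 && lo < hi; i-- {
		c := pat[i]
		lo = C[c] + occ(c, lo)
		hi = C[c] + occ(c, hi)
	}
	return hi - lo
}

func main() {
	fmt.Println(fmCount("banana", "ana")) // 2
	fmt.Println(fmCount("banana", "ban")) // 1
	fmt.Println(fmCount("banana", "xyz")) // 0
}
```

Each loop iteration narrows the interval of sorted rotations whose prefix matches the pattern suffix seen so far - that's the whole trick, and it's why the query cost depends on the pattern length, not the text length.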
He shares the origin of the algorithm as well as a story about how it was first published.
The Compressor Head video series is the best introduction to compression that I’ve found.
Apparently, back in the day, Mike had some sort of philosophical opposition to indenting his C code.
I was impressed he managed to write such a complex piece of code without ever indenting anything.
That explains his aversion to Python back then: https://archive.is/OFmNT
imo, Mike Burrows' impact on the industry rivals that of the Turing awardees. Exhibit A from 2008: https://archive.is/ZPd4f
JPL and Google were early Python users. Google's crawler and the home page started out in Python (trivia: the ancient asyncore stdlib module and its author played some role, as far as I remember).
Pretty much all AI and self-driving cars use Python; BitTorrent was written in Python; Python is embedded in GDB and in huge commercial applications like Maya for visual effects; etc. I think Ethereum had an early Python implementation.
Python's design achieved Guy Steele's "growing a language" vision from his famous talk, i.e. using operator overloading to let scientists create their own lingua franca -- https://news.ycombinator.com/item?id=29171519 (the talk was about adding operator overloading to Java, as well as "generics" and value types)
Mike Burrows is a great (even legendary) programmer and computer scientist, but if you're talking about "impact on the industry", there are levels :) I have to say I'm a big Alta Vista fan though
I wonder if he'd enforce a 's/^\s*//g' pre-commit hook if you worked with him...
Burrows-Wheeler Transform [video] - https://news.ycombinator.com/item?id=10721401 - Dec 2015 (5 comments)
Compression with the Burrows-Wheeler Transform - https://news.ycombinator.com/item?id=1112845 - Feb 2010 (6 comments)
https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
If you're interested in string search algorithms, one of the cooler ones to come out of genomics is Bitap: its cost scales with alphabet size, and DNA has a small alphabet. https://en.wikipedia.org/wiki/Bitap_algorithm
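A minimal exact-match Bitap (the shift-and variant) can be sketched in Go; the per-character bitmask table below is exactly why a small alphabet like DNA's suits it so well:

```go
package main

import "fmt"

// bitap finds the first occurrence of pat in text using the shift-and
// technique: one machine word tracks every active prefix match at
// once, so each text byte costs a couple of bit operations.
func bitap(text, pat string) int {
	m := len(pat)
	if m == 0 || m > 63 {
		return -1 // sketch limitation: pattern must fit in one word
	}
	// mask[c] has bit i set iff pat[i] == c; for DNA this table is
	// effectively just 4 useful entries.
	var mask [256]uint64
	for i := 0; i < m; i++ {
		mask[pat[i]] |= 1 << uint(i)
	}
	var state uint64 // bit i set iff pat[:i+1] matches ending here
	for j := 0; j < len(text); j++ {
		state = ((state << 1) | 1) & mask[text[j]]
		if state&(1<<uint(m-1)) != 0 {
			return j - m + 1 // full pattern matched
		}
	}
	return -1
}

func main() {
	fmt.Println(bitap("banana", "nan")) // 2
	fmt.Println(bitap("banana", "xyz")) // -1
}
```

The same bit-parallel state machine extends to approximate matching (errors tracked with extra words per edit distance), which is the form genomics tools actually use.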
Just yesterday I had a file where bzip2 won. I would love to see a modern version with the first few stages of BWT feeding into xz or zstd.
The question is whether that would actually help compression.