log(“Began {x} connected to {y}”)
log(“Ended {x} connected to {y}”)
We can label the first one logging operation 1 and the second one logging operation 2.
Then, if logging operation 1 occurs we can write out:
1, {x}, {y} instead of “Began {x} connected to {y}” because we can reconstruct the message as long as we know what operation occurred, 1, and the value of all the variables in the message. This general strategy can be extended to any number of logging operations by just giving them all a unique ID.
That is basically the source of their entire improvement. The only other thing that may cause a non-trivial improvement is that they delta encode their timestamps instead of writing out what looks to be a 23 character timestamp string.
The columnar storage of data and dictionary deduplication, what is called Phase 2 in the article, is still not fully implemented according to the article authors and is only expected to result in a 2x improvement. In contrast, the elements I mentioned previously, Phase 1, were responsible for a 169x(!) improvement in storage density.