Code here:
https://github.com/DataDog/sketches-go
A note on Accuracy: At Circonus, we have been using a version of HDR-Histograms [1] for many years to aggregate latency distributions, and calculate accurate aggregated percentiles. Accuracy was never a problem (worst-case error <5%, usually _much_ better).
If I read your evaluation results correctly, you also found HRD-Histograms to be as-accurate or more-accurate, than DDSketches, correct?
The differentiator to HDR Histograms seems to be merging speed and size, where DDSketches seem to have an edge.
One thing that is not immediately clear to me from reading the paper is, how much of the distribution function can be reconstructed from the sketch? E.g. for SLO calculations one is often interested in latency bands: "How many requests were faster than 100ms?" [2].
Is it possible to approximate CDF values ("lower counts") from the sketch with low/bounded error?
[1] https://github.com/circonus-labs/libcircllhist
[2] http://heinrichhartmann.com/pdf/Heinrich%20Hartmann%20-%20La...
To clarify though, DDSketch as defined in the paper uses the logarithmic mapping, as our main goal was to make the memory footprint as small as possible. See: https://github.com/DataDog/sketches-java/blob/1650d939f1485f...
The Java implementation abstracts out the index mapping. This let us add alternative ways to map values to indices and we added the method with the quadratic interpolation as it seems to be a good tradeoff between index computation speed and memory footprint.
Either way - looks very promising, I am excited to take a closer look and possibly use this. Thanks!
But then I got completely lost in the math to show why this guarantees anything. What's actually preventing the scenario where everything falls into the same or just a few bins? (as so happens in the real world where all your data is of a relatively similar order of magnitude) For example what would happen if my data is random from 1 to 10? And on a different server random 1000 to 1010?
Less than that, with the appropriate compression scheme (sparse encoding).
> But then I got completely lost in the math to show why this guarantees anything.
You get guarantees on _relative_ error:
- 100 to 110 is a 10% error
- 1000 to 1010 is a 1% error
Depending on your logarithmic base, you can detect errors up to a certain relative %-tage.
This may be true for the provided implementation, but it's not inherent to the algorithm. The calcuation of the higher order moments would normally done entirely in the log domain where there's little risk of overflow.
I'm interested in this paper because I worked on a somewhat related problem some time ago, but got stuck on how to handle data that morphs into a mixed-modal distribution. Modes that are close together are no big deal, but modes that are spaced more exponentially apart are tricky to deal with. For an example of something that would be in DataDog's purview, it would be like trying to sketch the histogram of response times from an endpoint that sometimes took a "fast" path (e.g. a request for a query whose result was cached), sometimes took a "normal" path, and sometimes took a "slow" path. (e.g. a query with a flag that requested additional details to be computed) If the response times from the slow path is much bigger than the others, e.g. by an order of magnitude, their statistics might essentially drown-out the data from the other two paths since you're using them to calculate bin size.
I noticed you had some results from measuring DDSketch's performance on a mixed-modal distribution that looked pretty good (that "power" distribution on the last page). I was wondering if you had done any more investigation in this area? E.g. how messy/mixed can the data be before the sketch starts to break down?
https://github.com/tdunning/t-digest/blob/master/docs/t-dige...
The problems of having high relative errors on the larger quantiles has been addressed by a line of work that still uses rank error, but promises lower rank error on the quantiles further away from the median by biasing the data it keeps towards the higher (and lower) quantiles [7], [8], [17]. The latter, dubbed t-digest, is notable as it is one of the quantile sketch implementations used by Elasticsearch [18]. These sketches have much better accuracy (in rank) than uniform-rank-error sketches on percentiles like the p99.9, but they still have high relative error on heavy-tailed data sets. Like GK they are only one-way mergeable.
A 0.01 relative-accurate sketch has to give you a value within a factor of 100.
The above example is contrived, but with real web request data, there is often a very long tail, and rank accurate sketches quickly start giving values far from the actual percentiles.
So to more directly answer you question, it's that the maximum error possible is bounded.