In fact, performance numbers (latencies) often follow a heavy-tailed distribution. For these, you need a literal shitload of samples to get even a slightly normal mean. For these, the sample mean, the sample variance, the sample centiles -- they all severely underestimate the true values.
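To make the underestimation concrete, here is a minimal simulation sketch. It assumes latencies follow a Pareto distribution with tail index 1.5 and minimum 1; both are illustrative choices, not measurements. It counts how often the mean of 100 samples lands below the true mean.

```python
import random
import statistics

ALPHA = 1.5                      # assumed tail index; smaller means a heavier tail
TRUE_MEAN = ALPHA / (ALPHA - 1)  # exact mean of a Pareto with x_min = 1 (3.0 here)


def sample_means(n_samples=100, n_trials=2000, seed=0):
    """Return the mean of n_samples Pareto draws, repeated n_trials times."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.paretovariate(ALPHA) for _ in range(n_samples))
        for _ in range(n_trials)
    ]


means = sample_means()
# Fraction of experiments whose sample mean falls short of the true mean.
below = sum(m < TRUE_MEAN for m in means) / len(means)

print(f"true mean:                      {TRUE_MEAN:.2f}")
print(f"median of the sample means:     {statistics.median(means):.2f}")
print(f"sample means below true mean:   {below:.0%}")
```

Under these assumptions, most individual experiments report a mean below the true one, because the rare huge values that pull the expectation up are usually missing from any given sample.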
What's worse is when these tools start to remove "outliers". With a heavy-tailed distribution, the majority of samples don't contribute very much at all to the expectation. The strongest signal is found in the extreme values. The strongest signal is found in the stuff that is thrown out. The junk that's left is the noise, the stuff that doesn't tell you very much about what you're dealing with.
I stand firm in my belief that unless you can prove how CLT applies to your input distributions, you should not assume normality.
And if you don't know what you are doing, stop reporting means. Stop reporting centiles. Report the maximum value. That's a really boring thing to hear, but it is nearly always statistically and analytically meaningful, so it is a good default.
So, e.g.: for a normal distribution, with very narrow tails -- probabilities like exp(-x^2) -- your sample mean is the maximum-likelihood estimator. For a double-exponential (Laplace) distribution, whose tails are like exp(-|x|) and therefore much fatter, the maximum-likelihood estimator is the _median_, which gives much less weight to outliers. Another way to look at it: the mean minimizes the sum of |x-m|^2 and the median minimizes the sum of |x-m|, and the former grows in size and hence importance much faster as x gets large.
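A small numerical check of that last point, as a sketch: the timings are made up, and a brute-force grid search stands in for the calculus. The value minimizing the squared loss should land on the sample mean, and the value minimizing the absolute loss on the sample median.

```python
import statistics


def squared_loss(xs, m):
    """Sum of |x - m|^2 over the sample."""
    return sum((x - m) ** 2 for x in xs)


def absolute_loss(xs, m):
    """Sum of |x - m| over the sample."""
    return sum(abs(x - m) for x in xs)


# Hypothetical timings in milliseconds, with one large outlier.
timings = [10.1, 10.3, 9.9, 10.0, 10.2, 10.4, 55.0]

# Candidate centers from 9.00 to 60.00 in steps of 0.01.
candidates = [c / 100 for c in range(900, 6001)]

best_squared = min(candidates, key=lambda m: squared_loss(timings, m))
best_absolute = min(candidates, key=lambda m: absolute_loss(timings, m))

print("argmin of squared loss: ", best_squared, "| mean:  ", statistics.fmean(timings))
print("argmin of absolute loss:", best_absolute, "| median:", statistics.median(timings))
```

The single 55.0 drags the squared-loss minimizer far away from the bulk of the data, while the absolute-loss minimizer stays put.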
In benchmarks, assuming you run the same workload each time, you often want the minimum value. Anything else just tells you how much system overhead you encountered.
(Complete agreement that applying statistics without knowing anything about the distribution can mislead.)
There's an argument for the maximum being the better choice if this will run in production on a system that might encounter unrelated loads. Predictable performance is frequently more useful than theoretical maximum performance.
That said, I still agree a little bit. The minimum value is also a useful metric, and if you have the opportunity to report two numbers, the minimum-maximum pair is a great choice.
This is when the code you write is deterministic and the interference is not. The minimum is closer to what you would get without interference.
Just don't expect the time to represent typical results. (And why would you expect that if you're running benchmarks on your development machine?)
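A toy model of that argument, as a sketch only: it assumes the program itself always takes exactly `BASE` seconds and that interference can only add time. The noise model (exponential jitter plus an occasional long stall) is hypothetical.

```python
import random
import statistics

BASE = 1.0  # assumed interference-free runtime in seconds


def simulate_runs(n=50, seed=1):
    """Simulate n timings of a deterministic program under additive interference."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n):
        noise = rng.expovariate(20.0)        # small jitter, mean 0.05 s
        if rng.random() < 0.05:              # occasional stall, e.g. another process
            noise += rng.uniform(0.5, 2.0)
        runs.append(BASE + noise)
    return runs


runs = simulate_runs()
print(f"min:  {min(runs):.3f} s")
print(f"mean: {statistics.fmean(runs):.3f} s")
print(f"max:  {max(runs):.3f} s")
```

In this model the minimum sits very close to `BASE`, while the mean and the maximum mostly describe the interference. Whether real interference is purely additive is exactly the assumption to question.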
> Most -- nearly all -- benchmarking tools like this work from a normality assumption
I don't think that hyperfine makes any assumption about normality. Sure, we do report sample mean and sample standard deviation by default, but we also report sample minimum and the maximum. You can also easily export all the benchmark results and inspect in more detail with the supplied Python scripts.
> In fact, performance numbers (latencies) often follow a heavy-tailed distribution
So when is this really the case? In my understanding, if I am measuring the runtime of a deterministic program with the same input, the runtime should only be influenced by external factors that are out of my control (other programs being scheduled, caching effects, hardware-specific influences, ..). These are exactly the things that I want to "average out" by running the benchmark multiple times.
> What's worse is when these tools start to remove "outliers".
Hyperfine never removes outliers. What we do is to try and detect outliers. We do this by computing robust statistical estimates that specifically DO NOT assume a normal distribution (see https://github.com/sharkdp/hyperfine/blob/master/src/hyperfi... for details).
We perform this outlier detection to warn users about potentially interfering processes or caching effects.
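For readers who want a feel for this kind of robust detection, here is a minimal sketch of a modified z-score based on the median and the median absolute deviation (MAD), in the style of Iglewicz and Hoaglin. This is not hyperfine's code; the linked source file is the reference for that. The 3.5 cutoff and the timings are assumptions for illustration, apart from the 1.152 s first run.

```python
import statistics

CUTOFF = 3.5  # commonly quoted cutoff for modified z-scores (assumed here)


def modified_z_scores(xs):
    """Score each value by its distance from the median, in units of scaled MAD."""
    med = statistics.median(xs)
    mad = statistics.median(abs(x - med) for x in xs)
    if mad == 0:
        return [0.0 for _ in xs]  # degenerate case: over half the values are identical
    return [0.6745 * (x - med) / mad for x in xs]


# Hypothetical timings in seconds; the first run hits cold disk caches.
timings = [1.152, 0.212, 0.208, 0.215, 0.210, 0.209, 0.211, 0.214]

scores = modified_z_scores(timings)
outliers = [t for t, z in zip(timings, scores) if abs(z) > CUTOFF]
print("flagged as outliers:", outliers)
```

Because the median and the MAD barely move when one value is extreme, the cold-cache run stands out without distorting the scale it is measured against.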
Take a look at these results, for example: https://i.imgur.com/XRvE6Ys.png
I benchmarked a file-searching program. The underlying distribution, while probably not normal, seems to be "well behaved" and I think that the sample mean and the sample standard deviation could be quantities with a reasonably predictive power.
What you do NOT see in the histogram is a single outlier at 1.15 seconds, far outside the plot to the right. This was the first benchmark run where the disk caches were still cold. In such a case, hyperfine warns the user:
Warning: The first benchmarking run for this command was significantly slower than the rest (1.152 s). This could be caused by (filesystem) caches that were not filled until after the first run. You should consider using the '--warmup' option to fill those caches before the actual benchmark. Alternatively, use the '--prepare' option to clear the caches before each timing run.
In conclusion, I am not quite sure how your criticism applies to hyperfine, but I'd be happy to get further feedback.
> Sure, we do report sample mean and sample standard deviation by default, but we also report sample minimum and the maximum. You can also easily export all the benchmark results and inspect in more detail with the supplied Python scripts.
Hyperfine presents itself as a black-box tool where you plug in a program and it is supposed to output usable numbers. For that type of tool, "but you can script up a custom report" is not how you defend wildly speculative reporting!
I know trying to black-box performance measurements is very, very annoying, because things we are used to taking for granted simply don't hold. But that is an inherent problem of the field, and not something one can wish away.
> In my understanding, if I am measuring the runtime of a deterministic program with the same input, the runtime should only be influenced by external factors that are out of my control (other programs being scheduled, caching effects, hardware-specific influences, ..). These are exactly the things that I want to "average out" by running the benchmark multiple times.
This is broadly correct. The common misconception concerns how many runs are needed to successfully average out these external factors: probably more than you want your user to wait through. Sums of numbers drawn from heavy-tailed distributions converge very slowly to the normal distribution, to the point where a lot of people won't wait for it to become even close to normal.
> Hyperfine never removes outliers. What we do is to try and detect outliers. We do this by computing robust statistical estimates that specifically DO NOT assume a normal distribution (see ... for details).
Sorry, that was my misreading. I'm glad you don't remove outliers! I like that you're using robust estimations, but I'm still not convinced they work as well as we would want them to. I'm sure someone else could formalise this, but just based on very simple experimentation[1], I get some worrying results: When the cutoff value is chosen so that D > 3.5 is labeled as an outlier, the samples labeled as "outliers" reliably contribute around 0.75 to the expectation of the sample.
Despite using robust estimations, the outliers completely dominate any expectations about the sample.
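A sketch of how one might run this kind of experiment. The input distribution is a Pareto with tail index 1.2, which is my assumption for illustration, and the outlier score is a median/MAD-based modified z-score rather than necessarily the exact D used above, so the resulting share need not match the 0.75 figure.

```python
import random
import statistics


def outlier_share(xs, cutoff=3.5):
    """Return (share of the sample sum, share of the sample count) held by flagged values."""
    med = statistics.median(xs)
    mad = statistics.median(abs(x - med) for x in xs)
    # Modified z-score; assumes mad > 0, which holds for continuous data like this.
    flagged = [x for x in xs if 0.6745 * abs(x - med) / mad > cutoff]
    return sum(flagged) / sum(xs), len(flagged) / len(xs)


rng = random.Random(42)
samples = [rng.paretovariate(1.2) for _ in range(10_000)]  # assumed heavy-tailed input

share_of_sum, share_of_count = outlier_share(samples)
print(f"flagged samples: {share_of_count:.1%} of the count")
print(f"flagged samples: {share_of_sum:.1%} of the sum (and hence of the sample mean)")
```

The interesting comparison is between the two printed shares: a small fraction of the samples can carry a much larger fraction of the mean.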
> Take a look at these results, for example: https://i.imgur.com/XRvE6Ys.png
> I benchmarked a file-searching program. The underlying distribution, while probably not normal, seems to be "well behaved" and I think that the sample mean and the sample standard deviation could be quantities with a reasonably predictive power.
To me, that also looks like a heavy-tailed distribution where there simply aren't enough samples to reveal the extremal values that exist in the real population. If you still have the raw data, we could try a K-S test against the MLE fitting of some common heavy-tailed distributions to see if it's possible to rule them out, but I suspect we won't be able to do that.
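Roughly what I have in mind, as a sketch on synthetic data, since I don't have the raw timings: fit a Pareto by maximum likelihood, then compute the K-S statistic against the fitted CDF. The choice of Pareto, the sample size and the seed are all assumptions, and only the standard library is used.

```python
import math
import random


def fit_pareto_mle(xs):
    """Maximum-likelihood Pareto fit: x_min is the sample minimum, alpha follows from the logs."""
    x_min = min(xs)
    alpha = len(xs) / sum(math.log(x / x_min) for x in xs)
    return x_min, alpha


def ks_statistic(xs, cdf):
    """Largest gap between the empirical CDF of xs and the given model CDF."""
    xs = sorted(xs)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, f - i / n, (i + 1) / n - f)
    return d


rng = random.Random(7)
data = [rng.paretovariate(1.5) for _ in range(500)]  # stand-in for real timings

x_min, alpha_hat = fit_pareto_mle(data)
d = ks_statistic(data, lambda x: 1.0 - (x_min / x) ** alpha_hat)

# Asymptotic 5% critical value for a fully specified distribution.
critical = 1.36 / math.sqrt(len(data))
print(f"alpha_hat = {alpha_hat:.2f}, D = {d:.4f}, 5% critical value = {critical:.4f}")
```

One caveat: because the parameters are estimated from the same data, the usual critical value is conservative, so failing to reject says even less than it normally would. A bootstrap would give a better-calibrated threshold.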
Old discussion: https://news.ycombinator.com/item?id=16193225
Looking forward to your feedback!
No, we never thought about HTML output. However, there are multiple other export options and we also ship Python scripts that can be used to plot the benchmark results. The script is not very large, so far, but we are happy to add new scripts if the need for one should arise. What kind of diagrams would you like to see?
Also, I have never thought about using criterion.rs. My feeling was that it is suited for benchmarks with thousands of iterations, while we typically only have tens of iterations in hyperfine (as we typically benchmark programs with execution times > 10 ms). Do you have any specific criterion feature in mind that we could benefit from?
FWIW I wrote a rough first version of a tool that runs a hyperfine benchmark over all commits in a repo and plots the results in order to see which commits cause performance changes: https://github.com/dandavison/chronologer
In the past, I’ve cobbled together quick bash pipelines to run time in a loop, awk out timings, and compute averages, but it was always a pain. Hyperfine has a great interface and really useful reports. It actually reminds me quite a bit of Criterion, the benchmarking suite for Rust.
I also use fd and bat extensively, so thanks for making such useful tools!