
scott_s · 13y ago

One of those events, yes. But the system may be experiencing a bursty workload unrelated to your benchmark, in which case many of those events can happen. There are also startup effects at the high level (the VM, which in this case is .NET), the medium level (major and minor page faults), and the low level (caches).

My rule of thumb is that benchmarks which are supposed to be bound by the processor and memory should last at least 60 seconds.


stephencanon · 13y ago

VM startup effects I can get behind as a confound; page faults and cache effects are below the millisecond level (filling the beefiest Sandy Bridge Xeon L3 cache you can buy from a completely cold state is on the order of 1 millisecond, and a microbenchmark like this doesn’t come close to using that much data).

I would also note that one is sometimes in the position of needing to measure performance of a compute-intensive task that is latency-critical but will not be running constantly; in such a scenario, using long-running benchmarks can be misleading because the processor will become thermally constrained and drop in and out of lower voltage/frequency bands, further confounding measurements.

I agree with you that a tenth of a second is on the shorter side of what I would like to see in such a benchmark, but I don’t think the situation is as dire as your first post suggested; unless the system is exceptionally noisy, the measurements seem to be valid, despite the relatively short duration. 60 seconds is overkill for a simple task like this.

scott_s (OP) · 13y ago

Again, it's repeated page faults and cache effects.

When you run experiments, you want to draw conclusions. To have confidence in your conclusions, you want to eliminate as many variables as possible. In my work, I set the duration of the benchmark high enough that I am confident these effects are very unlikely to have a significant influence on the results. When you're drawing conclusions and publishing results that will be scrutinized by peers, "overkill" is the way to go.

Also note that I was not the first poster on this subject.

stephencanon · 13y ago

You cannot eliminate confounds by simple over-measurement. “Overkill” provides false confidence.

The only way to eliminate confounds is to understand them, and either control for them or bound them to an acceptable error tolerance. For a simple benchmark such as this, cache misses and page faults reach steady state within the first millisecond of operation; the error they contribute to the measurement of a 0.1 s benchmark (even in aggregate) is no more than 1%, which is almost surely acceptable.
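The 1% figure can be checked with a quick back-of-the-envelope sketch. It assumes the worst case, where the whole 1 ms warm-up period counts as pure overhead; the function name and the list of run lengths are only illustrative.

```python
def relative_error_bound(warmup_seconds, run_seconds):
    """Upper bound on the fraction of a run spent in warm-up.

    Assumption: the entire warm-up period is treated as overhead,
    so this overstates the real error.
    """
    return warmup_seconds / run_seconds


# 1 ms of warm-up (the figure from the comment) against runs of
# different lengths, including the 0.1 s and 60 s cases discussed.
for run in (0.1, 1.0, 60.0):
    bound = relative_error_bound(1e-3, run)
    print(f"{run:>5.1f} s run: warm-up is at most {bound:.4%} of the measurement")
```

Under that assumption, the bound shrinks in proportion to the run length, which is the trade-off being argued about here.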

I have no experience with .NET, and would not care to make any estimates on the contribution of VM startup time, but the experiment in question does not include the VM startup in the measurement.

If a system were so noisy as to have interrupt storms on the order of 0.1 s, then I would not be comfortable with timings that run for 60 s either. I would much rather have statistics on 100 measurements of 0.1 s each, which would make clear the impact of such anomalies (while still being faster to gather). There are many events that can make such measurements slower, but almost none that can make them faster; the distribution of the measurements is typically well-modeled by a Poisson distribution with bias. If one is trying to eliminate the effect of those events from the measurement, taking the minimum over many short samples is much closer to the truth than averaging over one long sample. If instead one is trying to include the effect of such events, then a different statistic would be in order.
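A minimal sketch of the "many short samples" approach, in Python rather than the .NET of the original benchmark. The `workload` function is a hypothetical CPU-bound stand-in and the repeat count of 100 is taken from the comment above; everything else is an illustrative assumption, not the original poster's harness.

```python
import statistics
import time


def time_once(fn):
    """Return the wall-clock seconds taken by one call to fn."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def sample_timings(fn, repeats=100):
    """Time fn `repeats` times and return every individual sample."""
    return [time_once(fn) for _ in range(repeats)]


def summarize(samples):
    """Report several statistics so that anomalies are visible.

    The minimum estimates the undisturbed cost; the gap between the
    minimum and the mean or maximum shows how noisy the system was.
    """
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "mean": statistics.fmean(samples),
        "max": max(samples),
    }


def workload():
    # Hypothetical stand-in for the code under test.
    total = 0
    for i in range(200_000):
        total += i * i
    return total


if __name__ == "__main__":
    stats = summarize(sample_timings(workload, repeats=100))
    for name, value in stats.items():
        print(f"{name:>6}: {value * 1e3:.3f} ms")
```

Keeping every sample, rather than one aggregate time, is what makes a disturbed run show up as an outlier instead of being silently averaged in.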
