You cannot eliminate confounds by simple over-measurement. “Overkill” provides false confidence.
The only way to eliminate confounds is to understand them, and then either control for them or bound them to an acceptable error tolerance. For a simple benchmark such as this, cache misses and page faults reach steady state within the first millisecond of operation. The error they contribute to a 0.1 s measurement, even in aggregate, is therefore no more than about 1% (1 ms out of 100 ms), which is almost surely acceptable.
I have no experience with .NET, and will not try to estimate the contribution of VM startup time, but the experiment in question does not include VM startup in the measurement.
If a system were so noisy as to have interrupt storms on the order of 0.1 s, then I would not be comfortable with timings that run for 60 s either. I would much rather have statistics on 100 measurements of 0.1 s each, which would make the impact of such anomalies clear (while still being faster to gather). There are many events that can make such measurements slower, but almost none that can make them faster; the distribution of the measurements is typically well-modeled by a Poisson distribution with bias. If one is trying to eliminate the effect of those events from the measurement, taking the minimum over many short samples is much closer to the truth than averaging over one long sample. If instead one is trying to include the effect of such events, then a different statistic (the mean or a high percentile, for example) would be in order.
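
As a rough sketch of what I mean, here is how one might gather many short samples and look at their distribution instead of timing one long run. This is Python purely for illustration, not the code from the question: `workload`, `sample_timings`, and `summarize` are hypothetical names, and the single warm-up call is my assumption about how to get past the initial cache misses and page faults.

```python
import statistics
import time


def workload():
    # Hypothetical stand-in for the code being benchmarked.
    return sum(i * i for i in range(10_000))


def sample_timings(fn, samples=100, sample_seconds=0.1):
    """Collect `samples` short measurements; each is the mean time per call."""
    fn()  # warm-up call (assumption): let caches and page tables settle first
    timings = []
    for _ in range(samples):
        calls = 0
        start = time.perf_counter()
        # Repeat the workload until this sample has run for ~sample_seconds.
        while True:
            fn()
            calls += 1
            elapsed = time.perf_counter() - start
            if elapsed >= sample_seconds:
                break
        timings.append(elapsed / calls)
    return timings


def summarize(timings):
    """Report several statistics so that anomalies are visible, not averaged away."""
    ordered = sorted(timings)
    return {
        "min": ordered[0],  # closest to the undisturbed cost
        "median": statistics.median(ordered),
        "mean": statistics.fmean(ordered),  # includes the effect of disturbances
        "max": ordered[-1],
    }
```

Calling `summarize(sample_timings(workload))` takes roughly 10 s with the defaults. A large gap between `min` and `mean` or `max` is a sign that something other than the code under test is showing up in the numbers.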