Back when we had to live with sort of printing-press methods of displaying information (i.e., where anything that wasn't pure text was very difficult to display), mean/median/mode numbers were a necessary evil. But if you're looking at a computer screen, there's really no reason to subject yourself to an abstraction that throws out 90% of your data.
Showing the full histogram isn't a full solution either, though. Not only does using the average latency obscure the issue by boiling it down to a single scalar, but the full histogram of latencies also loses the information on note-to-note consistency! That's because a latency histogram loses sequencing information, so it doesn't distinguish between the case where you had a lot of 20ms latencies in a row followed by a lot of 50ms latencies in a row, and the case where every other message oscillated between 20ms and 50ms latencies (much worse). You can try to capture some of that information by making a histogram of adjacent-latency deltas, as one attempt. Or you can capture a different view on it by plotting latency vs. time and looking for spikes (but that can obscure less-obvious trends, and is unwieldy as a data representation if you're trying to summarize a system's behavior over a period of hours).
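A minimal Python sketch of the adjacent-latency-delta idea described above. The two traces are made up for illustration, and `adjacent_deltas` is a hypothetical helper name, not something from the paper.

```python
from collections import Counter

def adjacent_deltas(latencies):
    """Absolute change between each latency and the one before it."""
    return [abs(b - a) for a, b in zip(latencies, latencies[1:])]

# Two hypothetical traces (ms) with the same values in a different order.
blocked = [20] * 5 + [50] * 5      # a run of 20ms, then a run of 50ms
alternating = [20, 50] * 5         # every other message flips

# The plain latency histograms cannot tell the two traces apart...
same_histogram = Counter(blocked) == Counter(alternating)

# ...but the histograms of adjacent deltas can.
blocked_deltas = Counter(adjacent_deltas(blocked))
alternating_deltas = Counter(adjacent_deltas(alternating))

print(same_histogram)
print(dict(blocked_deltas))
print(dict(alternating_deltas))
```

The blocked trace has a single 30ms jump among otherwise zero deltas, while the alternating trace jumps 30ms on every message, even though both have five 20ms and five 50ms samples.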
The paper is here, though the actual numbers are 9 years old at this point, so probably not that useful: http://www.cs.hmc.edu/~bthom/res/midi_timing/publications/IC...
Came here to say exactly this. And averages are especially insidious when used for data that doesn't have a symmetric distribution, like most latencies.
Author here. I think most people on HN would echo your sentiment about averages wholesale ... But I wanted to go a little deeper into selecting a better alternative for operational monitoring.
It's easy to say "averages are bad" but harder to say "use X instead" and explain why. We tried. Do you think we did it?
And you can always throw things into gnuplot to get a quick, exploratory look at things. It will at least give you a sense of whether you're looking at a normal distribution, something skewed, a multi-modal distribution, etc.
I am in complete agreement. Unfortunately, a lot of monitoring and APM tools still lead with average response time as one of the top-level metrics. And a lot of people still make incorrect assumptions based on it.
Although a latency percentile is not great either. I try to make the case for using a metric that counts acceptable experiences rather than reporting their latency value, e.g. the Apdex index or our derived sat score.
Best, Mike
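For readers who haven't met Apdex: a rough Python sketch of the standard formula (satisfied requests count fully, tolerating requests count half, with the tolerating band running from T to 4T). The sample latencies and the threshold `t=0.5` are invented for illustration, and this is not the derived sat score mentioned above.

```python
def apdex(latencies, t):
    """Apdex score for a list of latencies and a target threshold t.

    satisfied:  latency <= t
    tolerating: t < latency <= 4t
    frustrated: everything slower (contributes nothing to the score)
    """
    if not latencies:
        raise ValueError("need at least one sample")
    satisfied = sum(1 for x in latencies if x <= t)
    tolerating = sum(1 for x in latencies if t < x <= 4 * t)
    return (satisfied + tolerating / 2) / len(latencies)

# Hypothetical response times in seconds.
samples = [0.3, 0.4, 0.5, 1.2, 1.9, 2.5]
score = apdex(samples, t=0.5)
print(round(score, 3))
```

Here three requests are satisfied, two are tolerating, and one is frustrated, so the score is (3 + 2/2) / 6.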
Median latency -- perhaps with (the smoothing-effect of) a rolling median -- would be more robust to outliers without having to resort to hardcoding of "too slow" thresholds. It would still require the human to connect the dots (e.g. median latency of >200 is "too slow") but it's an improvement on mere average response time for reasons noted.
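A quick Python sketch of a rolling median, to show the robustness to outliers. The window size and the sample values are arbitrary assumptions, and `rolling_median` is just an illustrative name.

```python
from collections import deque
from statistics import median

def rolling_median(values, window):
    """Median of the last `window` values, emitted once per input value."""
    recent = deque(maxlen=window)   # old samples fall off automatically
    out = []
    for v in values:
        recent.append(v)
        out.append(median(recent))
    return out

# Hypothetical latencies with a single large outlier.
latencies = [100, 110, 105, 2000, 95, 108, 102]
smoothed = rolling_median(latencies, window=3)
print(smoothed)
```

With a window of 3, the single 2000 sample never becomes the median, so the smoothed series stays close to 100 throughout. A sustained slowdown lasting more than half the window would still show up.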
Sounds like a good problem for an analytics startup to tackle.
Unfortunately, you HAVE TO do it. If you do not set a threshold for what is acceptable, how do you determine whether or not you are providing an acceptable experience to your users?
No amount of aggregate metrics can help you answer this question unless you know what's acceptable and what isn't, for each important set of URLs in your app.
I agree that it's "hard" to do. In our own product (https://www.leansentry.com), we solve this problem by grouping URLs, using good defaults, and making it easy for users to override the thresholds.
To use one example which is prevalent throughout marketing, advertising, etc.: Google Analytics reports only averages, which makes the results unreliable enough that I'm now advising people to simply pretend that field does not exist, as it's completely untrustworthy. A while back I blogged about an example where 3 samples out of 200K threw the average off by a full order of magnitude: http://chris.improbable.org/2012/05/18/google-analytics-dece...
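To see how few samples that takes, here is a small Python sketch. The numbers are invented to reproduce the shape of the problem (3 wild samples out of 200K) and are not the data from the linked post.

```python
from statistics import median

# Hypothetical page-load times in seconds (made-up numbers):
# 199,997 ordinary samples plus 3 absurd ones.
samples = [2.0] * 199_997 + [1_200_000.0] * 3

mean_load = sum(samples) / len(samples)
median_load = median(samples)

print(f"mean   = {mean_load:.2f}s")
print(f"median = {median_load:.2f}s")
```

Under these assumptions the mean lands near 20 seconds, ten times the typical value, while the median stays at 2 seconds.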
Author here. The 99th or 95th percentile is a much better metric! We also make the case for the industry-standard Apdex or our derived metric, sat score. These are increasingly used by APM tools like ours or New Relic.
Unfortunately, many existing tools and people who use them still look at latency aggregates and often make incorrect assumptions.
I completely agree! Keeping toplevel metrics SIMPLE is the key. Of course, simple but also not misleading you into any wrong beliefs.
While we liked the 95th percentile approach, we decided against it. It's still too focused on the actual response time itself, which we thought was less relevant than the number of users experiencing bad performance.
I think for us the bottom line was:
A) If you are having site-wide performance issues, the 95th percentile is a good metric.
B) However, if you have more isolated issues (we find this happens more often to more mature sites), satisfaction score is better.
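A rough Python illustration of the isolated-issue case in (B). The traffic mix and the 2-second threshold are assumptions, and the "fraction of slow requests" below is a simple stand-in for a threshold-based metric, not the actual satisfaction score formula.

```python
def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100   # ceil(pct * n / 100)
    return ordered[rank - 1]

# Hypothetical traffic: most requests are fast, but one small group of
# URLs (3% of requests) is badly broken.
latencies = [0.2] * 970 + [8.0] * 30
threshold = 2.0   # assumed "too slow" cutoff, in seconds

p95 = percentile_nearest_rank(latencies, 95)
slow_fraction = sum(1 for x in latencies if x > threshold) / len(latencies)

print(p95)
print(slow_fraction)
```

In this made-up mix the 95th percentile still reports 0.2s, because the slow requests sit above it, while counting requests over the threshold shows that 3% of them were unacceptable.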
Best, Mike
Standard deviation is often a useful metric, but it's at least as flawed as mean in skewed distributions because it doesn't treat either direction around the (already flawed) mean any differently.
Author here. Did you also search for stdev, st.dev, variance? Just kidding.
The post is not about averages. It's about selecting the right metric to track website performance. Standard deviation would surely qualify the avg. latency a bit, but it would still be a pretty lame alternative to using a better top-level metric like Apdex.
Best, Mike
When he was writing Quake, they could trade off between lightning-fast graphics (40fps+, on a 486) 99% of the time with the occasional horrible slowdown to less than 5fps, vs. a steady frame rate that never changed much but wasn't terribly fast.
Turns out people notice the occasional horrible lag much more than when things perform uniformly.
When tuning a performance critical service, focus on the outliers.
Also, back to the topic of the article at hand, I hope that their "T" is not really two seconds. That is already way too slow for most web purposes.
Good points on why average latency is a bad metric, and while the idea behind Apdex was good, it never ended up being the right measure. The Apdex score still depends on a HiPPO (Highest Paid Person's Opinion) to determine what T should be, and this can change over time.
At SOASTA (and previously at LogNormal), we borrowed the concept of LD50 (the median lethal dose) from biology. The LD50 value has the property of adapting to what your audience thinks rather than what your HiPPO thinks is a good experience.
We described the method at the Velocity conferences (Santa Clara and London) last year, and wrote it up in a blog post here: http://www.lognormal.com/blog/2012/10/03/the-3.5s-dash-for-a...
Hope you find it interesting.
I should also mention that it's useful to apply some kind of smoothing to timeseries data (like latency over time). Holt-Winters double-exponential smoothing is particularly good at this. What it does is smooth out temporary glitches and show you when things turn unexpectedly bad. If you've ever received a page and said, "Oh yeah, that one... that goes away in 3 seconds. Happens every day.", then you'll find this useful. H-W D-E smoothing only shows you the ones that don't go away after 3 seconds.
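A minimal Python sketch of double-exponential smoothing, for anyone who wants to try it. This is the two-parameter level-plus-trend form (often called Holt's linear method) with no seasonal term. The `alpha` and `beta` values and the sample series are arbitrary assumptions, not tuned recommendations.

```python
def double_exponential_smoothing(series, alpha, beta):
    """Smooth a series by tracking a level and a trend estimate.

    alpha controls how fast the level follows new data,
    beta controls how fast the trend estimate is updated.
    """
    if len(series) < 2:
        return list(series)
    level = series[0]
    trend = series[1] - series[0]   # simple initial trend guess
    smoothed = [level]
    for x in series[1:]:
        prev_level = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        smoothed.append(level)
    return smoothed

# Hypothetical latency series (ms) with one short-lived glitch.
latencies = [100] * 10 + [500] + [100] * 10
smoothed = double_exponential_smoothing(latencies, alpha=0.2, beta=0.1)
print(round(max(smoothed), 1))
```

With these settings the one-sample glitch to 500 is damped heavily in the smoothed series, while a sustained shift would keep pulling the level up sample after sample.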