Why Average Latency Is a Terrible Way To Track Website Performance

(mvolo.com)

40 points · mvolo · 13y ago · 30 comments

26 comments · 8 top-level

aetherson13y ago· 7 in thread

TL;DR: Average anything is a terrible way to track anything. (And median or mode are bad, too). Any single-scalar value that compresses information that is best expressed as a graph (or multiple graphs!) is immensely lossy to the point where arguably it obfuscates more than it makes clear.

Back when we had to live with sort of printing-press methods of displaying information (ie, where anything that wasn't pure text was very difficult to display), mean/median/mode numbers were a necessary evil. But if you're looking at a computer screen, there's really no reason to subject yourself to an abstraction that throws out 90% of your data.

mjn13y ago

This was one of the more interesting realizations when I was an undergraduate writing my first research paper. We were testing latency of MIDI interfaces, and after sanity checking by looking at some of the underlying data, realized that average, or even average+stddev, was obscuring a lot of stuff. For example, note-to-note consistency is a major issue in music interfaces, often more important than absolute latency, since the spacing between notes is very important to melody perception (games often have a similar issue).

Showing the full histogram isn't a full solution either, though. Not only does using the average latency obscure the issue by boiling it down to a single scalar, but the full histogram of latencies also loses the information on note-to-note consistency! That's because a latency histogram loses sequencing information, so it doesn't distinguish between the case where you had a lot of 20ms latencies in a row followed by a lot of 50ms latencies in a row, and the case where every other message oscillated between 20ms and 50ms latencies (much worse). You can try to capture some of that information by making a histogram of adjacent-latency deltas, as one attempt. Or you can capture a different view on it by plotting latency vs. time and looking for spikes (but that can obscure less-obvious trends, and is unwieldy as a data representation if you're trying to summarize a system's behavior over a period of hours).
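A minimal sketch of the adjacent-delta idea, using made-up latencies: the two sequences below hold the same values, so their plain histograms are identical, while the histogram of differences between consecutive messages separates them. The function names and data are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def latency_histogram(latencies_ms):
    """Count how often each latency value occurs (ordering is discarded)."""
    return Counter(latencies_ms)


def adjacent_delta_histogram(latencies_ms):
    """Count absolute differences between each pair of consecutive latencies."""
    deltas = [abs(b - a) for a, b in zip(latencies_ms, latencies_ms[1:])]
    return Counter(deltas)


# Hypothetical data: same values, different ordering.
blocked = [20] * 4 + [50] * 4   # a run of 20ms, then a run of 50ms
alternating = [20, 50] * 4      # every other message flips between 20ms and 50ms

# The plain histograms cannot tell the two cases apart.
print(latency_histogram(blocked) == latency_histogram(alternating))

# The delta histograms can: one 30ms jump versus a 30ms jump on every message.
print(adjacent_delta_histogram(blocked))
print(adjacent_delta_histogram(alternating))
```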

The paper is here, though the actual numbers are 9 years old at this point, so probably not that useful: http://www.cs.hmc.edu/~bthom/res/midi_timing/publications/IC...

stephencanon13y ago

> Average anything is a terrible way to track anything.

Came here to say exactly this. And averages are especially insidious when used for data that doesn't have a symmetric distribution, like most latencies.

mvoloOP13y ago

hi Steve,

Author here. I think most people on HN would echo your sentiment about averages wholesale ... But I wanted to go a little deeper into selecting a better alternative for operational monitoring.

It's easy to say "averages are bad" but harder to say "use X instead", and explain why. We tried. Do you think we did it?

jacques_chester13y ago

Additional standard statistics like mode, median, quartiles etc are really useful.

And you can always throw things into gnuplot to get a quick, exploratory look at things. It will at least give you a sense of whether you're looking at a normal distribution, something skewed, multi-modal distributions etc etc.

mvoloOP13y ago

Hi, author here.

I am in complete agreement. Unfortunately, a lot of monitoring and APM tools still lead with average response time as one of the toplevel metrics. And a lot of people still make incorrect assumptions based on it.

Although, the percentile on average latency is not great either. I try to make the case for using a metric that counts acceptable experiences vs. their latency value, e.g. the Apdex index or our derived sat score.

Best, Mike
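For readers who have not met Apdex, here is a rough sketch of the standard score: requests at or under a threshold T count as satisfied, requests between T and 4T count as tolerating (half credit), and the rest count as frustrated. The threshold and the sample latencies are invented for illustration, and this does not attempt to reproduce the author's derived sat score, whose definition is not given in this thread.

```python
def apdex(latencies_s, t):
    """Apdex score for threshold t: (satisfied + tolerating / 2) / total."""
    if not latencies_s:
        raise ValueError("need at least one sample")
    satisfied = sum(1 for x in latencies_s if x <= t)
    tolerating = sum(1 for x in latencies_s if t < x <= 4 * t)
    return (satisfied + tolerating / 2) / len(latencies_s)


# Hypothetical response times in seconds, with an assumed T of 0.5s.
samples = [0.1, 0.2, 0.3, 0.4, 0.6, 1.0, 1.9, 2.5, 3.0, 9.0]
score = apdex(samples, t=0.5)
print(f"Apdex(T=0.5s) = {score:.2f}")
```

The score counts experiences rather than averaging their durations, so the single 9.0s request weighs exactly as much as the 2.5s one.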

sp33213y ago

I think The Tech Report has my favorite benchmark graphs. They sort the data points by latency so you can intuitively see the distribution of your samples. e.g. http://techreport.com/review/24022/does-the-radeon-hd-7950-s...

rm99913y ago

I almost completely agree with you. I often tell people that statistics is the study of compressing information in useful ways. That said, scalar statistics can be very useful if the compression is 'correct'. For example, if you have an a priori reason to believe a distribution will be gaussian (a very common situation, and an assumption that basically allowed statistics to grow to where it is today), mean and variance will fully describe the distribution. Many other common distributions can be fully described by a small number of parameters.

taylorbuley13y ago· 3 in thread

One problem I have with this approach is that it requires you to pick a threshold after which the response is "too slow." This number can change a lot over the course of an application lifetime, and would be hard to pick objectively anyways.

Median latency -- perhaps with (the smoothing-effect of) a rolling median -- would be more robust to outliers without having to resort to hardcoding of "too slow" thresholds. It would still require the human to connect the dots (e.g. median latency of >200 is "too slow") but it's an improvement on mere average response time for reasons noted.
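A small sketch of the rolling median mentioned above, assuming a fixed-size window over the most recent samples. The window size and the latency values are made up for illustration.

```python
from collections import deque
import statistics


def rolling_median(values, window):
    """Median of the last `window` values, emitted once per incoming value."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque(maxlen=window)  # old samples fall off the left automatically
    medians = []
    for value in values:
        recent.append(value)
        medians.append(statistics.median(recent))
    return medians


# Hypothetical latencies in ms with a single 5000ms outlier.
latencies_ms = [100, 110, 105, 5000, 108, 102, 107]
print(rolling_median(latencies_ms, window=5))
```

The outlier barely moves the rolling median, whereas it would dominate a rolling mean over the same window.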

RyanZAG13y ago

This actually calls for advances in analytics packages. If you could specify the threshold on a per page (or regular expression for complex requests), and have the system track and notify you of threshold exceptions, this wouldn't be too difficult to manage with some nice sliders.

Sounds like a good problem for an analytics startup to tackle.
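One way the per-page threshold idea could look in code, as a sketch only: an ordered list of regular expressions mapped to thresholds, with a fallback default. The patterns, threshold values, and function names are all hypothetical.

```python
import re

# Hypothetical rules: the first matching pattern wins.
THRESHOLDS_S = [
    (re.compile(r"^/api/"), 0.2),      # API calls should be fast
    (re.compile(r"^/reports/"), 5.0),  # heavy report pages get more slack
]
DEFAULT_THRESHOLD_S = 1.0


def threshold_for(path):
    """Return the latency threshold (seconds) that applies to a URL path."""
    for pattern, limit in THRESHOLDS_S:
        if pattern.search(path):
            return limit
    return DEFAULT_THRESHOLD_S


def threshold_exceptions(requests):
    """Return the (path, latency) pairs that exceeded their own threshold."""
    return [(path, latency) for path, latency in requests
            if latency > threshold_for(path)]


# Hypothetical request log: (path, latency in seconds).
log = [("/api/users", 0.35), ("/reports/q3", 3.0), ("/home", 0.8), ("/home", 1.4)]
print(threshold_exceptions(log))
```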

pla3rhat3r13y ago

I agree with this. It's hard to gauge what is acceptable because it really depends on the application. So many other dependencies when dealing with latency and how it affects performance.

mvoloOP13y ago

hi, author here.

Unfortunately, you HAVE TO do it. If you do not set a threshold for what is acceptable, how do you determine whether or not you are providing an acceptable experience to your users?

No amount of aggregate metrics can help you answer this question unless you know what's acceptable, and what isn't - for each important set of URLs in your app.

I agree that it's "hard" to do. In our own product (https://www.leansentry.com), we solve this problem by grouping urls, and using good defaults / making it easy to override the thresholds for users.

Jabbles13y ago· 3 in thread

I thought looking at the 99th (or other) percentile was pretty standard practice?

acdha13y ago

Depends on standard - within the clued-in performance community, yes, but there are major, major companies still pushing averages and that causes a lot of people, particularly those without much stats / engineering background, to expect it everywhere.

To use one example which is prevalent throughout marketing, advertising, etc. Google Analytics reports only averages – this makes the results unreliable enough that I'm now advising people to simply pretend that field does not exist as it's completely untrustworthy. Awhile back I blogged about an example where 3 samples out of 200K threw the average off by a full order of magnitude: http://chris.improbable.org/2012/05/18/google-analytics-dece...
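To make the scale of that effect concrete, here is a sketch with invented numbers (not the data from the blog post): 200,000 samples where all but three take one second. The percentile helper uses the nearest-rank definition, which is one of several common conventions.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical page-load times in seconds: three absurd outliers among 200,000 samples.
samples = [1.0] * 199_997 + [600_000.0] * 3

mean = statistics.fmean(samples)
median = statistics.median(samples)
p99 = percentile(samples, 99)

print(f"mean={mean:.2f}s median={median:.2f}s p99={p99:.2f}s")
```

With these numbers the mean lands near ten seconds while the median and the 99th percentile both stay at one second, so the average alone would suggest a site roughly ten times slower than what nearly every visitor saw.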

Jabbles13y ago

Very interesting, thank you. I especially like the replies from the Google analytics team, 8 months apart, that both acknowledge the issue and say they'll fix it...

mvoloOP13y ago

hi Jabbles,

Author here. The 99 or 95 percentile is a much better metric! We also make the case for the industry standard Apdex or our derived metric, sat score. These are becoming more and more in use by APM tools like us or New Relic.

Unfortunately, many existing tools and people who use them still look at latency aggregates and often make incorrect assumptions.

edouard123456713y ago· 2 in thread

I think there's something to be said about keeping some key metrics super simple so that "everybody" can understand without having to refer to a formula or arbitrarily set thresholds. I've been using 99 and 90 percentile avg performance. It captures enough information in most cases and doesn't require any explanation.

mvoloOP13y ago

Hi edouard,

I completely agree! Keeping toplevel metrics SIMPLE is the key. Of course, simple but also not misleading you into any wrong beliefs.

While we liked the 95th percentile approach, we decided against it. It's still too focused on the actual response time itself, which we thought was less relevant than the number of users experiencing bad performance.

I think for us the bottom line was:

A) If you are having site-wide performance issues, the 95th percentile is a good metric.

B) However, if you have more isolated issues (we find this happens more often to more mature sites), satisfaction score is better.

Best, Mike

taproot13y ago

I'm seeing a lot of "averages are bad" etc., but I think you come closest to what I had in mind: there isn't anything inherently wrong with using simple metrics. The caveat is that you need to keep in mind and understand their limitations and where they fall down. I think a lot of people understand that when using 99th or 95th percentiles and whatnot, but just failed to lay the reasoning out.

nateabele13y ago· 2 in thread

Searched the page for "standard deviation". Didn't find it. Hit the back button.

rm99913y ago

Standard deviation isn't the problem, skew is. Yes, skew will increase the standard deviation, but the heart of the issue here is how fat the right tail of the distribution is.

Standard deviation is often a useful metric, but it's at least as flawed as mean in skewed distributions because it doesn't treat either direction around the (already flawed) mean any differently.

mvoloOP13y ago

hi nateabele,

Author here. Did you also search for stdev, st.dev, variance? Just kidding.

The post is not about averages. It's about selecting the right metric to track website performance. Standard deviation would surely qualify the avg. latency a bit, but it would still be a pretty lame alternative to using a better toplevel metric like Apdex.

Best, Mike

spitfire13y ago· 1 in thread

Michael Abrash talked about this in his black book of graphics programming.

When he was writing Quake, they could trade off between lightning-fast graphics (40fps+, on a 486) 99% of the time with the occasional horrible slowdown to less than 5fps, vs. a steady frame rate that never changed much but wasn't terribly fast.

Turns out people notice the occasional horrible lag much more than when things perform uniformly.

When tuning a performance critical service, focus on the outliers.

jamesaguilar13y ago

I think you should not ignore either. By default, think about 99%ile and 50%ile when tuning and optimizing. Depending on the context (e.g. games), even 99%ile might not be enough, or you might want to think about 99%ile of what? Frames? Scenes? Seconds of gameplay?

Also, back to the topic of the article at hand, I hope that their "T" is not really two seconds. That is already way too slow for most web purposes.

bluesmoon13y ago

Posted this comment on the article, but thought it would be useful here as well:

Good points on why average latency is a bad metric, and while the idea behind Apdex was good, it never ended up being the right measure. The Apdex score still depends on a HiPPO (Highest Paid Person's Opinion) to determine what T should be, and this can change over time.

At SOASTA (and previously at LogNormal), we borrowed the concept of LD50 (the median lethal dose) from biology. The LD50 value has the property of adapting to what your audience thinks rather than what your HiPPO thinks is a good experience.

We described the method at the Velocity conferences (Santa Clara and London) last year, and wrote it up in a blog post here: http://www.lognormal.com/blog/2012/10/03/the-3.5s-dash-for-a...

Hope you find it interesting.

I should also mention that it's useful to apply some kind of smoothing to timeseries data (like latency over time). Holt-Winters double-exponential smoothing is particularly good at this. What it does is smooth out temporary glitches and show you when things turn unexpectedly bad. If you've ever received a page and said, "Oh yeah, that one... that goes away in 3 seconds. Happens every day.", then you'll find this useful. H-W D-E smoothing only shows you the ones that don't go away after 3 seconds.
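A bare-bones sketch of double exponential smoothing (Holt's level-plus-trend method, the non-seasonal part of the Holt-Winters family), to show the mechanics behind the comment above. The smoothing factors, the initialisation of the trend from the first two points, and the sample series are assumptions chosen for illustration, not the configuration described by the commenter.

```python
def double_exponential_smoothing(series, alpha, beta):
    """Smooth a time series while tracking both its level and its trend.

    alpha weights new observations into the level; beta weights level
    changes into the trend. Both are expected to lie in (0, 1).
    """
    if len(series) < 2:
        raise ValueError("need at least two points to initialise the trend")
    level = series[0]
    trend = series[1] - series[0]
    smoothed = [level]
    for value in series[1:]:
        previous_level = level
        level = alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
        smoothed.append(level)
    return smoothed


# Hypothetical latency series in ms with one short-lived glitch.
latency_ms = [100, 100, 100, 900, 100, 100]
print(double_exponential_smoothing(latency_ms, alpha=0.2, beta=0.1))
```

Alerting on the smoothed value instead of the raw one means a single 900ms sample only nudges the series, while a sustained shift keeps pulling the level and trend upward until it crosses the alert line.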

joshfraser13y ago

I prefer histograms... http://highscalability.com/blog/2012/5/23/averages-web-perfo...
