kqr · 6y ago

This reads like a circular argument to me. I'm saying that the MLE is not a very useful measure of location for fat-tailed distributions, since extreme values contribute so much to the expectation.

gjm11 · 6y ago

I think we could do with being a bit more explicit about some distinctions here. We have:

1. the observations we make

from which we can derive

2. statistics like the sample mean, the sample median, etc.

which may or may not give us useful information about

3. the actual underlying distribution

which has

4. properties like its mean (= expectation), median, etc.

which may or may not give us useful information about

5. the _future_ behaviour of the system in question.

We probably care a lot about extreme values in #5, but depending on what the actual distribution is and what we know about it a priori, the right way to predict extreme values in #5 may or may not involve paying a lot of attention to extreme values in #1.

If, to take a very heavy-tailed example, we know that the actual distribution's PDF is proportional to 1/(1+(x-m)^2) for some m, then the sample mean is no more informative than any single sample; if we have a lot of samples and want to estimate m, then we have to pay much less attention to outliers than we do when calculating the mean or even the median.

Note that here m is the centre (location parameter) of both #3 and #5.
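
A minimal Python sketch of the claim about the sample mean, assuming the standard case m=0 and drawing samples by inverting the CDF. It compares how widely the sample mean and the sample median scatter over repeated experiments. The function names, seed and replication count are illustrative choices, not anything from the thread.

```python
import math
import random
import statistics


def cauchy_sample(rng, m=0.0, scale=1.0):
    """Draw one Cauchy variate by inverting the CDF."""
    return m + scale * math.tan(math.pi * (rng.random() - 0.5))


def iqr(values):
    """Interquartile range: a spread measure that exists even without a variance."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


def spread_of_estimators(n, reps=1000, seed=0):
    """Return (IQR of sample means, IQR of sample medians) over reps experiments of size n."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        xs = [cauchy_sample(rng) for _ in range(n)]
        means.append(statistics.fmean(xs))
        medians.append(statistics.median(xs))
    return iqr(means), iqr(medians)


if __name__ == "__main__":
    for n in (1, 10, 100):
        mean_spread, median_spread = spread_of_estimators(n)
        print(f"n={n:4d}  IQR of mean={mean_spread:.2f}  IQR of median={median_spread:.2f}")
```

The mean's spread should stay near 2 (the interquartile range of a single standard Cauchy draw) however large n gets, while the median's spread shrinks roughly like 1/sqrt(n).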

For heavy-tailed distributions, it may well be that the "obvious" location parameter is not a good measure of the expectation of what we actually care about, e.g. because small changes make no difference to speak of but large ones can put us out of business. In that case, what we want under #4 might actually be something like "the 99th centile". But that still doesn't mean that outlying values in #1 are a good indicator of those.

Here's an actual example. I generated 100 samples from a Cauchy distribution with PDF proportional to 1/(1+x^2). (So it is centred on zero; strictly speaking a Cauchy distribution has no finite mean, so "mean" below is shorthand for its location parameter.) I computed the max-likelihood parameters of the distribution using those samples; they turned out to correspond to a PDF proportional to 1/(1+((x-a)/b)^2) where a=0.2240 and b=1.095. Here a is the location parameter of the distribution and b is a scale parameter.

I then threw in an extra 1-in-10000 outlying point at x=3183 and computed the new max-likelihood parameters. They had a=0.2235 and b=1.121. So (1) an extra large positive outlier moves our idea of the mean further to the left (though only a little) but (2) it quite correctly increases our idea of how broad the distribution is.

What's the net effect of that on, say, our prediction of the 99th centile? It goes from 35.08 with the "real" data to 35.91 with the extra outlier. That's an increase but a rather small one. (The "correct" figure -- i.e., the actual 99th centile of the distribution I really drew the samples from -- is 31.8, so we've overestimated it a bit even without the extra spurious outlier.)

If instead we took the maximum of the samples (without my spurious extra outlier) as a guide to likely future events, we'd have got 40.5, a much worse approximation than we got by finding max-likelihood parameter estimates.
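
The Cauchy experiment can be sketched in Python roughly as follows. The comment does not say which optimizer produced the fit, so this sketch swaps in a fixed-point (EM-style) iteration for a t distribution with one degree of freedom; the seed, tolerances and function names are assumptions, and a different random sample will give different numbers from those quoted.

```python
import math
import random
import statistics


def cauchy_sample(rng, a=0.0, b=1.0):
    """Draw one Cauchy(a, b) variate by inverting the CDF."""
    return a + b * math.tan(math.pi * (rng.random() - 0.5))


def cauchy_quantile(p, a, b):
    """Quantile function of a Cauchy with location a and scale b."""
    return a + b * math.tan(math.pi * (p - 0.5))


def cauchy_mle(xs, tol=1e-10, max_iter=10000):
    """Max-likelihood (a, b) by iterating the reweighted likelihood equations."""
    a = statistics.median(xs)
    q1, _, q3 = statistics.quantiles(xs, n=4)
    b = (q3 - q1) / 2 or 1.0  # half the IQR as a starting scale; 1.0 if degenerate
    n = len(xs)
    for _ in range(max_iter):
        # Weights shrink towards 0 for points far from a, so outliers count for little.
        w = [2.0 / (1.0 + ((x - a) / b) ** 2) for x in xs]
        a_new = sum(wi * x for wi, x in zip(w, xs)) / sum(w)
        b_new = math.sqrt(sum(wi * (x - a_new) ** 2 for wi, x in zip(w, xs)) / n)
        converged = abs(a_new - a) < tol and abs(b_new - b) < tol
        a, b = a_new, b_new
        if converged:
            break
    return a, b


if __name__ == "__main__":
    rng = random.Random(0)
    xs = [cauchy_sample(rng) for _ in range(100)]
    outlier = cauchy_quantile(1 - 1e-4, 0.0, 1.0)  # the 1-in-10000 point, about 3183
    for label, data in (("original", xs), ("with outlier", xs + [outlier])):
        a, b = cauchy_mle(data)
        print(f"{label:13s} a={a:.4f} b={b:.3f} "
              f"99th centile={cauchy_quantile(0.99, a, b):.2f}")
    print(f"true 99th centile={cauchy_quantile(0.99, 0.0, 1.0):.2f}  "
          f"sample max={max(xs):.2f}")
```

The weights in `cauchy_mle` are what make the fit insensitive to the added point: at x=3183 the weight is tiny, so the point mostly nudges the scale b upwards rather than dragging a towards it.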

For comparison, I did the same with a normal distribution with mean 0 and standard deviation 1. ML parameter estimates from my 100 samples were mean 0.078 and standard deviation 0.969; actual 99th centile is 2.326, estimated 99th centile is 2.332, max of 100 samples is 2.994, a much worse estimate.

If we throw in a 1-in-10000 outlier (at x=3.719) then our estimated 99th centile moves to 2.508, a much larger change than for the Cauchy distribution relative either to the estimates themselves or to the value of the outlier we added in. It's also further out "in probability": that is, CDF(actual distribution, perturbed 99th centile) is bigger for the normal than for the Cauchy distribution.
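
A matching sketch for the normal case, again in Python with an arbitrary seed and illustrative function names. It fits by maximum likelihood, adds the 1-in-10000 point, and then redoes the "in probability" comparison using the perturbed centiles quoted above (2.508 and 35.91) rather than freshly simulated ones.

```python
import math
import random
import statistics
from statistics import NormalDist


def normal_mle(xs):
    """Max-likelihood mean and standard deviation (divides by n, not n - 1)."""
    return statistics.fmean(xs), statistics.pstdev(xs)


def normal_centile(mu, sigma, p=0.99):
    """p-th quantile of a normal with the given mean and standard deviation."""
    return NormalDist(mu, sigma).inv_cdf(p)


def cauchy_cdf(x, a=0.0, b=1.0):
    """CDF of a Cauchy with location a and scale b."""
    return 0.5 + math.atan((x - a) / b) / math.pi


if __name__ == "__main__":
    rng = random.Random(0)
    xs = [rng.gauss(0.0, 1.0) for _ in range(100)]
    outlier = NormalDist().inv_cdf(1 - 1e-4)  # the 1-in-10000 point, about 3.719
    for label, data in (("original", xs), ("with outlier", xs + [outlier])):
        mu, sigma = normal_mle(data)
        print(f"{label:13s} mean={mu:.3f} sd={sigma:.3f} "
              f"99th centile={normal_centile(mu, sigma):.3f}")
    # How far out, under the true distributions, do the quoted perturbed centiles sit?
    print(f"normal CDF at 2.508 = {NormalDist().cdf(2.508):.5f}")
    print(f"Cauchy CDF at 35.91 = {cauchy_cdf(35.91):.5f}")
```

Because the normal fit uses the plain mean and standard deviation, the added point enters with full weight, which is why it shifts the estimated centile proportionally more than in the Cauchy fit.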

In other words, it looks as if even if what we care about is predicting outlying values, (1) we probably don't do well to use outliers in our data directly -- that gives us extremely noisy estimates for fat-tailed distributions; and (2) the impact on our predictions of any given outlying value will, if we do it right, typically be smaller for fat-tailed distributions than for thin-tailed ones.

BUT a cautionary note: there is an important thing missing in the experiments I did. I found point estimates of the actual distribution and worked from those. It would be better to find probability distributions over actual distribution parameters and use those. You could e.g. do that with MCMC methods. Throwing in outliers that don't actually come from the true underlying distribution will tend to make your posterior distribution broader, which means that if you compute e.g. a posterior 99th centile value then it will be larger. I am fairly sure that conclusions (1) and (2) above will still hold, perhaps more strongly than in the simpler analysis I did.
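
One way the MCMC suggestion might look, as a rough Python sketch rather than a worked-out analysis. It assumes a flat prior on (a, log b), uses a plain random-walk Metropolis sampler with a hand-picked step size, and reads the posterior predictive 99th centile off the Cauchy CDF averaged over the parameter draws. All of those choices, and the function names, are assumptions made for illustration.

```python
import math
import random
import statistics


def cauchy_sample(rng, a=0.0, b=1.0):
    """Draw one Cauchy(a, b) variate by inverting the CDF."""
    return a + b * math.tan(math.pi * (rng.random() - 0.5))


def log_posterior(a, log_b, xs):
    """Cauchy log-likelihood in (a, log b); with the assumed flat prior this is
    the log-posterior up to an additive constant."""
    b = math.exp(log_b)
    return -len(xs) * log_b - sum(math.log1p(((x - a) / b) ** 2) for x in xs)


def metropolis(xs, steps=4000, burn_in=1000, step=0.15, seed=0):
    """Random-walk Metropolis over (a, log b); returns post-burn-in (a, b) draws."""
    rng = random.Random(seed)
    a, log_b = statistics.median(xs), 0.0
    current = log_posterior(a, log_b, xs)
    draws = []
    for i in range(steps):
        a_new = a + rng.gauss(0.0, step)
        log_b_new = log_b + rng.gauss(0.0, step)
        proposed = log_posterior(a_new, log_b_new, xs)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, proposed - current)):
            a, log_b, current = a_new, log_b_new, proposed
        if i >= burn_in:
            draws.append((a, math.exp(log_b)))
    return draws


def predictive_cdf(x, draws):
    """Posterior predictive CDF: the Cauchy CDF averaged over parameter draws."""
    return sum(0.5 + math.atan((x - a) / b) / math.pi for a, b in draws) / len(draws)


def predictive_centile(draws, p=0.99, lo=-1e6, hi=1e6):
    """Solve predictive_cdf(x) = p by bisection (the CDF is increasing in x)."""
    for _ in range(80):
        mid = (lo + hi) / 2
        if predictive_cdf(mid, draws) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    rng = random.Random(0)
    xs = [cauchy_sample(rng) for _ in range(100)]
    for label, data in (("original", xs), ("with outlier", xs + [3183.0])):
        draws = metropolis(data)
        print(f"{label:13s} predictive 99th centile={predictive_centile(draws):.2f}")
```

A real analysis would also want convergence checks and a considered prior; the sketch only shows where the parameter uncertainty enters the centile.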

kqr (OP) · 6y ago

Thank you! This is a very well-reasoned comment that I will find myself returning to in the future.
