> The Normal Distribution is your best guess if you only know the mean and the variance of your data.
This is putting the cart before the horse, for sure. You only know the mean and the variance of your data because you chose to summarize your data that way. And you chose to summarize your data that way in order to get the normal distribution as the maximum entropy distribution.
The normal distribution appears in a lot of places because it is the limiting case of many other distributions; this is the central limit theorem. It is very easy to work with the normal distribution because you can add or subtract a bunch of independent normally distributed variables and the result is just another normally distributed variable. You can add or subtract a bunch of variables with other distributions and the resulting distribution will often be more normal. You can do a lot of work with the normal distribution using linear algebra techniques.
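A rough numerical check of both claims, as a sketch rather than a proof: for independent normal variables the means and variances simply add, and for sums of independent uniform variables the excess kurtosis (0 for a normal) shrinks toward zero as more terms are added. The helper names, seed, and sample sizes are arbitrary choices for illustration.

```python
import random
import statistics

def excess_kurtosis(xs):
    """Sample excess kurtosis; close to 0 for normally distributed data."""
    m = statistics.fmean(xs)
    var = statistics.fmean([(x - m) ** 2 for x in xs])
    m4 = statistics.fmean([(x - m) ** 4 for x in xs])
    return m4 / var ** 2 - 3.0

def sum_of_uniforms(rng, k):
    """Sum of k independent Uniform(0, 1) draws."""
    return sum(rng.random() for _ in range(k))

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 50_000

# N(1, 2^2) + N(-3, 1.5^2) should behave like N(-2, 2^2 + 1.5^2) = N(-2, 6.25).
normal_sums = [rng.gauss(1.0, 2.0) + rng.gauss(-3.0, 1.5) for _ in range(n)]
print("mean:", statistics.fmean(normal_sums))
print("variance:", statistics.pvariance(normal_sums))
print("excess kurtosis:", excess_kurtosis(normal_sums))

# A single uniform has excess kurtosis -1.2; sums of more terms drift toward 0.
kurt_by_k = {
    k: excess_kurtosis([sum_of_uniforms(rng, k) for _ in range(n)])
    for k in (1, 2, 8)
}
for k, kurt in kurt_by_k.items():
    print(f"sum of {k} uniforms: excess kurtosis {kurt:.3f}")
```

Excess kurtosis is only one symptom of non-normality, so this shows the trend, not convergence in distribution.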
So, you choose to measure mean and variance in order to make the math easier. This does not always result in the best outcome. For example, if you need more robust statistics, you might go for median and mean absolute deviation, rather than mean and variance. Then, when you choose the maximum entropy distribution for those constraints, you end up with the Laplace distribution. The Laplace distribution is very inconvenient to work with mathematically, unlike the normal distribution.
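To make the two kinds of summary concrete, here is a small sketch. The maximum-likelihood Laplace fit uses the median as its location and the mean absolute deviation from the median as its scale. With one wild reading added, the median barely moves; the mean absolute deviation is still pulled by the outlier, but less than the standard deviation is. The data values are made up for illustration.

```python
import statistics

def normal_summary(xs):
    """Mean and population standard deviation: the Gaussian's parameters."""
    return statistics.fmean(xs), statistics.pstdev(xs)

def laplace_summary(xs):
    """Median and mean absolute deviation from the median: the
    maximum-likelihood location and scale of a Laplace distribution."""
    med = statistics.median(xs)
    scale = statistics.fmean([abs(x - med) for x in xs])
    return med, scale

clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]  # hypothetical readings
dirty = clean + [1000.0]                                # one bad reading

for name, data in (("clean", clean), ("dirty", dirty)):
    print(name, "normal:", normal_summary(data), "laplace:", laplace_summary(data))
```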
This is non-trivial; it means that we have an online algorithm which sends our measurements directly to our summaries, without worrying about how detailed our measurements are (how many samples are summarized). For comparison, taking the median/quantiles/percentiles/etc. requires either fixed-size buckets which lose precision (as seen in Prometheus histograms) or around O(log log n) space [1], which is still practical but pedantically not constant.
[0] https://en.wikipedia.org/wiki/Algorithms_for_calculating_var...
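A minimal sketch of such an online summary, assuming Welford's update (one of the standard algorithms for running variance; not necessarily the exact one the comment had in mind). The class name `RunningStats` is made up for illustration. Memory use is three numbers no matter how many samples arrive, and two summaries can be merged without revisiting the samples.

```python
class RunningStats:
    """Online mean and variance in constant memory (Welford's update)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def push(self, x):
        """Fold one new measurement into the summary."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        """Combine two summaries into a new one (pairwise update)."""
        merged = RunningStats()
        merged.n = self.n + other.n
        if merged.n == 0:
            return merged
        delta = other.mean - self.mean
        merged.mean = self.mean + delta * other.n / merged.n
        merged.m2 = (self.m2 + other.m2
                     + delta * delta * self.n * other.n / merged.n)
        return merged

    @property
    def variance(self):
        """Population variance; 0.0 before any samples arrive."""
        return self.m2 / self.n if self.n else 0.0

stats = RunningStats()
for x in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.push(x)
print(stats.n, stats.mean, stats.variance)
```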
No, it's not... A Gaussian is the best way to represent your knowledge of a value if you only know the mean and variance of its value.
So if you start with a stack of data and compress it down to a mean and variance, you've discarded most of your knowledge, and are left with a Gaussian as your best guess representation.
Yes, if you were to boil it down to different summary data, like a max and min, you'd end up with a different state of knowledge and a different distribution.
But given a mean and variance, the Gaussian is your best choice of distribution, and not because of the central limit theorem, but because it has maximum entropy on those constraints. You don't always even have access to the source data in the first place, maybe just the summary statistics.
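As a quick sanity check of the maximum entropy claim (a spot check against two alternatives, not a proof), the closed-form differential entropies of a Gaussian, a Laplace, and a uniform distribution can be compared at the same variance. The function names are made up for illustration; entropies are in nats.

```python
import math

def gaussian_entropy(sigma):
    """Differential entropy of N(mu, sigma^2)."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

def laplace_entropy(sigma):
    """Laplace with variance sigma^2 has scale b = sigma / sqrt(2)."""
    b = sigma / math.sqrt(2)
    return 1 + math.log(2 * b)

def uniform_entropy(sigma):
    """Uniform with variance sigma^2 has width w = sigma * sqrt(12)."""
    return math.log(sigma * math.sqrt(12))

for sigma in (0.5, 1.0, 3.0):
    print(sigma, gaussian_entropy(sigma), laplace_entropy(sigma), uniform_entropy(sigma))
```

At every variance the Gaussian comes out on top, with the same gap each time, because rescaling shifts all three entropies by the same amount.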
> I was extremely confused as to why the Normal (Gaussian) Distribution pops up everywhere—in kurtotically-ignorant financial market analysis, in nature, everywhere. Thinking about it, the prevalence of the Gaussian is actually rather abnormal. Can you guess why it’s everywhere?
This is not a "compression of data" question. It's not an "uninformed distributional choice" question.
It's a "why is this distribution prevalent in Nature" question.
In this context, I think the CLT gives a better answer. There are a lot of averaging processes in Nature, and due to the CLT, averaging of independent perturbations must give rise to normal distributions.
It's possible to perhaps go a step deeper than the above. In some physical systems, you can look at the second moment as an energy -- like the voltage-squared in electrical systems.
In this case, due to the a priori finiteness of system energy, the Gaussian distribution can make a claim to being "inevitable" by the maxent argument in OP. ("In a system characterized by finite energy E, what is the least informative distributional constraint?")
The subject of my comment is, "Why are we given mean and variance?" If you take, "We are given mean and variance" as a presupposition, then you are having a different conversation.
The big problem with the maximum entropy argument is that if you apply some transformation to your data, you will end up with a different maximum entropy distribution. For example, you may choose to express your data in terms of rate (events / time) or period (time / events). Maximum entropy won't help you here; you have to have some kind of theoretical understanding of the underlying process that justifies your choice.
The same is true for normal distributions and mean / variance, but it's such an "obvious" choice that people forget to justify their models. My experience is that the premise of the CLT is much easier to justify, and you can use that to support your use of the normal distribution.
And why are they the only things you know? Most often, because they are the things you asked for previously... rather than, as OP hinted, e.g. some facts about quantiles.
There's another interpretation of the mean (and conditional expectation) as quantities minimizing squared error. It's not surprising that squared error and variance are so similar and that these are connected.
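A brute-force sketch of that connection: search a grid of constants for the one that minimizes each loss. The squared-error winner lands on the mean, and for comparison the absolute-error winner lands on the median. The data and the grid step are arbitrary illustrative choices.

```python
import statistics

def squared_loss(c, xs):
    return sum((x - c) ** 2 for x in xs)

def absolute_loss(c, xs):
    return sum(abs(x - c) for x in xs)

def best_constant(loss, xs, step=0.01):
    """Grid search between min(xs) and max(xs) for the constant with lowest loss."""
    lo, hi = min(xs), max(xs)
    n_steps = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n_steps + 1)]
    return min(grid, key=lambda c: loss(c, xs))

data = [1, 2, 3, 4, 10]
print("squared error minimizer:", best_constant(squared_loss, data),
      "mean:", statistics.fmean(data))
print("absolute error minimizer:", best_constant(absolute_loss, data),
      "median:", statistics.median(data))
```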
Now they tell you they maintained a constant speed throughout the trip, a speed somewhere between 20 and 40mph. How confident are you that your friend was driving faster than 30mph?
The principle of maximum entropy, applied to each, gives you different answers. P(30mph) = 0.5 implies the trip takes 2hr40mins, not 3hrs. What gives? Which is the real way we should formulate travel times?
See: https://en.wikipedia.org/wiki/Bertrand_paradox_(probability) Credit for this example: Michael Titelbaum
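The arithmetic behind the two answers can be sketched as follows. The trip length of 80 miles is an inference from the 2hr40mins figure (80 miles at 30mph), since the setup of the example is not shown here; treat it as an assumption. On a bounded interval the maximum entropy distribution is uniform, so the only question is which quantity gets the uniform prior.

```python
def uniform_cdf(x, lo, hi):
    """P(X <= x) for X uniform on [lo, hi]."""
    return min(1.0, max(0.0, (x - lo) / (hi - lo)))

DISTANCE_MILES = 80.0            # assumed: inferred from 2hr40mins at 30mph
V_LO, V_HI, V_CUT = 20.0, 40.0, 30.0

# Formulation 1: uniform prior on speed in [20, 40] mph.
p_fast_speed_prior = 1.0 - uniform_cdf(V_CUT, V_LO, V_HI)

# Formulation 2: uniform prior on travel time in [2, 4] hours.
t_lo = DISTANCE_MILES / V_HI     # fastest trip
t_hi = DISTANCE_MILES / V_LO     # slowest trip
t_cut = DISTANCE_MILES / V_CUT   # faster than 30mph means a shorter trip than this
p_fast_time_prior = uniform_cdf(t_cut, t_lo, t_hi)

print("P(faster than 30mph), uniform on speed:", p_fast_speed_prior)
print("P(faster than 30mph), uniform on time: ", p_fast_time_prior)
```

Same information, two coordinate systems, and the answer moves from 1/2 to 1/3.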
You can get around this by choosing a non-max entropy prior, like [1], or by deciding on the One True Formulation for your problem. But (Bayesian) updating on the other formulations of the problem won't do it, because there isn't any information in the other formulations -- they're equivalent (by def).
Consider a measurement of some uncertainly sized cubes. You could describe them with their edge length or their volume. Learning one tells you the other. They're equivalent data. However a maximum entropy distribution on one isn't a maximum entropy distribution on the other.
Pragmatically, there's always something you can do (e.g. a Jeffreys prior), but philosophically, this has always made me uneasy with justifications about max entropy that don't also have justifications of the choice of coordinate system.
Personally, I'm of the view that the Kullback-Leibler divergence, which is defined for arbitrary probability measures (with no special treatment for continuous ones) and which is independent of the choice of coordinates, is the true measure of information.
Its downside is that you can only compare 2 distributions that way. For the discrete case you can just pick the uniform distribution as your non-informative base. The issues with the entropy definition for continuous distributions boil down to the problem of picking a uniform distribution for the real numbers.
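A small sketch of the coordinate-independence point, using two Gaussians because both quantities have closed forms. Rescaling the variable (say, metres to millimetres, a factor chosen only for illustration) shifts the differential entropy by the log of the scale factor but leaves the KL divergence between the two distributions unchanged. The function names are made up.

```python
import math

def gaussian_entropy(sigma):
    """Differential entropy of N(mu, sigma^2), in nats."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

def kl_gaussians(mu_p, sigma_p, mu_q, sigma_q):
    """KL(P || Q) for P = N(mu_p, sigma_p^2) and Q = N(mu_q, sigma_q^2), in nats."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sigma_q ** 2)
            - 0.5)

SCALE = 1000.0  # change of units: x -> SCALE * x

mu_p, sigma_p = 0.0, 1.0
mu_q, sigma_q = 1.0, 2.0

entropy_before = gaussian_entropy(sigma_p)
entropy_after = gaussian_entropy(SCALE * sigma_p)
kl_before = kl_gaussians(mu_p, sigma_p, mu_q, sigma_q)
kl_after = kl_gaussians(SCALE * mu_p, SCALE * sigma_p, SCALE * mu_q, SCALE * sigma_q)

print("entropy:", entropy_before, "->", entropy_after)
print("KL:     ", kl_before, "->", kl_after)
```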
I fail to see the motivation of this step, and I think that's preventing me from seeing the argument as "intuitive". Could somebody explain?
Two steps back (info ∝ 1/p), it still makes sense to me: the rarer the event is, the bigger the resulting number is, so if the event happens, the more "surprised" we are, and the more information is gained. However, what do we achieve by weighing the bitcount of the information with the probability?
That's why we need to take a middle ground between very common events that aren't surprising, and gain us hardly anything, and rare events that gain us a lot of information, but happen so rarely that they don't matter a whole lot.
The formula derived here manages to find the balance between these two extremes.
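To see the balance numerically, here is a minimal sketch: each outcome contributes its surprise in bits, log2(1/p), weighted by how often it happens, p. The function names and the example distributions are made up for illustration.

```python
import math

def surprise_bits(p):
    """Information gained when an event of probability p occurs."""
    return math.log2(1.0 / p)

def entropy_bits(probs):
    """Expected surprise: each outcome's bits weighted by its probability."""
    return sum(p * surprise_bits(p) for p in probs if p > 0)

# Contribution of a single outcome: very common and very rare events both add little.
for p in (0.999, 0.5, 0.001):
    print(f"p={p}: surprise={surprise_bits(p):.3f} bits, "
          f"weighted={p * surprise_bits(p):.4f} bits")

print("fair coin:      ", entropy_bits([0.5, 0.5]))
print("99/1 coin:      ", entropy_bits([0.99, 0.01]))
print("fair 4-way die: ", entropy_bits([0.25] * 4))
```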
Remember that if x is a random variable, then its expected value is
E[x] = Sum_{All x} x p(x)
The interpretation is that E[x] is the average value of x. In particular, if we observe a bunch of x's one after the other, call them x_i, then the sample mean
S_n = (1/n) Sum_i x_i,
which for a given "n" is random, converges to the deterministic constant E[x] as "n" gets large. In this sense, the above formula for E[x] is "inevitable".
In the above case, the thing being averaged is "x", but the same holds true for x^2 ("what is the average value of x^2 ?"), or, in general, f(x) for any function f(.).
In this case, we're using the rather unusual function
f(x) = log_2(1/p(x)),
but the same intuition holds. It's the average number of bits needed to encode x, for instance.
minor correction: 1957
This is a more detailed introduction to the subject (from 1962): https://bayes.wustl.edu/etj/articles/brandeis.pdf
That's awful advice for some domains. If your process dynamics are badly behaved (statistically), such as power laws and the like, it turns out the "mean" and "variance" you're calculating from samples are probably rubbish.
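A small simulation sketch of the problem, assuming a Pareto distribution with shape 1.5 as the stand-in power law (it has a finite mean of 3 but infinite variance). The sample variance does not settle down as more data arrives, while the median stays well behaved. The seeds and sample sizes are arbitrary.

```python
import random
import statistics

ALPHA = 1.5  # Pareto shape: finite mean (3.0), infinite variance

def pareto_samples(n, alpha, seed):
    """n draws from a Pareto distribution with minimum value 1."""
    rng = random.Random(seed)
    return [rng.paretovariate(alpha) for _ in range(n)]

def summarize(xs):
    """Sample mean, sample (population) variance, and median."""
    m = statistics.fmean(xs)
    var = statistics.fmean([(x - m) ** 2 for x in xs])
    return m, var, statistics.median(xs)

for n in (1_000, 10_000, 100_000):
    mean, var, med = summarize(pareto_samples(n, ALPHA, seed=n))
    print(f"n={n}: mean={mean:.3f} variance={var:.1f} median={med:.3f}")

# For reference, the true median is 2 ** (1 / ALPHA).
print("true median:", 2 ** (1 / ALPHA))
```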
Choosing a starting distribution is actually a statement on how you're exposing yourself to risk, there is no such thing as "best guess".
My only advice is to end with a list of maximum entropy distributions to showcase the many applications of this theory. I often refer to such tables when I have varying constraints and want the best choice for representing the spread of the data.
See the table in https://en.wikipedia.org/wiki/Maximum_entropy_probability_di...
Plus, as klodolph said, arbitrarily restricting your knowledge to the mean and the variance as summary statistics is what leads to the Gaussian distribution. Moreover, in practice, arbitrarily restricting your knowledge is a violation of probability as a model of intuition, as shown by Jaynes.