Maximum Entropy Intuition for Fundamental Statistical Distributions

(longintuition.com)

95 points · yetanothermonk · 5y ago · 45 comments

35 comments · 10 top-level

klodolph · 5y ago · 7 in thread

> Statisticians are quick to reach for the Central Limit Theorem, but I think there’s a deeper, more intuitive, more powerful reason.

> The Normal Distribution is your best guess if you only know the mean and the variance of your data.

This is putting the cart before the horse, for sure. The reason why you only know the mean and the variance of your data is because you chose to summarize your data that way. And, the reason why you chose to summarize your data that way is in order to get the normal distribution as the maximum entropy distribution.

The normal distribution appears in a lot of places because it is the limiting case of many other distributions; this is the central limit theorem. It is very easy to work with the normal distribution because you can add or subtract a bunch of independent normally distributed variables and the result is just another normal distribution. You can add or subtract a bunch of other distributions and the resulting distribution will often be more normal. You can do a lot of work with the normal distribution using linear algebra techniques.

So, you choose to measure mean and variance in order to make the math easier. This does not always result in the best outcome. For example, if you need more robust statistics, you might go for median and mean absolute deviation, rather than mean and variance. Then when you choose the maximum entropy distribution from the result, you end up with the Laplace distribution. The Laplace distribution is very inconvenient to work with mathematically, unlike the normal distribution.

Kednicma · 5y ago

You are spot-on. I would just add that there's another relatively beautiful reason why, in practice, folks pick the mean and variance for their summary: it can be done online with live data, in O(1) time per sample and O(1) space! [0] If we extend this idea, then we easily get skew and kurtosis from the next two moments, again in constant time and space, and again getting a normal-like distribution but now with skew and squish.

This is non-trivial; it means that we have an online algorithm which sends our measurements directly to our summaries, without worrying about how detailed our measurements are (how many samples are summarized). For comparison, taking the median/quantiles/percentiles/etc. requires either fixed-size buckets which lose precision (as seen in Prometheus histograms) or around O(log log n) space [1], which is still practical but pedantically not constant.

[0] https://en.wikipedia.org/wiki/Algorithms_for_calculating_var...

[1] https://tmc.web.engr.illinois.edu/lb.ps
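
A minimal sketch of the kind of online update described above, using the well-known Welford recurrence. The class name `RunningStats` and the sample values are illustrative assumptions, not something taken from the linked page.

```python
class RunningStats:
    """Running mean and variance, O(1) time per sample and O(1) space."""

    def __init__(self):
        self.n = 0        # number of samples seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def push(self, x):
        # Welford update: only three numbers are kept, whatever n is.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Sample variance (n - 1 denominator); needs at least two samples.
        if self.n < 2:
            raise ValueError("need at least two samples")
        return self.m2 / (self.n - 1)


stats = RunningStats()
for sample in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.push(sample)
print(stats.mean, stats.variance())
```

Each sample is folded into the summary and then discarded, which is what makes the approach practical for live data.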

jbay808 · 5y ago

> This is putting the cart before the horse, for sure. The reason why you only know the mean and the variance of your data is because you chose to summarize your data that way.

No, it's not... A Gaussian is the best way to represent your knowledge of a value if you only know the mean and variance of its value.

So if you start with a stack of data and compress it down to a mean and variance, you've discarded most of your knowledge, and are left with a Gaussian as your best guess representation.

Yes, if you were to boil it down to different summary data, like a max and min, you'd end up with a different state of knowledge and a different distribution.

But given a mean and variance, the Gaussian is your best choice of distribution, and not because of the central limit theorem, but because it has maximum entropy on those constraints. You don't always even have access to the source data in the first place, maybe just the summary statistics.
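
To make the maximum entropy claim concrete, here is a small sketch comparing the closed-form differential entropies (in nats) of three distributions that share the same variance. The function names are illustrative assumptions; the formulas are the standard ones for the normal, Laplace, and uniform distributions.

```python
import math


def normal_entropy(sigma):
    # Differential entropy of N(mu, sigma^2).
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)


def laplace_entropy(sigma):
    # Laplace with scale b has variance 2 * b^2, so b = sigma / sqrt(2).
    b = sigma / math.sqrt(2)
    return 1 + math.log(2 * b)


def uniform_entropy(sigma):
    # Uniform of width w has variance w^2 / 12, so w = sigma * sqrt(12).
    return math.log(sigma * math.sqrt(12))


sigma = 1.0
for name, h in [("normal", normal_entropy(sigma)),
                ("laplace", laplace_entropy(sigma)),
                ("uniform", uniform_entropy(sigma))]:
    print(f"{name:8s} {h:.4f}")
```

With the variance held fixed, the normal comes out with the largest entropy of the three, which is a special case of the general result.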

mturmon · 5y ago

I would like to push back against this in favor of the original comment. The context of this remark within the article is:

> I was extremely confused as to why the Normal (Gaussian) Distribution pops up everywhere—in kurtotically-ignorant financial market analysis, in nature, everywhere. Thinking about it, the prevalence of the Gaussian is actually rather abnormal. Can you guess why it’s everywhere?

This is not a "compression of data" question. It's not an "uninformed distributional choice" question.

It's a "why is this distribution prevalent in Nature" question.

In this context, I think the CLT gives a better answer. There are a lot of averaging processes in Nature, and due to the CLT, averaging of independent perturbations must give rise to normal distributions.

It's possible to perhaps go a step deeper than the above. In some physical systems, you can look at the second moment as an energy -- like the voltage-squared in electrical systems.

In this case, due to a-priori finiteness of system energy, the gaussian distribution can make a claim to being "inevitable" by the maxent argument in OP. ("In a system characterized by finite energy E, what is the least informative distributional constraint?")
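
The CLT point above can be illustrated without any sampling. The sketch below builds the exact distribution of a sum of fair dice by repeated convolution and tracks its excess kurtosis, which is 0 for a normal distribution. The choice of dice and the helper names are assumptions made for illustration; excess kurtosis does not change under shifting or rescaling, so the same numbers apply to the average.

```python
def convolve(p, q):
    """PMF of the sum of two independent integer-valued variables."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def excess_kurtosis(pmf):
    """Fourth standardized moment minus 3 (zero for a normal)."""
    mean = sum(k * p for k, p in enumerate(pmf))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
    fourth = sum((k - mean) ** 4 * p for k, p in enumerate(pmf))
    return fourth / var ** 2 - 3.0


die = [1 / 6] * 6  # faces indexed 0..5; the shift does not affect kurtosis


def kurtosis_of_sum(n):
    """Excess kurtosis of the sum of n independent dice."""
    pmf = die
    for _ in range(n - 1):
        pmf = convolve(pmf, die)
    return excess_kurtosis(pmf)


for n in (1, 2, 5, 20):
    print(n, round(kurtosis_of_sum(n), 4))
```

The excess kurtosis shrinks in proportion to 1/n, one simple way to see independent perturbations adding up to something increasingly normal.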

klodolph · 5y ago

> But given a mean and variance, ...

The subject of my comment is, "Why are we given mean and variance?" If you take, "We are given mean and variance" as a presupposition, then you are having a different conversation.

The big problem with the maximum entropy argument is that if you apply some transformation to your data, you will end up with a different maximum entropy distribution. For example, you may choose to express your data in terms of rate (events / time) or period (time / events). Maximum entropy won't help you here; you have to have some kind of theoretical understanding of the underlying process that justifies your choice.

The same is true for normal distributions and mean / variance, but it's such an "obvious" choice that people forget to justify their models. My experience is that the premise of the CLT is much easier to justify, and you can use that to support your use of the normal distribution.

conjectures · 5y ago

> if you only know the mean and variance of its value.

And why are they the only things you know? Most often, because they are the things you asked for previously... rather than, as OP hinted, e.g. some facts about quantiles.

yetanothermonk (OP) · 5y ago

Very good points. The Normal Distribution has very nice properties, which adds to its popularity, which, if I understand you correctly, adds to the significance we place on mean and variance.

adenadel · 5y ago

I wouldn't say that the existence of the normal distribution is necessarily the reason that we place significance on the mean and variance. The mean is incredibly natural to define when you move to measure-theoretic probability (simply the integral of a function, i.e. a random variable, with respect to some measure). When you take this point of view, random variables with particular moments existing are simply functions in L^p. Further, when you move on to proving the CLT, the moments give you properties of the characteristic functions that allow you to prove the CLT. These are all deep connections.

There's another interpretation of the mean (and conditional expectation) as quantities minimizing squared error. It's not surprising that squared error and variance are so similar and that these are connected.

jmoss20 · 5y ago · 5 in thread

Thought experiment: suppose your friend drives 80 miles to visit you. They tell you the trip took between 2 and 4 hours. You have no further information. How confident are you the trip took less than 3 hours?

Now they tell you they maintained a constant speed throughout the trip, a speed somewhere between 20 and 40mph. How confident are you your friend was driving faster than 30mph?

The principle of maximum entropy, applied to each, gives you different answers. P(speed > 30mph) = 0.5 implies an even chance that the trip took less than 2hr40min (80 miles at 30mph), not 3hrs. What gives? Which is the real way we should formulate travel times?

See: https://en.wikipedia.org/wiki/Bertrand_paradox_(probability) Credit for this example: Michael Titelbaum
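
The arithmetic behind the thought experiment can be worked through in a few lines. This sketch assumes the maximum entropy distribution on a bounded interval with no other constraints is the uniform one; the variable names are illustrative.

```python
DISTANCE = 80.0  # miles

# Formulation 1: trip time is uniform on [2, 4] hours.
p_fast_uniform_time = (3.0 - 2.0) / (4.0 - 2.0)

# Formulation 2: speed is uniform on [20, 40] mph.
# The trip took less than 3 hours exactly when speed > 80 / 3 mph.
threshold_speed = DISTANCE / 3.0
p_fast_uniform_speed = (40.0 - threshold_speed) / (40.0 - 20.0)

# Under formulation 2 the median speed is 30 mph, so the median time is:
median_time_uniform_speed = DISTANCE / 30.0  # hours

print(f"P(time < 3h), uniform time:  {p_fast_uniform_time:.4f}")
print(f"P(time < 3h), uniform speed: {p_fast_uniform_speed:.4f}")
print(f"median time, uniform speed:  {median_time_uniform_speed:.4f} h")
```

The same question gets probability 1/2 under one formulation and 2/3 under the other, even though the two descriptions of the trip carry the same information.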

lambdatronics · 5y ago

https://bayes.wustl.edu/etj/articles/ambiguity.pdf

fractionalhare · 5y ago

This paradox is a good motivator for when Bayesian probability is useful. Your confidence is a posterior probability which is conditioned on some prior information. Initially you have little prior information, except for an interval of time and distance. When you receive information about the derivative of speed throughout the trip, this meaningfully updates your priors, and so the posterior changes.

jmoss20 · 5y ago

The upshot here is that choosing the max entropy distribution as your prior isn't enough; you also need to choose some particular way to formulate the problem. Particular formulations (travel time vs. speed, here) imply different max entropy priors, even though the formulations are equivalent. Worse, there are infinitely many equivalent formulations, all with different implied max entropy priors.

You can get around this by choosing a non-max entropy prior, like [1], or by deciding on the One True Formulation for your problem. But (Bayesian) updating on the other formulations of the problem won't do it, because there isn't any information in the other formulations -- they're equivalent (by def).

[1]: https://en.wikipedia.org/wiki/Jeffreys_prior

cttet · 5y ago

There is more information provided when you mention "they maintained a constant speed throughout the trip".

jmoss20 · 5y ago

How would you update on it?

(But you could just as well s/constant speed/average speed/, I don't think it makes a difference.)

ianhorn · 5y ago · 4 in thread

One thing I'd add to this is that this kind of thinking makes your coordinate system really matter.

Consider a measurement of some cubes of uncertain size. You could describe them with their edge length or their volume. Learning one tells you the other. They're equivalent data. However a maximum entropy distribution on one isn't a maximum entropy distribution on the other.

Pragmatically, there's always something you can do (e.g. a Jeffreys prior), but philosophically, this has always made me uneasy with justifications about max entropy that don't also have justifications of the choice of coordinate system.
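
A quick numeric sketch of the cube example. The bounds (edge length between 1 and 2, hence volume between 1 and 8) are assumptions chosen for illustration, and the maximum entropy distribution on each bounded interval is taken to be uniform.

```python
EDGE_MIN, EDGE_MAX = 1.0, 2.0
VOL_MIN, VOL_MAX = EDGE_MIN ** 3, EDGE_MAX ** 3  # 1.0 and 8.0

query_edge = 1.5  # ask: how likely is an edge shorter than 1.5?

# Uniform on edge length.
p_uniform_edge = (query_edge - EDGE_MIN) / (EDGE_MAX - EDGE_MIN)

# Uniform on volume: edge < 1.5 exactly when volume < 1.5 ** 3.
p_uniform_volume = (query_edge ** 3 - VOL_MIN) / (VOL_MAX - VOL_MIN)

print(f"P(edge < 1.5), uniform on edge:   {p_uniform_edge:.4f}")
print(f"P(edge < 1.5), uniform on volume: {p_uniform_volume:.4f}")
```

The two "least informative" choices disagree about the same event, which is the coordinate dependence described above.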

canjobear · 5y ago

This seems important. Is there somewhere where this example is worked out in more detail?

lambdatronics · 5y ago

http://bayes.wustl.edu/etj/articles/prior.pdf

blackbear_ · 5y ago

Could this be solved by putting a max-entropy distribution on their joint probability?

contravariant · 5y ago

Their joint distribution is degenerate, so it becomes a bit unclear how to even define entropy.

Personally I'm of the view that the Kullback-Leibler divergence, which is defined for arbitrary probability measures (with no special treatment for continuous ones) and which is independent of the choice of coordinates, is the true measure of information.

Its downside is that you can only compare 2 distributions that way. For the discrete case you can just pick the uniform distribution as your non-informative base. The issues with the entropy definition for continuous distributions boil down to the problem of picking a uniform distribution for the real numbers.
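
As a small illustration of the coordinate issue, the sketch below uses the standard closed forms for Gaussians: differential entropy shifts when the variable is rescaled (say, a change of units), while the KL divergence between two Gaussians does not. The specific parameters and the scale factor are arbitrary assumptions.

```python
import math


def gaussian_entropy(sigma):
    # Differential entropy of N(mu, sigma^2), in nats.
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)


def gaussian_kl(mu1, sigma1, mu2, sigma2):
    # KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ), in nats.
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
            - 0.5)


scale = 1000.0  # e.g. re-expressing metres as millimetres

h_before = gaussian_entropy(1.0)
h_after = gaussian_entropy(1.0 * scale)

kl_before = gaussian_kl(0.0, 1.0, 1.0, 2.0)
kl_after = gaussian_kl(0.0 * scale, 1.0 * scale, 1.0 * scale, 2.0 * scale)

print(f"entropy shift under rescaling: {h_after - h_before:.4f}")
print(f"KL before: {kl_before:.4f}, KL after: {kl_after:.4f}")
```

The entropy moves by the log of the scale factor, while the divergence between the two distributions is unchanged.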

GolDDranks · 5y ago · 4 in thread

"And if we weigh this by the probability of that particular event happening, we get info ∝ p ⋅ log2(1/p)"

I fail to see the motivation of this step, and I think that's preventing me from seeing the argument as "intuitive". Could somebody explain?

Two steps back (info ∝ 1/p), it still makes sense to me: the more rare the event is, the bigger the resulting number is, so in the case the event happens, the more "surprised" we are, and more information is gained. However, what do we achieve by weighing the bitcount of the information with the probability?

GolDDranks · 5y ago

Ah, I think I got it. The point of the exercise is not to formulate the concept of "amount of surprise (∝ amount of information gained) IN CASE the event happens" but the "EXPECTED amount of entropy gain", for us to know before it happens.

That's why we need to take a middle ground between very common events that aren't surprising, and gain us hardly anything, and rare events that gain us a lot of information, but happen so rarely that they don't matter a whole lot.

The formula derived here manages to find the balance between these two extremes.

mturmon · 5y ago

I would agree w/ your statement here. The entropy is the on-average (or, expected) amount of information gained from seeing one "x".

mturmon · 5y ago

The weighting with probability turns it into an expected value.

Remember that if x is a random variable, then its expected value is

  E[x] = Sum_{All x} x p(x)

The interpretation is that E[x] is the average value of x. In particular, if we observe a bunch of x's one after the other, call them x_i, then the sample mean

  S_n = (1/n) Sum_i x_i

which for a given "n" is random, converges to the deterministic constant E[x] as "n" gets large.

In this sense, the above formula for E[x] is "inevitable".

In the above case, the thing being averaged is "x", but the same holds true for x^2 ("what is the average value of x^2 ?"), or, in general, f(x) for any function f(.).

In this case, we're using the rather unusual function

  f(x) = log_2(1/p(x)),

but the same intuition holds. It's the average number of bits needed to encode x, for instance.
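
The expected-value reading above translates directly into code. This is a minimal sketch for a discrete distribution; the function name `entropy_bits` and the example probabilities are illustrative assumptions.

```python
import math


def entropy_bits(pmf):
    """E[f(x)] with f(x) = log2(1 / p(x)), i.e. Shannon entropy in bits."""
    # Outcomes with zero probability contribute nothing to the average.
    return sum(p * math.log2(1 / p) for p in pmf if p > 0)


print(entropy_bits([0.5, 0.25, 0.25]))         # 1.5
print(entropy_bits([0.25, 0.25, 0.25, 0.25]))  # 2.0
print(entropy_bits([1.0]))                     # 0.0
```

The first case matches a code that spends 1 bit on the likely outcome and 2 bits on each of the other two, for an average of 1.5 bits per symbol.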

GolDDranks · 5y ago

Thanks. I was stuck with the idea that the whole argument was trying to formulate just the "amount of information gained | event happens", when it actually formulated the expected entropy gain.

canjobear · 5y ago · 2 in thread

You can derive these distributions with a lot less algebra by characterizing them with invariances, rather than maximum entropy under constraints.

https://stevefrank.org/reprints-pdf/16Entropy.pdf

yetanothermonk (OP) · 5y ago

Very cool! Thank you for sharing. We can get to these distributions a bunch of ways, and I find that with every additional way of looking at something, you understand it better. Now, I’m about to get nerdsniped by symmetry and invariances.

yetanothermonk (OP) · 5y ago

What’s the next best step or resource for learning about symmetry & invariances?

martopix · 5y ago · 1 in thread

With this method, you can derive all of statistical mechanics from information theory with constraints originating from thermodynamics. The observation of thermodynamic quantities, which are high-level observations on particles (i.e. related to means, etc., and not to individual particles), puts constraints of the same kind as the ones listed in this article. This approach was pioneered by Jaynes (1952) "Information theory and statistical mechanics, I": https://www.semanticscholar.org/paper/Information-Theory-and...

kgwgk · 5y ago

> This approach was pioneered by Jaynes (1952)

minor correction: 1957

This is a more detailed introduction to the subject (from 1962): https://bayes.wustl.edu/etj/articles/brandeis.pdf

carlosf · 5y ago · 1 in thread

> The Normal Distribution is your best guess if you only know the mean and the variance of your data.

That's awful advice for some domains. If your process dynamics are badly behaved (statistically), such as power laws and the like, it turns out the "mean" and "variance" you're calculating from samples are probably rubbish.

Choosing a starting distribution is actually a statement on how you're exposing yourself to risk; there is no such thing as "best guess".

yetanothermonk (OP) · 5y ago

I’m making no statement on what your priors are, just that if you have the mean and variance, the max entropy distribution is the Normal. If you know skew and kurtosis, you’ll pick something else.

jostmey · 5y ago · 1 in thread

I love the article!

My only advice is to end with a list of maximum entropy distributions to showcase the many applications of this theory. I often refer to such tables when I have varying constraints and want the best choice for representing the spread of the data.

See the table in https://en.wikipedia.org/wiki/Maximum_entropy_probability_di...

yetanothermonk (OP) · 5y ago

Thank you very much! Great idea!!

Nesco · 5y ago

This approach can mislead people because, by design, it assumes that the support is infinite and that the variance is finite, which is why it ends up with a thin-tailed distribution in the first place.

Plus, as klodolph said, arbitrarily restricting your knowledge to the mean and the variance as summary statistics will lead to the Gaussian distribution. Moreover, in practice, arbitrarily restricting your knowledge is a violation of probability as a model of intuition, as shown by Jaynes.

abelaer · 5y ago

Logistic regression can actually also be interpreted as a maximum entropy distribution after observing some 'training data'.
