## November 2, 2008

### Maximum Entropy Distributions

#### Posted by David Corfield

While we eagerly await Tom’s second post on Entropy, Diversity and Cardinality, here’s a small puzzle for you to ponder. If all of the members of that family of entropies he told us about are so interesting, why is it that so many of our best loved distributions are maximum entropy distributions (under various constraints) for Shannon entropy?

For example, the normal distribution, $N(\mu, \sigma^2)$, is the maximum Shannon entropy distribution among distributions over the reals with mean $\mu$ and variance $\sigma^2$. And there are others, including the exponential and uniform (here) and the Poisson and binomial (here). So why are there no famous distributions maximising Tsallis or Rényi entropy? (People do look at maximising these, e.g., here.)
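As a quick numerical sanity check (my own illustration, not a proof): among unit-variance distributions on the reals, the normal's differential entropy beats, say, the Laplace and the uniform. The closed-form entropies below are standard; only Python's `math` module is assumed.

```python
import math

# Differential entropies of three unit-variance distributions on the reals
# (standard closed forms).
h_normal  = 0.5 * math.log(2 * math.pi * math.e)   # N(0, 1)
h_laplace = 1 + math.log(2 / math.sqrt(2))         # Laplace with scale b, 2b^2 = 1
h_uniform = math.log(math.sqrt(12))                # Uniform of width sqrt(12)

# The normal should beat the others, as the maximum entropy property predicts.
assert h_normal > h_laplace > h_uniform
```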

Follow up question: has this property of the normal distribution anything to do with the central limit theorem? This is relevant.

Posted at November 2, 2008 5:43 PM UTC

TrackBack URL for this Entry:   https://golem.ph.utexas.edu/cgi-bin/MT-3.0/dxy-tb.fcgi/1840

### Re: Maximum Entropy Distributions

As someone who until recently had no clue about entropy, let alone maximizing it, I’d like to share this. It’s a pleasant account of how, if you wanted to model the behaviour of an expert translator of French (e.g. if you wanted to train a machine to do it?), you’d naturally run into the concept of maximum entropy.

As it says, the maximum entropy method is this:

given a collection of facts, choose a model which is consistent with all the facts, but otherwise as uniform as possible.
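To make that slogan concrete, here is a small sketch (my own, not from the article linked above) of the maximum entropy distribution on a six-sided die with a prescribed mean. The maximizer has the exponential form $p_i \propto e^{\lambda i}$, and $\lambda$ can be found by bisection; the function name `maxent_die` is mine.

```python
import math

def maxent_die(target_mean, faces=6, tol=1e-12):
    """Maximum entropy distribution on {1, ..., faces} with a given mean.

    The maximizer has exponential form p_i proportional to exp(lam * i);
    we solve for lam by bisection, since the mean is increasing in lam."""
    def mean_for(lam):
        w = [math.exp(lam * i) for i in range(1, faces + 1)]
        z = sum(w)
        return sum(i * wi for i, wi in zip(range(1, faces + 1), w)) / z

    lo, hi = -50.0, 50.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean_for(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    w = [math.exp(lam * i) for i in range(1, faces + 1)]
    z = sum(w)
    return [wi / z for wi in w]

# With no real constraint (mean 3.5), "as uniform as possible" is exactly uniform.
p = maxent_die(3.5)
assert all(abs(pi - 1 / 6) < 1e-6 for pi in p)
```

Asking instead for mean 4.5 tilts the distribution exponentially towards the higher faces, which is the classic Jaynes loaded-die example.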

Posted by: Tom Leinster on November 3, 2008 12:19 AM | Permalink | Reply to this

### Re: Maximum Entropy Distributions

The translation example is a nice one.
It does, however, also raise some interesting questions.

The translation of a word or term is, of course, highly dependent on context. This is something that will presumably be picked up from a corpus of examples, but it is clear that there is not even a many-to-many matching between words and phrases across two languages; somehow a local-to-global matching is involved. Taking this into account, some workers in translation and text analysis use methods adapted from compression theory (PPM and related methods). They match at various levels (character, word, phrase, text block, and so on), matching patterns of longer and longer length, and include flags for parsing information. I do not know the theory at all well, but I have discussed it on many occasions with a colleague, Bill Teahan.

My point in mentioning this is that the actual measures these text analysis/categorisation practitioners use are based on (crossed) entropy at the various textual levels. This always seemed to me to be not that distant from the idea of a hierarchical system, and thus from categorical ideas. I know this is vague, and it is not at all clear to me how to approach such a measure in a sensible way. Perhaps Tom has a clearer sense of what might be done.

PS. for PPM see

http://en.wikipedia.org/wiki/Prediction_by_Partial_Matching

and for the entropy link the two papers,

- W.J. Teahan, Probability estimation for PPM.
- T. Schürmann and P. Grassberger, Entropy estimation of symbol sequences, Chaos, Vol. 6, pp. 414–427, September 1996.

Posted by: Tim Porter on November 3, 2008 7:37 AM | Permalink | Reply to this

### Re: Maximum Entropy Distributions

based on (crossed) entropy at the various textual levels

Crossed modules I know, crossed complexes yes, but crossed entropy……???

Posted by: Tom Leinster on November 3, 2008 8:02 AM | Permalink | Reply to this

### Re: Maximum Entropy Distributions

Pinched from

http://en.wikipedia.org/wiki/Cross_entropy

In information theory, the cross entropy between two probability distributions measures the average number of bits needed to identify an event from a set of possibilities, if a coding scheme is used based on a given probability distribution q, rather than the “true” distribution p.

I found the term in one of the PPM-based papers, so am not claiming I understand it!

I do know that in text categorisation they use a predictive algorithm which, given the text so far, predicts the next text block (i.e., character, word, etc., depending on the level at which the analysis is working at the time). My guess is that, as the various levels are distributed differently, the predictive values from the various levels are somehow compared. Again, I do not know this theory, having merely talked with a researcher in an effort to apply a categorical viewpoint.

Posted by: Tim Porter on November 3, 2008 9:18 AM | Permalink | Reply to this

### Re: Maximum Entropy Distributions

So in terms of our discussion, the cross entropy between $P$ and $Q$ is your expected surprise if you believe the world is distributed according to $Q$, but really it is $P$.
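That reading matches the identity $H(P, Q) = H(P) + D(P \| Q)$: the expected surprise under the wrong beliefs $Q$ exceeds the true entropy by exactly the Kullback–Leibler divergence. A small numerical sketch of my own (function names are mine):

```python
import math

def entropy(p):
    """Shannon entropy of a finite distribution (natural log)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Expected surprise under beliefs q when the world really follows p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """Kullback-Leibler divergence D(p || q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]
# Cross entropy = true entropy + KL divergence, so believing q when
# the truth is p always costs extra surprise.
assert abs(cross_entropy(p, q) - (entropy(p) + kl(p, q))) < 1e-12
assert cross_entropy(p, q) >= entropy(p)
```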

Posted by: David Corfield on November 3, 2008 9:42 AM | Permalink | Reply to this

### Re: Maximum Entropy Distributions

I think that is probably a good interpretation.

Posted by: Tim Porter on November 3, 2008 2:06 PM | Permalink | Reply to this

### Re: Maximum Entropy Distributions

I am probably saying something that you already know, so feel free to ignore my comments if you think they are not relevant. :)

Cross entropy is a very important concept in information theory, and some people even think that the maximum entropy principle should be replaced by a maximum cross-entropy principle. You can see a detailed argument for this in this book draft: http://arxiv.org/abs/0808.0012

It is also known as the Kullback–Leibler divergence, a measure of distance between two probability distributions (although it is not symmetric), with the very interesting property that for two parametric distributions with infinitesimally close parameters $p$ and $p + dp$, it becomes the Fisher information metric.
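That last property is easy to check concretely for a Bernoulli family, whose Fisher information is $1/(p(1-p))$: the divergence between nearby parameters is quadratic with exactly that coefficient. A small sketch of my own:

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

p, dp = 0.3, 1e-4
fisher = 1 / (p * (1 - p))   # Fisher information of Bernoulli(p)

# For infinitesimally close parameters the divergence is quadratic,
# with the Fisher information as the metric coefficient:
# D(p || p + dp) = (1/2) * I(p) * dp^2 + O(dp^3).
approx = 0.5 * fisher * dp ** 2
assert abs(kl_bernoulli(p, p + dp) - approx) < 1e-10
```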

I hope this may help somehow…

Posted by: DisorderedBrain on November 5, 2008 3:11 PM | Permalink | Reply to this

### Re: Maximum Entropy Distributions

Tim wrote:

I do know that in text categorisation they use a predictive algorithm…

Let’s apply categorification to categorization, and confuse the world!

Posted by: John Baez on November 3, 2008 10:08 PM | Permalink | Reply to this

### Re: Maximum Entropy Distributions

From my notes I see that I once knew there's a nice geometric interpretation of the distributions that maximise Tsallis entropy here. If you place a uniform distribution on the $n$-sphere of radius $\sqrt{n}$ and project onto $m$ of the co-ordinates, then, as Poincaré had observed, in the limit as $n$ tends to infinity this $m$-vector is distributed according to an $m$-variate Gaussian distribution with unit covariance matrix. These Gaussian distributions are the maximisers of the Shannon entropy with this covariance matrix. Some of the Tsallis MaxEnt distributions come from projecting onto $m$ dimensions of a uniform distribution over an $n$-sphere. You can pick up distributions corresponding to different covariance matrices by looking at uniform distributions over corresponding ellipsoids.
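The limiting observation is easy to simulate (a rough sketch of my own, using only the standard library): draw points uniformly on the sphere of radius $\sqrt{n}$ by normalizing Gaussian vectors, project onto one co-ordinate, and check that the samples look standard normal.

```python
import math
import random

random.seed(0)

def uniform_on_sphere(n, radius):
    """Uniform point on the (n-1)-sphere: normalize a standard Gaussian vector."""
    g = [random.gauss(0, 1) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in g))
    return [radius * x / norm for x in g]

# Project the first co-ordinate of points uniform on the sphere of radius
# sqrt(n); for large n the projection should look like a standard normal.
n, trials = 200, 4000
samples = [uniform_on_sphere(n, math.sqrt(n))[0] for _ in range(trials)]
mean = sum(samples) / trials
var = sum(x * x for x in samples) / trials

# Loose tolerances to allow for Monte Carlo error.
assert abs(mean) < 0.1
assert abs(var - 1) < 0.1
```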

Posted by: David Corfield on November 3, 2008 4:58 PM | Permalink | Reply to this

### Re: Maximum Entropy Distributions

David wrote, challengingly,

If all of the members of that family of entropies he told us about are so interesting […]

If you’re making the sceptical point I think you are, then I probably agree. My belief is that, in some sense, Shannon entropy $H_1$ has prime position among the family $H_\alpha$ ($\alpha \geq 0$) of entropy measures that I discussed. The others are still interesting, but not, I think, as important.

By way of analogy, there are many interesting invariants of topological spaces that can be defined via homology: the rank of the $3$rd homology group, for instance. But in some sense it’s the Euler characteristic that has primacy: it’s the invariant that behaves most like cardinality.

Certainly Shannon entropy has properties not shared by the $\alpha$-entropies $H_\alpha$ for $\alpha \neq 1$. Indeed, Rényi made exactly this point when he introduced $H_\alpha$. In particular, he observed that while $H_\alpha$ shares with $H_1$ the property that the entropy of a product is the sum of the entropies, it does not share the property that the entropy of a convex combination $\lambda \mathbf{p} + (1 - \lambda)\mathbf{q}$ can be expressed in terms of $\lambda$, $H(\mathbf{p})$ and $H(\mathbf{q})$.
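The shared additivity property is easy to verify numerically for an independent product (a sketch of my own; function names are mine):

```python
import math

def renyi(p, alpha):
    """Renyi entropy H_alpha for alpha != 1 (natural log)."""
    return math.log(sum(pi ** alpha for pi in p)) / (1 - alpha)

def product(p, q):
    """Joint distribution of independent p and q."""
    return [pi * qj for pi in p for qj in q]

p, q = [0.6, 0.4], [0.2, 0.3, 0.5]
alpha = 2.0
# The alpha-entropy of an independent product is the sum of the
# alpha-entropies, a property H_alpha shares with Shannon entropy.
assert abs(renyi(product(p, q), alpha) - (renyi(p, alpha) + renyi(q, alpha))) < 1e-12
```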

In Part 2 (coming soon!), I’ll define $\alpha$-entropy, $\alpha$-diversity and $\alpha$-cardinality of finite probability spaces in which the underlying set is equipped with a metric. Curiously, in this extended setting it’s the case $\alpha = 2$ that seems to be best-understood. But I’m sure $\alpha = 1$ must play a special role.

Continuing David’s sentence:

[…] why is it that so many of our best loved distributions are maximum entropy distributions (under various constraints) for Shannon entropy?

A priori, that doesn’t exclude the possibility that they’re also maximum entropy distributions for other entropies $H_\alpha$.

As I understand it, you’re mostly referring to distributions on the real line, and the task is to find the distribution having maximum entropy subject to certain constraints (e.g. ‘mean must be $0$ and standard deviation must be $1$’). But let’s go back to a much simpler case: distributions on a finite set, subject to no constraints. As we saw in Part 1, the distribution with maximum entropy is the uniform distribution — and that’s true for $\alpha$-entropy, no matter what $\alpha$ is. So in this case, the distribution that maximizes the $\alpha$-entropy is the same for all $\alpha$.
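For instance (a sketch of my own, writing $H_\alpha$ in its Rényi form): on four points, the uniform distribution beats a skewed one for every $\alpha$ tried, in line with the claim that the maximizer does not depend on $\alpha$.

```python
import math

def renyi_entropy(p, alpha):
    """Renyi entropy H_alpha; the alpha -> 1 limit is Shannon entropy."""
    if abs(alpha - 1) < 1e-9:
        return -sum(pi * math.log(pi) for pi in p if pi > 0)
    return math.log(sum(pi ** alpha for pi in p)) / (1 - alpha)

n = 4
uniform = [1 / n] * n
skewed = [0.7, 0.1, 0.1, 0.1]

# On a finite set with no constraints, the uniform distribution maximizes
# H_alpha for every alpha; note H_alpha(uniform) = log n independently of alpha.
for alpha in [0.5, 1.0, 2.0, 5.0]:
    assert renyi_entropy(uniform, alpha) > renyi_entropy(skewed, alpha)
    assert abs(renyi_entropy(uniform, alpha) - math.log(n)) < 1e-9
```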

I’ll talk more about maximizing entropy in Part 2. There seem to be some unsolved problems.

Posted by: Tom Leinster on November 3, 2008 11:32 PM | Permalink | Reply to this

### Re: Maximum Entropy Distributions

The Pareto distribution is equivalent to a $q$-exponential, which can be obtained by extremising Tsallis entropy.

Posted by: DisorderedBrain on November 5, 2008 2:23 PM | Permalink | Reply to this

### Re: Maximum Entropy Distributions

Interesting. This is explained for generalized Pareto distributions here.

Posted by: David Corfield on November 5, 2008 2:53 PM | Permalink | Reply to this

### Re: Maximum Entropy Distributions

Extremizing a Rényi entropy can be used to yield a power-law distribution, if one breaks out the Lagrange multipliers and maximizes a functional with something like a fixed-average-energy condition,

$L_q(p) = \frac{1}{1-q}\log\sum_i p_i^q -\alpha\sum_i p_i - \beta\sum_i p_i E_i.$

See arXiv:cond-mat/0402404, for example. Under a different condition, when a covariance matrix is specified, maximizing the Rényi entropy gives a Student’s-$t$ distribution (arXiv:math.PR/0507400).
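For what it's worth, the power law can be read off directly from the stationarity condition (a sketch, suppressing the interplay with the normalization constraint): setting $\partial L_q / \partial p_i = 0$ gives

$\frac{q \, p_i^{q-1}}{(1-q)\sum_j p_j^q} = \alpha + \beta E_i,$

so that $p_i \propto (\alpha + \beta E_i)^{1/(q-1)}$, a power law in the energy $E_i$.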

Posted by: Blake Stacey on November 8, 2008 12:28 AM | Permalink | Reply to this

### Re: Maximum Entropy Distributions

Regarding David’s last question, this is probably the best answer.

Regarding the observation typically ascribed to Poincaré which David mentioned here, the history seems a little murky. I haven't tried checking the original sources myself, but this paper, which proves a stronger version of the statement, claims that it can't be found in Poincaré's writings.

Posted by: Mark Meckes on November 5, 2008 8:13 PM | Permalink | Reply to this
Read the post Entropy, Diversity and Cardinality (Part 2)
Weblog: The n-Category Café
Excerpt: Continuation of a discussion of entropy and cardinality and its relevance to ecology
Tracked: November 7, 2008 12:58 PM

### Re: Maximum Entropy Distributions

Here’s a guess.

Shannon entropy can be derived as the limit of a distribution’s multiplicity as the number of trials increases to infinity.

http://en.wikipedia.org/wiki/Maximum_entropy#The_Wallis_derivation

Out of all the distributions, you are choosing the one that can be realised in the greatest number of ways given the constraints. It’s like the mean of all possible distributions.

To understand exactly what the other distributions are maximizing, one should look at their discrete forms before the limit is taken.
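The Wallis picture is easy to check numerically (a sketch of my own): the per-trial log-multiplicity of a distribution's type class converges to its Shannon entropy.

```python
import math

def log_multiplicity_rate(p, N):
    """(1/N) * log of the multinomial coefficient N! / prod((N p_i)!),
    i.e. the per-trial log of the number of sequences of length N whose
    empirical distribution is p (counts rounded to integers)."""
    counts = [round(N * pi) for pi in p]
    log_mult = math.lgamma(N + 1) - sum(math.lgamma(c + 1) for c in counts)
    return log_mult / N

def shannon(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

p = [0.5, 0.25, 0.25]
# As N grows, the per-trial log-multiplicity approaches the Shannon entropy,
# which is the content of the Wallis derivation.
assert abs(log_multiplicity_rate(p, 100000) - shannon(p)) < 1e-3
```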

Posted by: Jonathan Fischoff on November 17, 2008 8:43 PM | Permalink | Reply to this

### Re: Maximum Entropy Distributions

Whoops. I meant to say “mode” where I said “mean” above. My bad.

Posted by: Jonathan Fischoff on November 18, 2008 9:27 PM | Permalink | Reply to this

### Re: Maximum Entropy Distributions

A paper on Maximum Entropy on Compact Groups.

On a compact group, the Haar probability measure plays the role of the uniform distribution. The entropy and rate distortion theory for this uniform distribution is studied. New results and simplified proofs on convergence of convolutions on compact groups are presented; they can be formulated as saying that entropy increases to its maximum. Information-theoretic techniques and Markov chains play a crucial role. The rate of convergence is shown to be exponential. The results are also formulated via rate distortion functions.

Posted by: David Corfield on January 5, 2009 11:30 AM | Permalink | Reply to this

Post a New Comment