## November 7, 2008

### Entropy, Diversity and Cardinality (Part 2)

#### Posted by David Corfield Guest post by Tom Leinster

What happened last time?

Ecologically, I explained some ways to measure the diversity of an ecosystem — but only taking into account abundance (how many of each species there are), not similarity (how much the various species have in common). Mathematically, I explained how to assign a number to a finite probability space, measuring how uniform the distribution is.

What’s happening now?

Ecologically, we’ll see how to measure diversity in a way that takes into account both abundance and similarity. So, for instance, an ecosystem consisting of three species of eel in equal numbers will count as less diverse than one consisting of equal numbers of eels, eagles and earwigs. Mathematically, we’ll see how to assign a number to any finite ‘probability metric space’ (a set equipped with both a probability distribution and a metric).

By thinking about diversity, we’ll be led to the concept of the cardinality of a metric space. (Some ecologists got there years ago.) And by building on progress made at this blog, we’ll solve some open problems posed in the ecology literature.

Previously

In the first post, we modelled an ecosystem as a ‘finite probability space’ — that is, a finite sequence $(p_1, \ldots, p_n)$ of non-negative reals summing to $1$, with $p_i$ representing the abundance of the $i$th species. The post was all about ways of measuring such systems. We found three infinite families of measures, each indexed by a real parameter $\alpha \geq 0$; true to the title, they were the ‘$\alpha$-entropy’, ‘$\alpha$-diversity’ and ‘$\alpha$-cardinality’.

In a sense it doesn’t matter which family you use, since each determines the others. For instance, if you tell me what the $3.7$-entropy of your ecosystem is, I can tell you what its $3.7$-diversity and $3.7$-cardinality are. But sometimes the context makes one more convenient than the others.

I’ll quickly summarize the definitions. There were lots of formulas last time, but in fact everything follows from this one alone: for each $\alpha \geq 0$, the ‘surprise function’ $\sigma_\alpha: [0, 1] \to [0, \infty]$ is defined by $\sigma_\alpha(p) = \frac{1}{\alpha - 1} (1 - p^{\alpha - 1}).$ Here I’m assuming that $\alpha \neq 1$. I’ll come back to $\alpha = 1$ later.

The $\alpha$-diversity of a finite probability space $A = (p_1, \ldots, p_n)$ is the expected surprise, $D_\alpha(A) = \sum_{i = 1}^n p_i \sigma_\alpha(p_i).$ It takes its minimum value, $0$, when the distribution is concentrated at one point (that is, some $p_i$ is $1$), and its maximum value, $\sigma_\alpha(1/n)$, when the distribution is uniform ($p_1 = \cdots = p_n = 1/n$).

The $\alpha$-cardinality is defined in terms of the $\alpha$-diversity in such a way that its minimum value is $1$ (‘essentially just one species’) and its maximum is $n$ (‘all species fully represented’). This pretty much forces us to define the $\alpha$-cardinality of $A$ by $|A|_\alpha = 1/\sigma_\alpha^{-1} (D_\alpha(A)).$

Finally, the $\alpha$-entropy is just the logarithm of the $\alpha$-cardinality: $H_\alpha(A) = \log |A|_\alpha.$

To handle $\alpha = 1$, take limits: $\sigma_1(p) = \lim_{\alpha \to 1} \sigma_\alpha (p)$, $D_1(A) = \lim_{\alpha \to 1} D_\alpha (A)$, etc. With a bit of care, you can handle $\alpha = \infty$ similarly.

I’ve said all that in as formula-free a way as I can. But from what I’ve said, you can easily derive explicit formulas for diversity, cardinality and entropy, and work out the especially interesting cases $\alpha = 0, 1, 2, \infty$. For ease of reference, here it all is. $\begin{matrix}   \\   &   & \alpha-diversity,  D_\alpha &   & \alpha-cardinality,  |  |_\alpha &   & \alpha-entropy,  H_\alpha \\   \\ General \alpha \neq 1, \infty && \frac{1}{\alpha - 1} \left( 1 - \sum p_i^\alpha \right) && \left( \sum p_i^\alpha \right)^{1/(1 - \alpha)} && \frac{1}{1 - \alpha} \log \sum p_i^\alpha \\   \\ \alpha = 0 && n - 1 && n && \log n \\   \\ \alpha = 1 && -\sum p_i \log p_i && \prod p_i^{-p_i} && -\sum p_i \log p_i \\   && (Shannon entropy)&& (cardinality)&& (Shannon entropy)\\   \\ \alpha = 2 && 1 - \sum p_i^2 && 1/\sum p_i^2 && - \log \sum p_i^2 \\   && (Simpson diversity)&& (reciprocal Simpson)&&   \\   \\ \alpha = \infty && 0 && 1/\max p_i && -\log \max p_i \\ && && (reciprocal Berger–Parker)&& \\   \\ Minimum value && 0 && 1 && 0 \\ Maximum value && \sigma_\alpha(1/n) && n && \log n \end{matrix}$

Let me draw your attention to two important properties of cardinality:

• Multiplicativity  $|A \times B|_\alpha = |A|_\alpha \times |B|_\alpha$ for all $\alpha \geq 0$ and probability spaces $A$ and $B$.
• Invariant max and min  For most probability spaces $A$, the cardinality $|A|_\alpha$ changes with $\alpha$. Let’s call $A$ invariant if, on the contrary, $|A|_\alpha$ is independent of $\alpha \in (0, \infty)$.

Fix a cardinality $n \geq 1$ for our underlying set. Then for all $\alpha$, the maximum value of $|A|_\alpha$ is $n$, and for all $\alpha$, this is attained at the uniform distribution $A = (1/n, \ldots, 1/n)$. In particular, the uniform distribution is invariant. Similarly, for all $\alpha$, the minimum value of $|A|_\alpha$ is $1$, and for all $\alpha$, this is attained at any $A$ of the form $(0, \ldots, 0, 1, 0, \ldots, 0)$. In particular, $(0, \ldots, 0, 1, 0, \ldots, 0)$ is invariant.

In fact, the invariant probability spaces can be neatly classified:

Theorem  Let $n \geq 1$. Let $I$ be a nonempty subset of $\{1, \ldots, n\}$, and define $(p_1, \ldots, p_n)$ by $p_i = \left\{ \begin{matrix} 1/|I| &if  i \in I \\ 0 &otherwise. \end{matrix} \right.$ Then $(p_1, \ldots, p_n)$ is invariant. Moreover, every invariant probability space arises in this way.

The two cases just mentioned are where $I$ has $1$ or $n$ elements.

The cardinality of metric spaces: a lightning introduction

We’ll use the concept of the ‘cardinality’ of a finite metric space (which is not the same thing as the cardinality of its set of points!) Here I’ll tell you everything you need to know about it in order to follow this post. If you want more, try these slides; skip the first few if you don’t know any category theory. If you want more still, try the discussion here at the Café.

Here goes. Let $A$ be a finite metric space with points $a_1, \ldots, a_n$; write $d_{i j}$ for the distance from $a_i$ to $a_j$. (We allow $d_{i j} = \infty$.) Let $Z$ be the $n \times n$ matrix with $Z_{i j} = e^{-d_{i j}}$. A weighting on $A$ is a column vector $\mathbf{w}$ such that $Z \mathbf{w} = \left( \begin{matrix} 1 \\ \vdots \\ 1 \end{matrix} \right).$ The cardinality of the metric space $A$, written $|A|_{MS}$, is the sum $\sum_{i = 1}^n w_i$ of the entries of $\mathbf{w}$. This is independent of the choice of weighting $\mathbf{w}$, so cardinality is defined as long as at least one weighting exists.

Examples  If $A$ is empty then $|A|_{MS} = 0$. If $A$ has just one point then $|A|_{MS} = 1$. If $A$ has two points, distance $d_{1 2} = d$ apart, then $|A|_{MS} = 1 + \tanh(d).$ So when $d = 0$ we have $|A|_{MS} = 1$; think of this as saying ‘there is effectively only one point’. As the points move further apart, $|A|_{MS}$ increases, until when $d = \infty$ we have $|A|_{MS} = 2$; think of this as saying ‘there are two completely distinguished points’. More generally, if $A$ is an $n$-point space with $d_{i j} = \infty$ whenever $i \neq j$, then $|A|_{MS} = n$. You can think of cardinality as the ‘effective number of points’.

Usually $Z$ is invertible. In that case there is a unique weighting, and $|A|_{MS}$ is the sum of all $n^2$ entries of $Z^{-1}$. There are, however, examples of finite metric spaces for which no weighting exists; then the cardinality is undefined.

Notes for the learned  1. Previously I’ve used $e^{-2d_{i j}}$ rather than $e^{-d_{i j}}$. The difference amounts to rescaling the metric by a factor of $2$, so is fairly harmless. In the present context the reason for inserting the $2$ seems not to be very relevant, so I’ve dropped it.

2. I’m writing $|A|_{MS}$, rather than the customary $|A|$, to emphasize that it’s the cardinality of $A$ as a metric space. It’s not, in general, the same as the cardinality of the underlying set, or the cardinality (Euler characteristic) of the underlying topological space. (Moral: forgetful functors don’t preserve cardinality.) This will be important later.

3. Lawvere has argued that the symmetry axiom should be dropped from the definition of metric space. I’m assuming throughout this post that our metric spaces are symmetric, but I think everything can be done without the symmetry axiom.

Outline

We now refine our model of an ecosystem by taking similarity between species into account. A (finite) probability metric space is a finite probability space $(p_1, \ldots, p_n)$ together with a metric $d = (d_{i j})$ on $\{1, \ldots, n\}$. I’ll often denote a probability metric space by $A = (p_1, \ldots, p_n; d)$.

The distances $d_{i j}$ are to be interpreted as a measure of the dissimilarity between species. There’s a biological decision to be made as to how exactly to quantify it. It could be percentage genetic difference, or difference in toe length, or how far up the evolutionary tree you have to go before you find a common ancestor, ….. Anyway, I’ll assume that the decision has been made.

Note that in the definition of probability metric space, there’s no axiom saying that the probability and metric structure have to be compatible in any way. This reflects ecological reality: just because your community contains three-toed sloths, there’s no guarantee that you’ll also find two-toed sloths.

Example  In 1875, Alfred Russel Wallace described a journey through a tropical forest:

If the traveller notices a particular species and wishes to find more like it, he may often turn his eyes in vain in every direction. Trees of varied forms, dimensions, and colours are around him, but he rarely sees any one of them repeated. Time after time he goes towards a tree which looks like the one he seeks, but a closer examination proves it to be distinct. He may at length, perhaps, meet with a second specimen half a mile off, or may fail altogether, till on another occasion he stumbles on one by accident.

(Tropical Nature and Other Essays, quoted in the paper of Patil and Taillie cited in Part 1.) Wallace’s description tells us three things about the ecosystem: it contains many species, most of them are rare, and there are many close resemblances between them. In our terminology, $n$ is large, most $p_i$s are small, and many $d_{i j}$s are small.

An important player in this drama is the quantity $\sum_{j = 1}^n e^{-d_{i j}} p_j$ associated to each point $i$ of a probability metric space. You can interpret it in a couple of ways:

• ‘How much stuff there is near $i$’.  Think of the probability metric space as a metric space with a certain amount $p_i$ of ‘stuff’ piled up at each point $i$. (It’s like our ClustrMap.) Then think of $\sum_j e^{-d_{i j}} p_j$ as a sum of $p_j$s weighted by the $e^{-d_{i j}}$s. The point $i$ itself contributes the full measure $p_i$ of its stuff; a nearby point $j$ contributes most of its measure $p_j$; a far-away point contributes almost nothing.
• ‘Mean closeness to $i$’.  While $d_{i j}$ measures distance, $e^{-d_{i j}}$ measures closeness, on a scale of $0$ to $1$: for points close together it’s nearly $1$, and for points far apart it’s close to $0$. So $\sum_j e^{-d_{i j}} p_j$ is the mean closeness to point $i$.

When we’re thinking ecologically, we might say similarity instead of closeness, so that identical species have similarity $1$ and utterly different species have similarity $0$. Then $\sum_j e^{-d_{i j}} p_j$ is the expected similarity between an individual of species $i$ and an individual chosen at random. So even if species $i$ itself is rare, a great abundance of species similar to $i$ will ensure that $\sum_j e^{-d_{i j}} p_j$ is high.

I’ll call $\sum_j e^{-d_{i j}} p_j$ the density at $i$. The second interpretation makes it clear that density is always in $[0, 1]$.

The metric aspect of a probability metric space is encapsulated in the $n \times n$ matrix $Z = (e^{-d_{i j}})_{i, j}$ mentioned earlier. (You could call $Z$ the similarity matrix.) The probabilistic aspect is encapsulated in the $n$-dimensional column vector $\mathbf{p} = (p_1, \ldots, p_n)$. Their product $Z\mathbf{p}$ encapsulates density: it is the column vector whose $i$th entry is the density at $i$.

In what sense are we generalizing Part 1? Well, we’ve replaced sets by metric spaces, so we need to know how metric spaces generalize sets. This works as follows: given a set $A$, the discrete metric space on $A$ is the metric space in which the points are the points of $A$ and the distance between any two distinct points is $\infty$. This construction is left adjoint to the forgetful functor from (metric spaces and distance-decreasing maps) to sets.

If $A$ is discrete then the similarity matrix $Z$ is the identity, so $Z\mathbf{p} = \mathbf{p}$, i.e. $\sum_j e^{-d_{i j}} p_j = p_i$ for all $i$. Ecologically, discreteness means that distinct species are regarded as completely different: we’re incapable of seeing the resemblance between butterflies and moths.

The plan  We’re going to define, for each $\alpha \geq 0$, the $\alpha$-diversity, $\alpha$-cardinality and $\alpha$-entropy of any probability metric space. We’ll do this by taking the definitions for probability spaces without metrics (Part 1) and judiciously replacing certain occurrences of ‘$p_i$’ by ‘$(Z\mathbf{p})_i$’.

Then things get interesting. For a fixed set, it was easy to find the probability distribution that maximized the cardinality (or equivalently, the diversity or entropy). For a fixed metric space, it’s not so obvious… but the answer has something to do with the cardinality of the metric space.

We’ll then see how to apply what we know about cardinality of metric spaces to some problems in the ecological literature, and finally, we’ll see how the cardinality of metric spaces was first discovered in ecology.

Surprise, revisited

Imagine you’re taking samples from an ecosystem. So far you’ve found one species of toad, two species of frog, and three species of newt. How surprised would you be if the next individual you found was a frog of a different species? Not very surprised, because although it would be the first time you’d seen that particular species in your ecosystem, you’d already seen lots of similar species. You’d have been much more surprised if, for instance, you’d found a penguin.

Imagine you’re studying the vocabulary used by a particular novelist. You’re tediously going through his work, listing all the words he’s used. So far your data indicates that he writes rather formally, favouring long, Latinate words and avoiding slang. How surprised would you be if the next word you found was ‘erubescent’? Not very surprised, even if it was the first time you’d ever come across the word, because you’d already seen lots of similar words in his books. You’d have been much more surprised if the next word had been ‘bummer’.

Both scenarios show that your surprise at an event depends not only on the (perceived) probability of the event itself, but also on the probability of similar events. We’ll interpret this mathematically as follows: instead of our surprise at finding species $i$ being a function of the probability, $p_i$, it will be a function of the density $\sum_j e^{-d_{i j}} p_j = (Z\mathbf{p})_i$.

Recall from Part 1 that a surprise function is a decreasing function $\sigma: [0, 1] \to [0, \infty]$ with $\sigma(1) = 0$. Any surprise function $\sigma$ gives rise to a diversity measure on finite probability metric spaces: the diversity of $A = (p_1, \ldots, p_n; d)$ is the expected surprise, $\sum_{i = 1}^n p_i \sigma((Z\mathbf{p})_i).$ This is just the same as the formula in Part 1 except that $\sigma(p_i)$ has been changed to $\sigma((Z\mathbf{p})_i)$. In the degenerate case of a discrete metric space, where we’re blind to the similarities between species, it’s no change at all.

With almost no further thought, we can now extend the concepts of $\alpha$-diversity, $\alpha$-cardinality and $\alpha$-entropy from probability spaces to probability metric spaces.

Diversity

Let $\alpha \geq 0$. The $\alpha$-diversity of a probability metric space $A = (p_1, \ldots, p_n; d)$ is $D_\alpha(A) = \sum_{i = 1}^n p_i \sigma_\alpha((Z\mathbf{p})_i) = \left\{ \begin{matrix} \frac{1}{\alpha - 1} \left(1 - \sum p_i (Z\mathbf{p})_i^{\alpha - 1} \right) & if \alpha \neq 1 \\ - \sum p_i \log (Z\mathbf{p})_i & if \alpha = 1. \end{matrix} \right.$ In the case of a discrete metric space, this reduces to what we called $\alpha$-diversity in Part 1.

I’ll run through the usual examples, $\alpha \in \{0, 1, 2\}$. It seems that the easiest to understand is $\alpha = 2$, so I’ll start with that.

Example: $\alpha = 2$  We have $D_2(A) = 1 - \sum_{i, j} p_i e^{-d_{i j}} p_j = \mathbf{p}^t \Delta \mathbf{p},$ where $\mathbf{p}^t$ is the transpose of the column vector $\mathbf{p}$ and $\Delta$ is the $n\times n$ matrix given by $\Delta_{i j} = 1 - e^{-d_{i j}}$. How can we understand this? Well, $e^{-d_{i j}}$ is what I called the similarity between species $i$ and $j$, so $\Delta_{i j}$ might be called the difference. It’s an increasing function of $d_{i j}$, measured on a scale of $0$ to $1$. Choose two individuals at random: then $D_2(A)$ is the expected difference between them.

This quantity $D_2(A)$ is known as Rao’s quadratic diversity index, or Rao’s quadratic entropy, named after the eminent statistician C.R. Rao. In its favour are the straightforward probabilistic interpretation and the mathematically useful fact that it’s a quadratic form.

In the degenerate case of a discrete metric space, $D_2$ is Simpson’s diversity index, described in Part 1.

Example: $\alpha = 1$  We have $D_1(A) = - \sum_{i = 1}^n p_i \log(Z\mathbf{p})_i.$ In the discrete case, $D_1$ is Shannon entropy. In the general case, it would be the cross entropy of $\mathbf{p}$ and $Z\mathbf{p}$, except that cross entropy is usually only defined when both arguments are probability distributions, and this isn’t the case for our second argument, $Z\mathbf{p}$.

Example: $\alpha = 0$  Even $0$-diversity is nontrivial: $D_0(A) = \sum_{i = 1}^n \frac{p_i}{(Z\mathbf{p})_i} - 1.$ In the discrete case, $D_0(A)$ is $n - 1$, the ‘species richness’. In general, the $0$-diversity is highest when, for many $i$, the density at $i$ is not much greater than the probability at $i$ — that is, there are not many individuals of species similar to the $i$th. This fits sensibly with our intuitive notion of diversity.

The family $(D_\alpha)_{\alpha \geq 0}$ of diversity measures was introduced here:

Carlo Ricotta, Laszlo Szeidl, Towards a unifying approach to diversity measures: Bridging the gap between the Shannon entropy and Rao’s quadratic index, Theoretical Population Biology 70 (2006), 237–243

…almost. Actually, there’s a subtle difference.

Aside: what Ricotta and Szeidl actually do  Near the beginning of their paper, Ricotta and Szeidl make a convention: that in their metric spaces of species, all distances will be $\leq 1$. This may seem innocuous, if mathematically unnatural, but turns out to be rather significant.

Motivated by ideas about conflict between species, they define, for each $\alpha \geq 0$, a diversity measure for probability metric spaces. At first glance the formula seems different from ours. However, it becomes the same if we rescale the range of allowed distances from $[0, 1]$ to $[0, \infty]$ via the order-preserving bijection $\begin{matrix} [0, \infty] &\leftrightarrow &[0, 1] \\ d &\leftrightarrow &1 - e^{-d}. \end{matrix}$ We’ve already seen the formula $1 - e^{-d}$: earlier, we wrote $\Delta_{i j} = 1 - e^{-d_{i j}}$ and called it the difference between species $i$ and $j$. In other words, the ‘distance’ in Ricotta and Szeidl’s metric space is what I’d call the ‘difference’. It’s easy to show that if the distances $d_{i j}$ define a metric on $\{1, \ldots, n\}$ then so do the differences $\Delta_{i j}$ (but not conversely). The presence of two metrics on the same set can be slightly confusing.

Cardinality and entropy

In Part 1 we applied an invertible transformation to rescale $\alpha$-diversity into something more mathematically convenient, $\alpha$-cardinality. We’ll now apply the same rescaling to obtain a definition of $\alpha$-cardinality in our more sophisticated, metric context. And, as before, we’ll call its logarithm ‘$\alpha$-entropy’.

So, let $\alpha \geq 0$ and let $A = (p_1, \ldots, p_n; d)$ be a probability metric space. Its $\alpha$-cardinality is $|A|_\alpha = 1/\sigma_\alpha^{-1}(D_\alpha(A)) =   \left\{ \begin{matrix} \left( \sum_{i = 1}^n p_i (Z\mathbf{p})_i^{\alpha - 1} \right)^{\frac{1}{1 - \alpha}} & if \alpha \neq 1 \\ \prod_{i = 1}^n (Z\mathbf{p})_i^{-p_i} & if \alpha = 1. \end{matrix} \right.$ Its $\alpha$-entropy is $H_\alpha(A) =   \left\{ \begin{matrix} \frac{1}{1 - \alpha} \log \left(\sum p_i (Z\mathbf{p})_i^{\alpha - 1}\right) & if \alpha \neq 1 \\ - \sum p_i \log (Z\mathbf{p})_i & if \alpha = 1. \end{matrix} \right.$ The $\alpha$-cardinality is always a real number $\geq 1$; the $\alpha$-diversity and $\alpha$-entropy are always real numbers $\geq 0$.

Example: $\alpha = 0$  The $0$-cardinality of a probability metric space is $\sum p_i/(Z\mathbf{p})_i$.

Example: $\alpha = 1$  As in the discrete case, the $1$-cardinality has a special property not shared by the other $\alpha$-cardinalities: that the cardinality of a convex combination $\lambda A + (1 - \lambda)B$ of probability metric spaces is determined by $\lambda$ and the cardinalities of $A$ and $B$. I’ll leave the details to you, curious reader.

Example: $\alpha = 2$  This is the reciprocal of a quadratic form: $|A|_2 = 1/ \mathbf{p}^t Z \mathbf{p}.$

Example: $\alpha = \infty$  The limit $\lim_{\alpha\to\infty} |A|_\alpha$ exists for every probability metric space $A$; its value, the $\infty$-cardinality of $A$, is $|A|_\infty = 1/\max_i (Z\mathbf{p})_i.$

All the $\alpha$-cardinalities are multiplicative, in the following sense. Any two probability metric spaces $A = (p_1, \ldots, p_n; d)$, $B = (q_1, \ldots, q_m; e)$ have a ‘product’ $A \otimes B$. The underlying set is $\{1, \ldots, n\} \times \{1, \ldots, m\}$. The probability of event $(i, j)$ is $p_i q_j$. The metric is the ‘$d_1$ metric’: if $i, i' \in \{1, \ldots, n\}$ and $j, j' \in \{1, \ldots, m\}$ then the distance from $(i, i')$ to $(j, j')$ is $d_{i i'} + e_{j j'}$. (Viewing metric spaces as enriched categories, this is the usual tensor product.) The multiplicativity result is that for all $\alpha$, $A$ and $B$, $|A \otimes B|_\alpha = |A|_\alpha \times |B|_\alpha.$ Equivalently, $H_\alpha$ is log-like: $H_\alpha(A \otimes B) = H_\alpha(A) + H_\alpha(B)$.

Maximizing diversity

When a species becomes extinct, biodiversity is reduced. Less drastically, diversity decreases when a species becomes rare. Conversely, a highly diverse ecosystem is one containing a wide variety of species, all well-represented, with no one species dominant.

Now suppose you’re in charge of a small ecosystem. The set of species is fixed, but you can control the abundance of each species. What should you do to maximize diversity?

Before we seek a mathematical answer, let’s try to get a feel for it. If there are only two species, the answer is immediate: you want them evenly balanced, $50$$50$. What if there are three? In the absence of further information, you’d want $33\frac{1}{3}$% of each. But let’s suppose that two of them are rather similar — say, Persian wheat and Polish wheat — and one is relatively different — say, poppies. Then the $33\frac{1}{3}$$33\frac{1}{3}$$33\frac{1}{3}$ split would mean that your system was dominated two-to-one by wheat, and that doesn’t seem like maximal diversity. On the other hand, if you regarded the two types of wheat as essentially the same, you’d want to split it $25$$25$$50$. And if you acknowledged that the two wheats were different, but not very different, you might choose a compromise such as $30$$30$$40$.

Here’s the mathematical question. Fix a finite metric space $A = (\{1, \ldots, n\}, d)$.

Question  Let $\alpha \geq 0$. What probability distribution on $A$ maximizes the $\alpha$-cardinality, and what is the maximum value?

It doesn’t matter whether the question is posed in terms of $\alpha$-entropy, $\alpha$-diversity or $\alpha$-cardinality, since the three are related by invertible, order-preserving transformations. (For more on maximizing entropy, see David’s post.) And, incidentally, the analogous question for minimum values is easy: the minimum value is $1$, attained when some $p_i$ is $1$.

Rough Answer  Often, the $\alpha$-cardinality is maximized by taking the probability distribution $(w_1/|A|_{MS}, \ldots, w_n/|A|_{MS})$, for any weighting $\mathbf{w}$ on $A$, and the maximum value is the cardinality $|A|_{MS}$ of the metric space $A$.

Thus — finally, as promised! — we see how the concept of the cardinality of a metric space comes naturally from ecological considerations.

• It’s certainly not true in complete generality — in fact, it doesn’t even make sense in complete generality, since not every finite metric space has a well-defined cardinality.
• It’s not always true even when $A$ does have well-defined cardinality, since there are some metric spaces $A$ for which $|A|_{MS} < 1$, and $1$ is the minimum value.
• Furthermore, there are metric spaces $A$ for which there is a unique weighting $\mathbf{w}$, but $w_i/|A|_{MS}  <  0$ for some values of $i$. Then the ‘probability distribution’ in the Rough Answer is not a probability distribution at all.
• And even when $|A|_{MS}$ is defined and $\geq 1$, even when $A$ has a unique weighting $\mathbf{w}$ with $w_i \geq 0$ for all $i$, I don’t know that the Rough Answer is necessarily correct.

I’ll give a series of partial justifications. Immediately, the Rough Answer is correct when $A$ is discrete: then $|A|_{MS}$ is $n$ (the cardinality of the underlying set), the unique weighting is $(1, \ldots, 1)$, and the probability distribution mentioned is the uniform distribution.

Notation  Since we’ve fixed a metric space $A = (\{ 1,\ldots, n\}, d)$ and we’re interested in a varying probability distribution $\mathbf{p} = (p_1, \ldots, p_n)$ on $A$, from now on I’ll write the $\alpha$-cardinality of the probability metric space defined by $A$ and $\mathbf{p}$ as $|\mathbf{p}|_\alpha$.

Fact 1  Let $\mathbf{w}$ be a weighting on $A$ and write $\mathbf{p} = \mathbf{w}/|A|_{MS}$. Then $|\mathbf{p}|_\alpha = |A|_{MS}$, for all $\alpha \geq 0$.

(In order for $\mathbf{p}$ to be a probability distribution, we need all the weights $w_i$ to be non-negative; but even if they’re not, we can still define $|\mathbf{p}|_\alpha$ and the result still holds.)

This is a simple calculation. In particular, the probability distribution $\mathbf{p} = \mathbf{w}/|A|_{MS}$ is invariant: its $\alpha$-cardinality is the same for all $\alpha \in (0, \infty)$. This is a very special property:

Fact 2  Let $I$ be a nonempty subset of $A$ and $\mathbf{w}$ a weighting on $I$ such that $w_i \geq 0$ for all $i$; define a probability distribution $\mathbf{p}$ on $A$ by $p_i = \left\{ \begin{matrix} w_i/|I|_{MS} &if  i \in I \\ 0 &otherwise. \end{matrix} \right.$ Then $\mathbf{p}$ is invariant. Moreover, every invariant probability distribution on $A$ arises in this way.

Given the special case of discrete metric spaces (the Theorem at the end of ‘Previously’), it seems reasonable to view the distribution $\mathbf{w}/|A|_{MS}$ on a metric space as the analogue of the uniform distribution on a set.

Fact 3  Let $\mathbf{w}$ be a weighting on $A$. Then for each $\alpha \geq 0$, the function $\begin{matrix} \{\mathbf{p} \in \mathbb{R}^n: \sum p_i = 1\} & \to & \mathbb{R} \\ \mathbf{p} &\mapsto &|\mathbf{p}|_\alpha \end{matrix}$ has a stationary point (critical point) at $\mathbf{p} = \mathbf{w}/|A|_{MS}$.

Again, this holds for all $\alpha$. I do not know under precisely what circumstances this stationary point is a local maximum, or a global maximum, or whether every stationary point is of this form. The best result I know is this:

Fact 4  Suppose that the similarity matrix $Z$ of $A$ is positive-definite, and that $\alpha \geq 2$. Let $\mathbf{w}$ be the unique weighting on $A$, and suppose that all the weights $w_i$ are non-negative. Then the Rough Answer is correct: the maximum value of $|\mathbf{p}|_\alpha$ is $|A|_{MS}$, attained when $\mathbf{p} = \mathbf{w}/|A|_{MS}$.

For example, suppose that $A$ has only three points, as in the wheat and poppies example. Then $Z$ is always positive-definite and the weights are always non-negative, so the $\alpha$-diversity is maximized by growing the three species in proportion to their weights — no matter what value of $\alpha \geq 2$ we choose. Suppose, for instance, that the two species of wheat are distance $1$ apart, and both are distance $10$ from the poppy species. Then a few calculations with a $3\times 3$ matrix tell us that diversity is maximized by growing $29.7$% Persian wheat, $29.7$% Polish wheat and $40.6$% poppies. This fits with our earlier intuitions.

Further partial results along these lines are in

S. Pavoine, S. Ollier, D. Pontier, Measuring diversity from dissimilarities with Rao’s quadratic entropy: Are any dissimilarities suitable? Theoretical Population Biology 67 (2005), 231–239.

(Again, there doesn’t seem to be a free copy online. Apparently it’s just not the done thing in mathematical ecology. Ecologists of the world, rise up! Throw off your chains!)

They also do some case studies with real data, calculating, for instance, how to maximize the 2-diversity of a system of 18 breeds of cattle. They use the concept of the cardinality of a metric space (not by that name). They weren’t the first ecologists to do so… but I’ll save that story until the end.

Does mixing increase diversity?

We tend to think of high-entropy situations as those in which a kind of blandness has been attained through everything being mixed together, like white noise, or snow on a TV screen, or all the pots of paint combined to make brown. In other words, we expect mixing to increase entropy. We probably expect it to increase diversity too.

This principle is often formalized as ‘concavity’. Suppose you own two neighbouring fields, each containing a hundred farmyard animals. You open a gate between the fields, effectively making one big field containing two hundred animals. If one of the original fields contained just sheep and the other contained just cows, you would have combined two ecosystems of diversity $0$ to make one ecosystem of diversity $> 0$. More generally, concavity says that the diversity of the combined ecosystem is at least as great as the mean of the two original diversities.

To be precise, fix a finite metric space $A = (\{1, \ldots, n\}, d)$ and let $D$ be a function assigning to each probability distribution $\mathbf{p}$ on $\{1, \ldots, n\}$ a ‘diversity’ $D(\mathbf{p}) \in \mathbb{R}$. Then $D$ is concave if for all probability distributions $\mathbf{p}$ and $\mathbf{q}$ on $A$ and all $\lambda \in [0, 1]$, $D(\lambda \mathbf{p} + (1 - \lambda) \mathbf{q}) \geq \lambda D(\mathbf{p}) + (1 - \lambda) D(\mathbf{q}).$ This is seen as a desirable property of diversity measures.

Example  If the metric space $A$ is discrete then the $\alpha$-diversity measure $D_\alpha$ is concave for every $\alpha \geq 0$. (This is in the paper by Patil and Taillie cited in Part 1.)

Question  For a general metric space $A$, is $\alpha$-diversity always concave?

Ricotta and Szeidl, who introduced these diversity measures, were especially interested in the $2$-diversity (Rao’s quadratic index $\mathbf{p}^t \Delta \mathbf{p}$, discussed above). They prove that it’s concave when there are at most $4$ species, and state that when $n \geq 5$, ‘the problem has not been solved’. We can now solve it, at least for $n \geq 6$, by adapting a related counterexample of Tao.

Example  Let $A$ be the $6$-point metric space whose difference matrix $\Delta$ is the symmetric matrix $\left( \begin{matrix} 0 &a &a &b &b &b \\ &0 &a &b &b &b \\ & &0 &b &b &b \\ & & &0 &a &b \\ & & & &0 &b \\ & & & & &0 \end{matrix} \right)$ where $a = 9/25$ and $b = 1/5$. (Recall that $\Delta_{i j} = 1 - e^{-d_{i j}}$.) Let $\mathbf{p} = (0, 0, 0, 1/5, 1/5, 3/5),    \mathbf{q} = (1/5, 1/5, 1/5, 0, 0, 2/5).$ Then $D_2 \left( \frac{1}{2} (\mathbf{p} + \mathbf{q}) \right) = \frac{191}{1250}  <  \frac{192}{1250} = \frac{1}{2} ( D_2(\mathbf{p}) + D_2(\mathbf{q}) ).$ So for this space $A$, Rao’s diversity index is not concave. And we can get a similar example with any given number $n \geq 6$ of points, by adjoining new points at distance $\infty$ from everything else.

Ricotta and Szeidl also proved some partial results on the concavity of the diversity measures $D_\alpha$ for values of $\alpha$ other than $2$. But again, $D_\alpha$ is not always concave:

Fact 5  For all $\alpha \geq 0$, there exists a 6-point metric space whose $\alpha$-diversity measure is not concave.

This can be proved by a construction similar to the one for $\alpha = 2$, but taking different values of $a$ and $b$ according to the value of $\alpha$.

How biologists discovered the cardinality of metric spaces

When Christina Cobbold suggested the possibility of a link between the cardinality of metric spaces and diversity measures, I read the paper she gave me, followed up some references, and eventually found — to my astonishment and delight — that the cardinality of a metric space appeared explicitly in the ecology literature:

Andrew Solow, Stephen Polasky, Measuring biological diversity, Environmental and Ecological Statistics 1 (1994), 95–107.

I don’t know how Solow and Polasky would describe themselves, but probably not as category theorists: Solow is director of the Marine Policy Center at Woods Hole Oceanographic Institution, and Polasky is a professor of ecological and environmental economics.

They came to the definition of the cardinality of a metric space by a route completely different from anything discussed so far. They don’t mention entropy. In fact, they almost completely ignore the matter of the abundance of the species present, considering only their similarity. Here’s an outline of the relevant part of their paper.

Preservation of biodiversity is important, but money for it is scarce. So perhaps it’s sensible, or at least realistic, to focus our conservation efforts on those ecosystems most likely to benefit human beings. As they put it,

The discussion so far has essentially assumed that diversity is desirable and has focused on constructing a measure with reasonable properties. A different view is that it is not so much diversity per se that is valuable, but the benefits that diversity provides.

So they embark on a statistical analysis of the likely benefits of high biodiversity, using, for the sake of concreteness, the idea that every species has a certain probability of providing a cure for a disease. Their species form a finite metric space, and similar species are assumed to have similar probabilities of providing a cure.

With this model set up, they compute a lower bound for the probability that a cure exists somewhere in the ecosystem. The bound is an increasing function of a quantity that they pithily describe as the ‘effective number of species’… and that’s exactly the cardinality of the metric space!

Despite the fact that the cardinality appears only as a term in a lower bound, they realize that it’s an interesting quantity and discuss it for a page or so, making some of the same conjectures that I made again fifteen years later: that cardinality is ‘monotone in distance’ (if you increase the distances in a metric space, you increase the cardinality) and that, for a space with $n \geq 1$ points, the cardinality always lies between $1$ and $n$. Thanks to the help of Simon Willerton and Terence Tao, we now know that both conjectures are false.

Summary

In Part 1 we discussed how to measure a probability space. In Part 2 we discussed how to measure a probability metric space. In both we defined, for each $\alpha \geq 0$, the $\alpha$-diversity of a space (the ‘expected surprise’), the $\alpha$-cardinality (the ‘effective number of points’), and the $\alpha$-entropy (its logarithm).

In Part 1, the uniform distribution played a special role. Given a finite set $A$, the uniform distribution is essentially the only distribution on $A$ with the same $\alpha$-cardinality for all $\alpha  >  0$. (‘Essentially’ refers to the fact that you can also take the uniform distribution on a subset of $A$, then extend by $0$.) Moreover, for every $\alpha  >  0$, it is the distribution with the maximum $\alpha$-cardinality — namely, the cardinality of the set $A$.

In Part 2, where the set $A$ is equipped with a metric, there is in general no distribution with all these properties. However, there’s something that comes close. When $A$ has a non-negative weighting, we have the ‘weight distribution’ on $A$, in which each point of $A$ has probability proportional to its weight. This is essentially the only distribution on $A$ with the same $\alpha$-cardinality for all $\alpha  >  0$. Moreover, under certain further hypotheses, it is also the distribution with the maximum $\alpha$-cardinality — namely, the cardinality of the metric space $A$.

I don’t see the full picture clearly. I don’t know the solution to the general problem of finding the distribution with the maximum cardinality (or entropy, or diversity) on a given metric space. And I don’t know how to incorporate into my intuition the fact that there are some finite metric spaces with negative cardinality, and others with cardinality greater than the number of points.

I also don’t know how one would generalize from distributions on finite metric spaces to distributions on, say, compact or complete metric spaces — but that’s something for another day.

Posted at November 7, 2008 9:45 AM UTC

TrackBack URL for this Entry:   https://golem.ph.utexas.edu/cgi-bin/MT-3.0/dxy-tb.fcgi/1845

### Re: Entropy, Diversity and Cardinality (Part 2)

Wow!

(Apologies for a content-free comment, but you know it’s going to take a while for the rest of us to digest all that…)

Posted by: Eugenia Cheng on November 7, 2008 3:17 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

It’s worth pointing out that the $\alpha-$diversity is very nearly the partition function at temperature $1/\alpha$:

(1)$\zeta(\alpha) = \sum p_i^\alpha$

Has anyone come up with the cardinality of a metric space by looking at the distance between microstates?

The partition function of a Turing machine is related to its halting probability:

(2)$\zeta_U(\alpha) = \sum_{n \in \mathbb{N} such that U(n) is defined} n^{-\alpha}$

At “inverse temperature” $\alpha \ge 1$, the value of the partition function is $1/\alpha$-random — at least if $\alpha$ is computable. A universal Turing machine also defines a metric space. Is the $\alpha-$diversity of a UTM concave?

Posted by: Mike Stay on November 7, 2008 7:14 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

Mike, this looks very interesting — thanks.

You’ve mentioned a lot of concepts that are new to me. Would you be able to give some brief explanations, or some good links? E.g. what are the ‘halting probability’ and ‘partition function’ of a Turing machine? Thanks.

Posted by: Tom Leinster on November 7, 2008 10:32 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

Let $B = \{0, 1\}$ and $S = B^{*}$ be the set of finite binary strings. A universal Turing machine is a partial function

(1)$U:\S \stackrel{\circ}{\to} S$

with the property that for any other Turing machine $M$ (think “programming language”) there is a string $c_M$ (the compiler, which takes programs written in $M$ and turns them into code that can execute on $U$) such that for any program $p$ for $M$, there is a program $q$ for $U$ such that

(2)$U(q) = M(p) \quad and \quad |q| \le |c_M.p|,$

where $c_M.p$ is the concatenation of the compiler and its input and $|\cdot|$ is the length of the string.

It’s a partial function because you can write infinite loops or programs with syntax errors. If $U$ is defined on $p$, we say that $U(p)$ halts.

A prefix-free Turing machine is one in which if a program $p$ halts, then no program of the form $p.q$ halts. The nice thing about prefix-free machines is that you can define a probability space where the probability of a program $p$ is $2^{-|p|}$.

The probability that $U$ halts and outputs the string $x$ is

(3)$P(x) = \sum_{U(p)=x} 2^{-p}$

The probability that $U$ halts—no matter what its output—is called “the halting probability for $U$” or “the Chaitin Omega number for $U$:”

(4)$\Omega_U = \sum_{U(p) halts}2^{-|p|}.$

The idea of Kolmogorov complexity is that $U$ is a language in which you can describe stuff, and (up to a compiler) it doesn’t matter which $U$ you pick. The complexity $K_U(x)$ of a string $x$ is

(5)$K_U(x) = min \{ p | U(p) = x \},$

that is, it’s the number of bits required to describe $x$. From now on, I’ll just assume we’ve picked a universal machine $U$ and leave off the subscripts.

Chaitin’s incompleteness theorem says that when you write out $\Omega$ as a binary fraction, the bits are random: there’s some constant $c$ such that

(6)$K(\Omega[m]) \ge m - c,$

where $\Omega[m]$ is the string consisting of the first $m$ bits of $\Omega$. So any program that outputs $\Omega[m]$ has to be at least $m-c$ bits long; after a certain point, it’s pure information–to get one more bit out, you have to add at least one bit to your program.

(7)$\Omega(D) = \sum_{U(p) halts} 2^{-|p|/D}$

He showed that the value of $\Omega(D)$ is $D$-random when $0 \lt D \le 1$ is computable; that is, at $D=1/2$, you get two bits of the sequence for each bit of program.

I recognized his formula as a kind of zeta function, and in my master’s thesis I defined the zeta function of a Turing machine $U$ as

(8)$\zeta_U(s) = \sum_{U(n) halts} n^{-s}.$

This uses nats instead of bits. I showed that this thing is $1/s$-random at computable $1 \le s$ and proved a bunch of other stuff about it. Tadaki then thought up a statistical ensemble for which this actually is the partition function and computed the randomness of various thermodynamic properties. Like any good partition function, you use it to compute the expectation value of an observable; the expected complexity is the Shannon entropy.

The HALT fractal is what you get when you mark off the unit interval with line segments corresponding to halting programs. It’s fractalline because a universal Turing machine can simulate itself. The geometric zeta function of this fractal is the same as my zeta function. The fractal dimension of such fractals is related to the zeros of the zeta function, and in general they’re complex; these zeros come into play in tube formulas.

Taking complex values of partition functions is Wick rotation, of course, which swaps between statistical mechanics and quantum mechanics. I showed in my thesis that the zeros of the Turing zeta function contain information about which programs halt. It’s unclear what the randomness of the zeta function is at arbitrary complex values; one project I’d like to do is analytically continue the Turing zeta function to the half-plane where the real part is less than 1.

Posted by: Mike Stay on November 8, 2008 1:51 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

By the way, you asked what the partition function (or zeta function) of a Turing machine is; I hope my answer above gives you the idea. But if you haven’t been exposed to partition functions at all, read weeks 216, 217, and 218 of This Week’s Finds.

Posted by: Mike Stay on November 8, 2008 9:04 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

Thanks for all this. I was lucky enough to do a good course on computability theory when I was an undergraduate, but of course it didn’t go nearly as far as the stuff you’re talking about. And as you guessed, I haven’t really been exposed to partition functions. I’ll get reading.

Posted by: Tom Leinster on November 8, 2008 10:01 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

Here’s a six-sentence introduction to partition functions:

The founding fathers of statistical mechanics, Boltzmann and Gibbs, realized that in thermal equilibrium at temperature $T$, a physical system having various states $i$ with energies $E_i$ will be in the $i$th state with a probability proportional to

$exp(- E_i / k T )$

where $k$ is a fundamental constant of nature called Boltzmann’s constant. So, in thermal equilibrium, a system is very unlikely to be in a state whose energy is much greater than $k T$, while states of energy much less than $k T$ are almost equally likely. This is why water is ice when it’s cold but steam when it’s hot. But the above numbers don’t add up to 1, so to get actual probabilities we need to divide them by

$Z(T) = \sum_i exp(- E_i / k T )$

This is called the partition function at temperature $T$. An amazing amount of information can be recovered from this function!

Posted by: John Baez on November 9, 2008 2:20 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

Thanks, John.

Simon has just arrived for a few days’ stay in Glasgow. This afternoon we sat in a café (a real café!) and discussed partition functions. We figured out various cool things. I’m sure they’re totally standard, but they were new to me, so now I have the urge to explain.

First, let me abstract a bit from what you wrote. Your starting data is a finite family $(E_1, \ldots, E_n)$ of non-negative real numbers. From this you build a one-parameter family $(A(\beta))_{\beta \geq 0}$ of measure spaces with underlying set $\{1, \ldots, n\}$. (My $\beta$ is your $1/k T$.) For each $\beta \geq 0$, the measure space $A(\beta)$ is defined by taking the measure of $i \in \{1, \ldots, n\}$ to be $exp(-E_i \beta).$ The partition function $Z$ simply tells you the total measure of each space; that is, $Z(\beta)$ is the total measure of $A(\beta)$. (Now I’m using the notation slightly differently from you, since you have $Z$ as a function of $T$ rather than $\beta$.)

Normalizing, this one-parameter family of measure spaces gives you a one-parameter family of probability spaces, $(\bar{A}(\beta))_{\beta \geq 0}$. Simon showed me how to read off the cardinality of these probability spaces from the partition function. (As in Part 1, I’m using ‘cardinality’ to mean the exponential of the entropy.) The formula is $|\bar{A}(\beta)| = Z(\beta) \cdot exp \left( -\beta \frac{d}{d \beta} \log Z(\beta) \right).$ More generally, the partition function tells us the $\alpha$-cardinality of these probability spaces for every $\alpha\geq 0$: $|\bar{A}(\beta)|_\alpha = \left( \frac{Z(\alpha\beta)}{Z(\beta)^\alpha} \right)^{1/1 - \alpha}$ ($\alpha \neq 1$).

One last thought. You say, John, that an ‘amazing amount of information’ can be recovered from the partition function. But it seems to me that you could have put it more strongly than that: all information can be recovered from the partition function. In other words, the partition function uniquely determines the starting data — the finite family $(E_1, \ldots, E_n)$. (The limit as $T \to \infty$ of $Z(T)$ is the number $n$ of states; then, I guess, if you evaluate $Z(T)$ at $n$ different values of $T$, you can recover $E_1, \ldots, E_n$.)

On the other hand, maybe I know what you mean. For instance, we just computed the $\alpha$-cardinalities of the probability spaces associated to $E_1, \ldots, E_n$ directly from the partition function, i.e. without having to work out the $E_i$s.

Posted by: Tom Leinster on November 10, 2008 3:23 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

Just a minor point, but the energies $E_i$ don’t have to be non-negative. Physically, the only thing that ever really matters is differences between energies (unless you are doing general relativity), so you can just add some constant $c$ onto all the energies. This gives you an extra factor $e^{-\beta c}$ in all your probability factors, but the same factor appears in your partition coefficient, hence cancels out of the probabilities.

The $c$ can be any real number. It will give rise to various addititive or multiplicative constants in various places, but generally these will have no physical significance. In particular, there’s no reason why it can’t be negative and make some or all of the energies negative.

Posted by: Tim Silverman on November 10, 2008 4:44 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

This afternoon we sat in a café (a real café!) and discussed partition functions.

This is nice!

I had gotten a little frustrated about the lack of further joint effort around the Café to further explore the relation of your category cardinality to quantum field theory. Now I am delighted to see the progress. :-)

Posted by: Urs Schreiber on November 10, 2008 6:04 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

Tom wrote:

Simon has just arrived for a few days’ stay in Glasgow. This afternoon we sat in a café (a real café!) and discussed partition functions. We figured out various cool things. I’m sure they’re totally standard, but they were new to me […]

Sounds fun! Reinventing stuff based on a few clues is the best way to learn things, and usually you wind up thinking about things in a slightly different way than they’ve ever been thought about before.

My $\beta$ is your $1/k T$.

Wow — you sound like a real professional. In statistical mechanics $\beta$ is the usual abbreviation for $1/ k T$.

One of the profound lessons of statistical mechanics is that energy and temperature are secretly linked, with $k$ being the conversion factor — just as one of the profound realizations of special relativity is that space and time are secretly linked, with the speed of light being the conversion factor.

But another profound lesson is that $1/k T$ is more important than temperature, and deserves its own name. For example: the ‘third law of thermodynamics’ says $T = 0$ can never be attained. But, you can make the temperature negative, or even infinite! This seems very weird at first — but it seems perfectly sensible if you treat $\beta = 1 / k T$ as fundamental.

So, I’m all for using $\beta$. But it’s more fun if we realize that $\beta = 1 / k T$.

$Z(\beta)$ is the total measure of $A(\beta)$. (Now I’m using the notation slightly differently from you, since you have $Z$ as a function of $T$ rather than $\beta$.)

Again, you’ve switched to doing things the way grown-up physicists do it: $\beta$ is better.

The formula is

$|\overline{A}(\beta)|= Z(\beta) \cdot \exp(-\beta \frac{d}{d\beta} log Z(\beta))$

Interesting quantity! Physicists usually think about the entropy, not its exponential. So, I guess this formula becomes well-known if you takes it logarithm.

The quantity

$-\frac{d}{d \beta} log Z(\beta)$

is also famous: it’s the ‘expected value of the energy’, that is, the mean value of the function $E$ on probability measure space you’re calling $\overline{A}(\beta)$. This is pretty easy to see by direct calculation.

You say, John, that an ‘amazing amount of information’ can be recovered from the partition function. But it seems to me that you could have put it more strongly than that: all information can be recovered from the partition function.

Gee, and I thought ‘amazing’ was putting it strongly. You’re right — I’d forgotten that. But yeah, what I meant is that it’s very easy to compute lots of interesting stuff directly from the partition function. We’ve just seen how to compute the entropy and the expected value of the energy. It’s also easy to compute the variance of the energy — namely, the mean of its square minus the square of its mean. (Most people think of the variance as the square of standard deviation.) If I remember correctly, the variance of $E$ equals

$-\frac{d^2}{d \beta^2} log Z(\beta)$

More generally, we can compute all the moments of $E$ by taking derivatives of $log Z(\beta)$ and messing around.

Posted by: John Baez on November 10, 2008 5:58 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

Another cool thing about partition functions is what happens when you perturb the energy. If the original Hamiltonian $H_0$ is perturbed in proportion to some observable $Q$ (where for the sake of a nice answer, we make the perturbation depend formally on the inverse temperature $\beta$)

(1)$H(n) = H_0(n) + \frac{c}{\beta}Q(n)$

then we get

(2)$\left. \frac{d/d c \; Z(\beta, c)}{Z(\beta, c)}\right|_{c=0} = \frac{\sum_n Q(n) \; exp(-\beta H_0(n))}{Z(\beta)} = \langle Q \rangle(\beta),$

the expectation value of $Q$ at inverse temperature $\beta.$ Taking $Q(n)=H_0(n)$ gives

(3)$\frac{ \sum_n H_0(n) \; exp(-\beta H_0(n)) }{Z(\beta)} = \frac{-d/d \beta \; Z(\beta)}{Z(\beta)} = -\frac{d}{d\beta} \log Z(\beta).$

Energy and surprise are related because you’d be surprised to find a fast-moving particle (high kinetic energy) in a low-temperature gas.

Also, consider electrons in a thin film at low temperature: it requires a lot of energy to move perpendicular to the film, but not much to move within it. It’s clearly a three-dimensional system, but is effectively constrained to two because of the high energies required to access the third dimension. As the temperature rises, the third dimension becomes more and more accessible. So the dimensionality of a system can be temperature-dependent, just like the cardinality.

Posted by: Mike Stay on November 10, 2008 5:38 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

More generally, by taking higher derivatives, we get higher moments or cumulants of the observable (depending on whether we differentiate the partition function or its log). If we take mixed partial derivatives against severable observables, we get covariances and their higher-order analogues.

This stuff is all described by diagrams. In the statistical field theory case, we get Feynman diagrams (with the cumulants given specifically by the connected diagrams), but even in the general case we get to draw circles around collections of dots.

Posted by: Tim Silverman on November 10, 2008 9:31 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

Shh! Don’t tell Tom you’re teaching him quantum field theory. Pretend it’s all about biodiversity.

Posted by: John Baez on November 10, 2008 9:41 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

Oh, well, he’s bound to find out eventually. We can’t keep it from him forever. :-)

Besides, we’re actually teaching him statistical field theory. Pedagogically, I’m kind of fond of the idea of teaching this first (fewer new concepts to swallow at once).

Posted by: Tim Silverman on November 10, 2008 9:51 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

Oh yeah, possibly should mention in passing that the way that the log gives connected diagrams, corresponding to cumumlants, while the partition function itself gives all diagrams, corresponding to moments, is discussed in the Groupoid Cardinality season of JB’s seminars, although rather indirectly since I think he mainly talks about exponentials rather than logarithms. This is also discussed in Chapter 3 of Wilf’s Generatingfunctionology, though with some rather idiosyncratic terminology.

Posted by: Tim Silverman on November 10, 2008 9:41 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

For reference’s sake:

The $n$-th cumulant of $E$ in the canonical ensemble is given by

$\langle E^n \rangle_c = (-1)^n \frac{\partial^n \log Z}{\partial \beta^n},$

where the cumulants of the probability distribution $p(E)$ can be defined via the generating function for the logarithm of the Fourier transform of $p(E)$.

$\log \tilde{p}(k) = \sum_{n=1}^\infty \frac{(-ik)^n}{n!} \langle E^n \rangle_c.$

Posted by: Blake Stacey on November 10, 2008 6:48 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

Is there a special relevance for the logarithm of the Fourier transform of the probability density?

I’ve been following this fascinating discussion and it seems to resonate with a lot of stuff I’m doing. I’m sure I’m reinventing wheels left and right myself.

I’ve been playing around with the Levy skew alpha-stable distribution. Trying to sneak in some of the things discussed here into what I’m doing would be fun.

Posted by: Eric on November 10, 2008 7:29 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

Well, cumulants can be a nice way to summarize the properties of a probability distribution. For example, a Gaussian is completely specified by its first two cumulants (the mean and the variance). If we start with

$p(x) = \frac{1}{2\pi\sigma^2} \exp\left[-\frac{(x-\lambda)^2}{2\sigma^2}\right],$

the Fourier transform also has a Gaussian shape,

$\tilde{p}(k) = \int_{-\infty}^\infty \frac{1}{2\pi\sigma^2} \exp\left[-\frac{(x-\lambda)^2}{2\sigma^2} - ikx\right] = \exp\left(-ik\lambda - \frac{k^2\sigma^2}{2}\right),$

from which we can read off the cumulants:

$\langle x \rangle_c = \lambda,$

and

$\langle x^2 \rangle_c = \sigma^2,$

with all cumulants $\langle x^n \rangle_c$ of higher order identically zero. By contrast, the moments $\langle x^n \rangle$ are generally non-zero, even for a Gaussian centered at the origin. This has implications for diagrammatic computations in field theory which somebody else could probably explain better than I.

Posted by: Blake Stacey on November 10, 2008 8:20 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

Roughly, the $n$th cumulant is that part of the $n$th moment that is only due to new things happening in the $n$th power, and excludes the parts due to things happening at lower powers. For instance, part of the second moment comes from a non-zero mean. The variance is what you get if you throw away this part. And so forth.

Posted by: Tim Silverman on November 10, 2008 9:16 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

One thing I recently rediscovered was that if $p_{0,1}(x)$ represents the density for the change in some random variable between $t=0$ and $t=1$, and $p_{1,2}(x)$ represents the change from $t=1$ to $t=2$ and if we further assume the changes are independent, then

$p_{0,2}(x) = p_{0,1}(x) * p_{1,2}(x),$

where $*$ denotes convolution. Moving to Fourier space, we have

$\tilde{p}_{0,2}(k) = \tilde{p}_{0,1}(k) \tilde{p}_{1,2}(k).$

If we introduce

$\psi(k) = \log \tilde{p}(k),$

then we have the interesting (to me anyway) relation

$\psi_{0,2}(k) = \psi_{0,1}(k) + \psi_{1,2}(k).$

For the Gaussian density, as you noted, we have

$\psi(k) = -ik\lambda-\frac{k^2\sigma^2}{2}$

so that $\lambda_{0,2} = \lambda_{0,1} + \lambda_{1,2}$ and $\sigma^2_{0,2} = \sigma^2_{0,1} + \sigma^2_{1,2}.$

For a Brownian motion with $\lambda_{0,1} = \lambda_{1,2} = \lambda$ and $\sigma_{0,1} = \sigma_{1,2} = \sigma$, we have the familiar expressions

$\lambda_{0,2} = 2\lambda$

and

$\sigma_{0,2} = \sqrt{2} \sigma.$

Looking at things this way helps (me anyway) when $p(x)$ is not Gaussian.

I think this can be tied to the original topic of this very neat discussion some how. Probably via random walks, which are the discrete versions of Brownian motions.

Posted by: Eric on November 10, 2008 10:21 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

Well, a probability distribution spreading out with the square root of time is familiar diffusion behaviour, and I have seen population gene pools modelled with a Fokker-Planck diffusion equation. For example, if you have a diploid species (two copies of each chromosome) in which some gene of interest has two different alleles, $A_1$ and $A_2$, then the frequency of allele $A_1$ is

$x = \frac{[A_1]}{[A_1] + [A_2]} = \frac{i}{2N},$

where $i$ is a nonnegative integer and $N$ is the population size. If we assume a panmixia in which individuals mate at random and the offspring receive one chromosome from each parent, then we can write a diffusion equation for $p(x,t)$, giving its time evolution in terms of mating, mutation and selection pressure:

$\frac{\partial p(x,t)}{\partial t} = -\frac{\partial}{\partial x} [M(x) p(x,t)] + \frac{1}{2} \frac{\partial^2}{\partial x^2} [V(x) p(x,t)].$

Here, the diffusion coefficient $V(x)$ incorporates the effects of mating, which assuming a panmixia turns out to be modelled by the binomial distribution, giving

$V(x) = \frac{x(1-x)}{2N}.$

The drift coefficient $M(x)$ can be chosen to reflect mutations flipping alleles from type 1 to type 2 and vice versa:

$M(x) = \mu_1 (1-x) - \mu_2 x.$

If there is selection pressure at work — say, if the phenotype corresponding to a doubly recessive individual is less able to survive in the wild — then $M(x)$ can include a selection effect:

$M(x) = \mu_1 (1-x) - \mu_2 x + \frac{x(1-x)}{2} \frac{d}{dx}\log \bar{w},$

where the average fitness $\bar{w}$ is something like

$\bar{w} = x^2 w_{11} + 2x(1-x)w_{12} + (1-x)^2 w_{22}.$

If allele $A_1$ were the sickle-cell variant of haemoglobin, for example, then the fitness weighting of the recessive homozygote $w_{11}$ would be substantially less than $w_{22}$. However, carrying one sickle-cell allele gives an individual some protection against malaria, and so $w_{12}$ could be larger than $w_{22}$.

All of this pertains to the diversity of genotypes present within a single species, which is similar to but not quite the same as the diversity of species present within an ecosystem.

Posted by: Blake Stacey on November 11, 2008 1:22 AM | Permalink | Reply to this
Read the post When entropy and ecology collide...
Weblog: Evolving Thoughts
Excerpt: ...the albino silverback blinks once or twice, says knowingly "Yes, yes", and sends those who do understand math to these two posts at The n-Category Café: "Entropy, Diversity and Cardinality" post 1, post 2. If I read it aright,...
Tracked: November 10, 2008 1:31 AM

### Re: Entropy, Diversity and Cardinality (Part 2)

Tom, have you looked into “continuum limits” of cardinalities of metric spaces?

What I mean is: families of metric spaces $A_N$ parameterized by $N \in \mathbb{N}$ or maybe $N \in \mathbb{N}^k$ such that $A_N = \{a_1, \cdots, a_N\}$ and $(d_N)_{ij}$ is such that the limit

$lim_{N \to \infty} |A_N|$

exists.

I am interested in knowing:

suppose we have given some Riemannian metric tensor $g$ on $\mathbb{R}^2$. Can you sprinkle for each $N$ points $A_N$ on $\mathbb{R}^2$ and equip them with a $Z_N$ such that for each ball $B \subset \mathbb{R}^2$ we find $\int_B vol_g = lim_N |A_N|_B| \,,$ where $A_N|_B$ denotes the finite metric space obtained by restricting to those points which lie in $B$.

Posted by: Urs Schreiber on November 10, 2008 6:42 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

It might be fun to look at this for the case of a binary tree (if that makes sense) under two different limits $\Delta t\to 0$ with:

1.) $\Delta x = c \Delta t$ (Exterior Calculus)

2.) $\Delta x = \sqrt{\Delta t}$ (Stochastic Calculus)

(Motivations are Section 4 and Section 5 here)

If I can ever dig out of my hole, I’ll give it a try.

Posted by: Eric on November 10, 2008 7:43 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

Here is my stab at it…

My favorite finite space is the binary tree, but the binary tree is more of a finite model of space and time. The binary tree at one slice in time is a finite model for the real line.

The distance between points is uniform so that

$d_{ij} = |i-j|\Delta x.$

We want to think of the binary tree as representing a random walk where the probability of moving up (down) is $p_u = 1/2$ ($p_d = 1/2$). The probability of being at point $i$ at step $N$ is then given by the binomial distribution

$p_{i,N} = \binom{N}{i}\frac{1}{2^N}.$

I worked out the first few cases by hand:

Case: $N = 1, p_0 = p_1 = 1/2$

The $\alpha = 1$ cardinality of the binary tree at step $N = 1$, with $p_0 = p_1 = 1/2$ is

$|A_1|_1 = 2$

[Note: At step $N$, there are $N+1$ points, so that $|A_N|_1 \le N+1$.]

Case: $N = 2, p_0 = p_2 = 1/4, p_1 = 1/2$

$|A_2|_1 = 2^{3/2} ~ 2.82.$

Case: $N = 3, p_0 = p_3 = 1/8, p_1 = p_2 = 3/8$

$|A_3|_1 = (16/3)^{3/4} ~ 3.51.$

Then, if I cheat and look at wikipedia, I see that

$H_1(A_N) = \frac{1}{2} log \frac{\pi N e}{2} + O(\frac{1}{N}),$

which means that as $N$ increases

$|A_N|_1 ~ \sqrt{N}.$

But since time increases as the square root of space, i.e.

$\Delta x = \sqrt{\Delta t},$

this means that the cardinality of this finite space grows directly with time (since space is proportional to $N$).

I’m not sure what this means, if anything, but thought it was kind of neat.

Posted by: Eric on November 11, 2008 1:02 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

Oopsies…

But since time increases as the square root of space

Should be….

But since space increases as the square root of time

I think I misinterpreted..

It should be that time increases with $N$ so that the cardinality increases with the size of the space (rather than the number of points in the space). I think that is actually neater.

Posted by: Eric on November 11, 2008 1:08 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

Argh. I’m tired and suffering from caffeine overdose. Please ignore everything I said after

$|A_N| ~ \sqrt{N}.$

The math is more correct than the words.

Posted by: Eric on November 11, 2008 1:13 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

Urs wrote:

Tom, have you looked into “continuum limits” of cardinalities of metric spaces?

This sounds rather like what I suggested in my original post about cardinality of metric spaces: to get the cardinality of a compact metric space $X$, try taking a nested sequence $A_0 \subseteq A_1 \subseteq \cdots$ of finite subspaces of $X$, whose union is dense in $X$, and put $|X| = lim_{N \to \infty} |A_N|.$ (Of course, there are multiple ways in which this might go wrong.) Is that the kind of thing you had in mind? If not, I guess I don’t get your idea.

Posted by: Tom Leinster on November 12, 2008 4:15 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

Tom wrote:

This principle is often formalized as ‘concavity’. Suppose you own two neighbouring fields, each containing a hundred farmyard animals. You open a gate between the fields, effectively making one big field containing two hundred animals. If one of the original fields contained just sheep and the other contained just cows, you would have combined two ecosystems of diversity 0 to make one ecosystem of diversity > 0.

A similar thought experiment is often discussed in statistical mechanics: take a box of oxygen and a box of nitrogen separated by a wall, poke a hole in the wall, and figure out how the entropy goes up as the gases mix.

A closely related puzzle is called the Gibbs paradox. Why doesn’t the entropy go up if both boxes are initially filled with oxygen?

Or does it? After you poke the hole in the wall, you know less about where each molecule is. More ignorance means more entropy, right? But in statistical mechanics, a process that increases entropy can typically be milked to do useful work, and that doesn’t seem to be the case here.

This puzzle was raised by Gibbs back in 1876. It’s been resolved many times in many ways, but it still triggers arguments. The Wikipedia article I linked to above has a good introduction to the issues.

And these issues are actually related to biodiversity! Tom has raised questions like “Does an ecosystem have high biodiversity if it has dozens of species of eels indistinguishable except by experts?” Similarly: “Should we say the entropy increases when we poke a hole between boxes that hold gases that are only distinguishable by subtle tests — say, two isotopes of oxygen?”

There’s a nice discussion of these ‘similarity’ puzzles in that Wikipedia article. The article also has links to further discussion…

Posted by: John Baez on November 10, 2008 9:39 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

Is there a continuation in this context of what you told us about operads in the first part?

Posted by: David Corfield on November 12, 2008 8:59 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

Good question.

Last time we had an operad whose $n$-ary operations were the probability structures ($=$ probability distributions) on $\{1, \ldots, n\}$. So, the obvious ‘continuation’ would be to build an operad whose $n$-ary operations are the probability metric structures on $\{1, \ldots, n\}$. I know of no useful way of doing this.

The problem is this. Given a probability metric space $A$ with $n$ points and, for each $i \in \{ 1,\ldots, n\}$, a probability metric space $X_i$ with $k_i$ points, we have to define some sort of ‘composite’ probability metric space with $k_1 + \cdots + k_n$ points. In other words, we have to define both a probability distribution and a metric on the set $\{ 1, \ldots, k_1 + \cdots + k_n\}.$ We know what probability distribution to put on it. (That was the operad structure in the ‘Aside’ of Part 1.) The question is, what’s the metric?

Let’s do a simple case to see the difficulty. Suppose that $n = 2$, so that $A$ is a 2-point space and we have two other spaces, $X_1$ and $X_2$. The composite $A \circ (X_1, X_2)$ is supposed to be a space whose underlying set is the disjoint union of the underlying sets of $X_1$ and $X_2$. So we imagine taking the disjoint union of $X_1$ and $X_2$ and putting a metric on it. It’s easy to guess what the distance should be between two points of $X_1$, and similarly between two points of $X_2$. But we also have to define the distance between a point of $X_1$ and a point of $X_2$, in some way that depends on the distance between the two points of $A$. I don’t see how to do this.

There’s something to say, though.

First, going back to Part 1, an operadic composite $X = A \circ (X_1, \ldots, X_n)$ of probability spaces can be thought of as a space $X \to A$ ‘fibred’ over $A$ (though really the map is just a plain old measure-preserving map, meaning that the measure of a subset of $A$ is equal to the measure of its preimage in $X$). The fibres are the $X_i$s. In Part 1 we saw a formula for the entropy of $X$ in terms of $A$ and the entropies of the $X_i$s.

Now, there’s a definition of ‘fibration’ of metric spaces (which I’ll reveal on request). Better, there’s also a definition of fibration of probability metric spaces. And when you have a fibration $X \to A$ of finite probability metric spaces, there’s a formula for the 1-entropy of $X$ in terms of $A$ and the 1-entropies of the fibres $X_i$. It looks identical to the formula we had before.

Posted by: Tom Leinster on November 12, 2008 11:33 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

But we also have to define the distance between a point of $X_1$ and a point of $X_2$, in some way that depends on the distance between the two points of $A$. I don’t see how to do this.

Would the ‘obvious’ way of declaring the distance between points from different subsets to be the distance between the two points of the space $A$ not work?

If I have what I take to be a two species ecology of earwigs and eels, and later refine them into greater and lesser subspecies, it seems reasonable to take as distance between greater eel and lesser earwig, the original distance between eel and earwig, doesn’t it?

How do we request you reveal your definition of fibration of metric spaces?

Posted by: David Corfield on November 12, 2008 3:49 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

Would the ‘obvious’ way of declaring the distance between points from different subsets to be the distance between the two points of the space $A$ not work?

No: what you get might violate the triangle inequality.

How do we request you reveal your definition of fibration of metric spaces?

Like that.

Let $A$ and $B$ be metric spaces. (I won’t assume that they’re symmetric, and I’ll allow $\infty$ as a distance.) Let $p: A \to B$ be a map, and for each $b \in B$, write $A_b = p^{-1}\{b\}$. Then $p$ is called (by me) a fibration if

for all $b, b' \in B$ with $d(b, b')  <  \infty$,
for all $a \in A_b$, there exists $a_{b'} \in A_{b'}$
such that for all $a' \in A_{b'}$,
$d(a, a') = d(b, b') + d(a_{b'}, a')$. This seems to be the right thing. How it fits in more generally is another question; there’s much I don’t understand about fibrations of enriched categories.

Posted by: Tom Leinster on November 12, 2008 6:32 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

No: what you get might violate the triangle inequality.

So if three points were in distinct $X_i$, things would be OK. And if they’re in the same $X_i$, it’s also OK.

The worry, I take it, is when points 1 and 3 are in $X_1$ and point 2 is in $X_2$, and distances in $A$ are such that jumping there and back between two points of $A$ is shorter than jumping from point 1 to 3 in $X_1$.

So that’s as though the distance between the greater eel and the lesser eel were greater than twice the distance between eel and earwig. Hmm.

Is there no way to stop that happening?

Posted by: David Corfield on November 12, 2008 6:59 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

From my time in machine learning, I remember interest in ultrametric distances to help with hierarchical classifications. There you’re controlling distances so that points with a more recent ancestor are always closer than those with an earlier ancestor.

Posted by: David Corfield on November 12, 2008 7:22 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

David wrote:

Is there no way to stop that happening?

I don’t think I’d want to stop it.

Let’s think about a slightly different situation: topological rather than metric spaces. Suppose you were given a topological space $A$ together with a family $(X_a)_{a \in A}$ of topological spaces, and someone asked you to topologize the disjoint union $\sum_a X_a$. There’s no obvious way to do it — at least, no obvious way that makes use of the topology on $A$. In order to assemble the various ‘fibres’ $X_a$ into one big space, you need to know they’re related as $a$ moves around in $A$.

Similarly, in the metric situation, you need to know how the spaces $X_i$ and $X_j$ are related for different $i, j \in A$.

As you know, there are many theorems saying something like ‘a fibration over $B$ is the same thing as a $B$-parametrized family’. Here’s one of them.

Notation: given maps $f, g: A \to B$ of metric spaces, I’ll write $d(f, g)$ for $\sup\{d(f(a), g(a))| a \in A\}$. (This is distance in the enriched functor category $[A, B]$.)

Theorem  Let $B$ be a metric space. Then a fibration over $B$ amounts to

• a family $(A_b)_{b \in B}$ of metric spaces
• a family $(f_{b, b'}: A_b \to A_{b'})$ of maps of metric spaces, indexed over pairs $(b, b')$ of points of $B$ such that $d(b, b')  <  \infty$

such that

• for all $b \in B$, we have $d(f_{b, b}, id) = 0$
• for all $b, b', b'' \in B$ with $d(b, b'), d(b', b'')  <  \infty$, we have $d(f_{b, b''}, f_{b', b''} \circ f_{b, b'})  \leq  d(b, b') + d(b', b'') - d(b, b'').$

The ‘transfer maps’ $f_{b, b'}$ are what relate the various fibres. In the definition of fibration, the point ‘$a_{b'}$’ would be $f_{b, b'}(a)$.

(And, incidentally, if you want to do probability metric spaces you ask for the transfer maps to be measure-preserving.)

I guess there’s a general enriched category result of which this is a special case. I was deep into working out what it must be when a jellyfish stung me; I haven’t been back to it since.

Posted by: Tom Leinster on November 12, 2008 8:08 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

Ecologically speaking, your fibration of metric spaces is a little odd. For every fish there would have to be a mammal most like it.

Hmmm, Shark - Dolphin, Eel - Otter, Haddock - ?.

I wonder how ecologists come up with their distances. How do they avoid the John Locke error? He argued against natural kinds in the animal kingdom by claiming that there is a continuum between fish and birds, with little difference between flying fish and diving birds.

I mentioned ultrametric distances in hierarchical classifications above. Here is a paper using $p$-adic numbers for such purposes. I suppose this is as far away as possible from your fibrations, with all elements of a fibre being the same distance from one of another fibre.

What’s a simple (non-trivial) example of a fibration of metric spaces? The projection of $\mathbb{R}^2$ onto its first component doesn’t count. For an element in a fibre there is a nearest element in another fibre, but it’s squares of distances which is needed to calculate distances to other points in the fibre.

Posted by: David Corfield on November 13, 2008 9:55 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

I wonder how ecologists come up with their distances. How do they avoid the John Locke error? He argued against natural kinds in the animal kingdom by claiming that there is a continuum between fish and birds, with little difference between flying fish and diving birds.

That’s not specifically Locke’s error, though, is it? I believe it goes back at least to Aristotle, and perhaps beyond him to Greek (and pre-Greek) folklore. In fact, on this subject, see Lovejoy, The Great Chain of Being (pretty much passim), where he talks about the widespread meme of nature as a continuous plenum without gaps or natural borders.

Posted by: Tim Silverman on November 13, 2008 3:54 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

Interesting. I know that Locke used his example as empirical evidence against natural borders. He also had a ‘conceptual’ argument to the effect that he couldn’t conceive how we could look beyond the appearances of, say, samples of yellowish metal to ascertain the essential properties of gold.

Posted by: David Corfield on November 14, 2008 8:51 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

I [David] mentioned ultrametric distances in hierarchical classifications above. … I suppose this is as far away as possible from your [Tom’s] fibrations, with all elements of a fibre being the same distance from one of another fibre.

Phylogenetically, this is correct (at least for multicellular creatures), but an ultrametric may not be right ecologically. Distantly related species may occupy similar niches while closely related species occupy very different niches; it’s more analogy than homology that matters. (Technically, a similarity doesn’t count as analogous unless it’s not also homologous, but as a mathematician I know that that’s a foolish definition. Incidentally, does anybody know a good web link for the difference between these?)

Posted by: Toby Bartels on November 14, 2008 6:45 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

David wrote:

What’s a simple (non-trivial) example of a fibration of metric spaces?

I never answered that directly. Let me do it now.

For any $x  >  0$, let $C_x$ be the circle of circumference $x$, where the distance from one point $a$ to another point $b$ is the length of the clockwise arc from $a$ to $b$. (This is a non-symmetric metric.) For each $n \geq 1$, there is an $n$-fold cover of $C_x$ by $C_{n x}$. This is a fibration. When $n \geq 2$, it is non-trivial in the sense of not being a product-projection.

One reason why I think I’ve got the appropriate definition of fibration is that it allows you to prove a cardinality formula of the usual kind. In this example the fibre $F$ is constant: it’s the set of vertices of a regular $n$-gon, metrized as a subspace of $C_{n x}$. So $|C_{n x}| = |C_x| \times |F|.$ In fact, we already calculated all three cardinalities elsewhere, and you can check directly that this equation holds.

(I’ve assumed throughout this comment that all the cardinalities involved are well-defined, which in the case of circles remains to be established.)

Posted by: Tom Leinster on November 17, 2008 8:07 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

It’s interesting that the symmetric version of this doesn’t work. I did the calculation back in February, but I guess I didn’t do it for the non-symmetric circle.

What I mean by not working is that if you have the similar map from ${C^{sym}_{n x}}$ the symmetrically metric circle of circumference $n x$ to ${C^{sym}_{x}}$ the circle of circumference $x$ then this isn’t a fibration in Tom’s sense, and furthmore you don’t have the product formula, so

(1)$|{C^{sym}_{n x}}| \ne |{C^{sym}_{x}}|\times |F|.$

Of course this is a mixed blessing. It means that the cardinality doesn’t have this nice property, which would aid calculation, but instead it’s more subtle and hence probably a stronger invariant.

However I should point out that this product formula does hold asymptotically. As $x\to\infty$ the cardinalities become respectively $n x$, $x$ and $n$ and you can observe that

(2)$n x = x \times n,$

(as multiplication is commutative).

Posted by: Simon Willerton on November 17, 2008 9:41 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

David wrote:

Ecologically speaking, your [definition of] fibration of metric spaces is a little odd.

I see what you mean. Maybe fibrations aren’t very useful in the ecological setting. However, I don’t want to be tied by that particular application, and your original question was purely mathematical.

Any first projection $B \otimes X \to B$ is a fibration. Here $B \otimes X$ is the set $B \times X$ with the $d_1$ metric (add distances in each coordinate); it’s the tensor product of metric spaces as enriched categories. These would usually be called the ‘trivial fibrations’.

They’re not the only fibrations. In the usual way of things, you can get non-trivial fibrations by gluing together trivial fibrations. For instance, there’s an example that looks like a Mobius strip fibred over a circle.

Re ultrametrics, the paper of Pavoine et al cited in my post discusses them at length. A prime example of an ultrametric on species is this: given two species, measure how far up the evolutionary tree you have to go before you find a common ancestor.

Posted by: Tom Leinster on November 14, 2008 4:59 AM | Permalink | Reply to this

### New definition of entropy? I knew that sooner or later my wanderings would lead me back to these posts of Tom. I have a number of peanuts to throw from the gallery, all on different aspects of it. I’d like to start with this one:

Can it be that the ordinary definition of entropy is “wrong”, while Tom-and-the-population-biologist’s definition is “right”?

I feel that Tom’s refined notion of entropy (which takes into account the distance between states) completely resolves the Gibbs paradox, as hinted at by John above. Can somebody who knows stuff about statistical physics explain more about this point?

Four months has gone by, so let me recall the situation (I’m not going to bother with the “$\alpha$-entropies” - I’m just going to set $\alpha=1$). Say a system can be in a bunch of states numbered $1, 2, \ldots , N$, and the probability for being in each state $i$ is given by $p_i$. The ordinary, information-theoretic definition of the entropy of the probability distribution $(p_1, p_2, \ldots, p_n)$ is

(1)$S = -\sum_i p_i \log p_i.$

Tom’s definition proceeds from the observation that the states are not just ‘distinguishable’, but rather we can usually measure the distance $d_{ij}$ between states, and we should take these distances into account to refine our notion of entropy. In this way, we define $Z_ij$ as the matrix recording these distances,

(2)$Z_{ij} = e^{-d_{ij}},$

and then we declare that we are dealing with not simply a run-of-the-mill probability space, but rather a probability metric space. Its entropy is now defined as

(3)$S' = - \sum_i p_i \log(Zp)_i .$

This makes a lot of sense to me. Let’s recall the Gibbs paradox:

It involves the discontinuous nature of the entropy of mixing. This discontinuous nature is paradoxical to the continuous nature of entropy itself.

Suppose we have a box divided in half by a movable partition. On one side of the box is an ideal gas A, and on the other side is an ideal gas B at the same temperature and pressure. When the partition is removed, the two gases mix, and the entropy of the system increases because there is a larger degree of uncertainty in the position of the particles. It can be shown that the entropy of mixing multiplied by the temperature is equal to the amount of work one must do in order to restore the original conditions: gas A on one side, gas B on the other. If the gases are the same, no work is needed, but given a tiniest difference between the two, the work needed jumps to a large value, and furthermore it is the same value as when the difference between the two gases is great. (italics added by me)

The point is that there would be no discontinuity if we had from the outset taken the distances into account. That is, we would declare the distance between the states to have a factor specifying the distance between the “types of molecule” involved (so an oxygen and a lead molecule would have a large difference, but an oxygen and an oxygen molecule would have a difference of zero).

Looking on Wikipedia, I am reminded of the famous physics definition of entropy:

(4)$S = k \log \Omega.$

That is, the entropy is proportional to the logarithm of the number of microscopic configurations that result in the observed macroscopic description of the system. Apparantly it’s not completely clear how to relate this to the information theory idea of entropy as given above. I would be very grateful if someone could explain this a bit more to me. My intuition is that Tom’s definition gives us such a bridge - because in order to calculate that $\Omega$ occurring above, it seems we will need a metric on our space of states. Is this reasoning valid?

Posted by: Bruce Bartlett on March 14, 2009 6:49 PM | Permalink | Reply to this

### Feynman, Bennett, Maxwell, Loschmidt, Boltzmann; Re: New definition of entropy?

We have only a very vague understanding of thermodynamics in General Relativity. There is no accepted well-formed framework which would allow one to make good scientific sense of entropy and probability “of the universe”. For instance it is unclear which level of coarse-graining is assumed for the definition of entropy.

Feynman showed me the Charles Bennett connection of Information entropy to Thermodynamic entropy via Maxwell’s Demon. Cf. Maxwell’s Demon 2: Entropy, Classical and Quantum Information, Computing (Paperback)
by Harvey Leff (Editor), Andrew F. Rex (Editor)

Over 130 years ago, James Clerk Maxwell introduced his hypothetical “demon” as a challenge to the scope of the second law of thermodynamics. Fascination with the demon persisted throughout the development of statistical and quantum physics, information theory, and computer science, and links have been established between Maxwell’s demon and each of these disciplines. The demon’s seductive quality makes it appealing to physical scientists, engineers, computer scientists, biologists, psychologists, and historians and philosophers of science. Since the publication of Maxwell’s Demon: Entropy, Information, Computing in 1990, Maxwell’s demon has been the subject of renewed and increased interest by numerous researchers in the fields mentioned above. Updated and expanded, Maxwell’s Demon 2: Entropy, Classical and Quantum Information, Computing retains many of the seminal papers that appeared in the first edition, including the original thoughts of James Clerk Maxwell and William Thomson; a historical review by Martin Klein; and key articles by Leo Szilard, Leon Brillouin, Rolf Landauer, and Charles Bennett that led to new branches of research on the demon. This second edition contains newer articles by Landauer, Bennett, and others, related to Landauer’s principle; connections with quantum mechanics; algorithmic information; and the thermodynamics and limits of computation. The book also includes two separate bibliographies: an alphabetical listing by author and a chronological bibliography that is annotated by the editors and contains selected quotes from the books and articles listed. The bibliography has more than doubled in size since publication of the first edition and now contains over 570 entries.

I also am haunted by Loschmidt’s paradox, also known as the reversibility paradox, currently summarized in wikipedia beginning: the objection that it should not be possible to deduce an irreversible process from time-symmetric dynamics. This puts the time reversal symmetry of (almost) all known low-level fundamental physical processes at odds with any attempt to infer from them the second law of thermodynamics which describes the behaviour of macroscopic systems. Both of these are well-accepted principles in physics, with sound observational and theoretical support, yet they seem to be in conflict; hence the paradox.

Johann Loschmidt’s criticism was provoked by the H-theorem of Boltzmann, which was an attempt to explain using kinetic theory the increase of entropy in an ideal gas from a non-equilibrium state, when the molecules of the gas are allowed to collide. Loschmidt pointed out in 1876 that if there is a motion of a system from time t0 to time t1 to time t2 that leads to a steady decrease of H (increase of entropy) with time, then there is another allowed state of motion of the system at t1, found by reversing all the velocities, in which H must increase. This revealed that one of the key assumptions in Boltzmann’s theorem was flawed, namely that of molecular chaos, that all the particle velocities were completely uncorrelated. One can assert that the correlations are uninteresting, and therefore decide to ignore them; but if one does so, one has changed the conceptual system, injecting an element of time-asymmetry by that very action.

Reversible laws of motion cannot explain why we experience our world to be in such a comparatively low state of entropy at the moment (compared to the equilibrium entropy of universal heat death); and to have been at even lower entropy in the past….

Posted by: Jonathan Vos Post on March 14, 2009 9:58 PM | Permalink | Reply to this

### Re: New definition of entropy?

Von Neumann clearly thought there was a close connection between physical and information-theoretic entropy. Apparently he persuaded Shannon to use the physicists’ word:

You should call it ‘entropy’ and for two reasons; first, the function is already in use in thermodynamics under that name; second, and more importantly, most people don’t know what entropy really is, and if you use the word ‘entropy’ in an argument you will win every time.

I found this quote on p.18 of a nice book by Oliver Johnson, Information Theory and the Central Limit Theorem. And, in fact, Johnson describes the connection between the two entropies on the very same page, a copy of which is here.

Anyway, your main point was that the Gibbs Paradox on the ‘discontinuous nature of the entropy of mixing’ is resolved if we’re sensitive to the similarity of the two gases… right? I don’t know how this relates to the three graphs on the Wikipedia page.

Since I’ve been concentrating on ecology rather than statistical physics, I’ve been thinking less about gases and more about lemurs.

Sometimes two subspecies of the same species are reclassified as separate species. For example, this happened in December with the Red Lemur and the Red-fronted Lemur, based on new evidence (some genetic). (Like gas molecules, lemurs are apt to jump around all over the place.) Imagine a population consisting of equal numbers of red and red-fronted lemurs. If we were very crude and decreed that two animals were either in the same species or not, then the Shannon entropy of our population would be $0$… until reclassification, when it would suddenly jump to $\log(2)$. But, of course, a more nuanced approach taking similarity into account (e.g. based on genetic distance) would see only a small change in entropy.

Posted by: Tom Leinster on March 15, 2009 7:38 AM | Permalink | Reply to this

### Re: New definition of entropy?

Incidentally, I’m guiltily aware of committing what I gather is a misdemeanour in academic ecological circles: the invoking of charismatic megafauna. A pro would perhaps have talked about the taxonomic reclassification of lichen, secure in the knowledge that lichen’s just as interesting and important as lemurs.

Posted by: Tom Leinster on March 15, 2009 7:53 AM | Permalink | Reply to this

### Re: New definition of entropy?

Might someone explain to me the argument equating the physicists’ definition of entropy $S = k \log \Omega$ with the information-theorists’ one? I’m getting stuck on the first equation in (1.79):

Suppose each microstate $r$ can occur with probability $p_r$. Consider a very large ensemble of $v$ replicas of the same system, then on average there will be $v_r$ replicas in the $r$th microstate, where $v_r$ is the closest integer to $vp_r$. In this case,

(1)$\Omega = \frac{v!}{v_1 ! v_2 ! \cdots v_k !}$
Posted by: Bruce Bartlett on March 15, 2009 4:27 PM | Permalink | Reply to this

### Re: New definition of entropy?

Tom Leinster invited me to drop by, and I am happy to explain as best I can (it’s not my argument, I should add, but I am happy to try to explain it anyway!!).

I don’t think it is meant to be a 100% rigorous derivation of anything, more a plausibility argument that these two quantities should coincide. The part of the argument above the displayed equations is particularly hand-wavy - of course there won’t be exactly $\nu_r$ versions of this state, but it’s a law-of-large-numbers guesstimate.

Then, the displayed equation comes out of a counting argument. That is, in total there are $\nu!$ ways of arranging the $\nu$ objects (replicas) in order. However, this is overcounting, since there are $\nu_1$ identical objects of type 1, so we need to divide by a factor of $\nu_1!$, and similarly $\nu_2!$ and so on.

It may be that the language I am expressing it in is not familiar to physicists - the argument is on Page 29 of Statistical Mechanics by Singer (and probably in Mandl as well!)

I hope this makes sense..

Posted by: Olly Johnson on March 16, 2009 3:47 PM | Permalink | Reply to this

### Re: New definition of entropy?

The question is, how much work can we extract by letting the two gasses mix? This depends not only on the nature of the gasses but also on the nature of the engine we use to extract the work. If the engine can’t tell the two gasses apart (e.g. because they’re identical), then it obviously can’t extract any work. If it can tell them apart perfectly (I mean in a practical way that enables it to extract work as they mix, not just in some theoretical, “Hmm, it says in this book that these gasses are different” way), then it can extract the maximum work implied by the Second Law. If its discriminating abilities are imperfect, (e.g. because the two gasses are very similar, or because the engine’s filter is a bit crude) then it will be able to extract an intermediate amount of work. The entropy describes the amount of implicit “knowledge” that the engine possesses about the system by virtue whatever the mechanisms are that it’s using to extract work. That’s why there’s a sort of subjective element to entropy, even though once you’ve accepted a particular subjective view (“These gasses are, for practical purposes, the same”), the entropy is strictly defined. Of course, in calculations, you tend to abstract away from particular engines, with their subjective viewpoints (or levels of “knowledge”), by assuming ideal engines with the maximum discrimination implied by the problem at hand.

That’s one way to look at it, anyway. If the gasses’ being “very slightly” different means that you’re taking the view that even an ideal engine couldn’t discriminate between them perfectly, then you could adapt your definition of entropy accordingly.

Posted by: Tim Silverman on March 16, 2009 12:38 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

Incidentally, here is an intuitive way of thinking of Tom’s “weightings” (recall that Tom defines the cardinality of a finite metric space as the the sum of the weightings). The way of thinking I am presenting here is basically Tom’s money transfer idea.

Suppose you have a bunch of families - the Johnsons, the Winterbottoms, the Rayleighs and the Snapes: Think of this as a metric space where each family is a “point” and the distance between families measures how closely they are related.

Suppose now the patriarchs of all these families are writing up their wills. Since their grandchildren have been naughty and have forgotten family ties, they decide to play a naughty game on their respective families. Grandfather Johnson announces to his family: “I am going to give away my money to my relatives in order of their closeness. I will give 10% to each honourable Leinster, 5% to each honourable Rayleigh and to each honourable Snape, and 2% to each honourable Winterbottom! Be warned though: for each scoundrel in these families, I will extract that money and not give it.” The other patriarchs make similar announcements. Their grandkids are a bit confused — now they have to figure out how many Johnsons, Winterbottoms, etc. there are!

This number is the weighting of that family. What looked like a point from a distance, is revealed to consist of a number of points. It also explains how the weighting of a metric space can be negative — this occurs when most of the points are scoundrels!

Posted by: Bruce Bartlett on March 15, 2009 5:33 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

I would like to make contact somehow between entropy, diversity, cardinality etc. and random walks on graphs.

Suppose we have a finite metric space, and we are able to put a weighting on it — i.e. a number $w_i$ assigned to each point $i$ such that for all points $i$,

(1)$\sum_j w_i e^{-d_{ij}} = 1.$

Then Tom’s Rough Answer above says that the distribution which will maximize the entropy is precisely the weight distribution — the distribution where the probability $p_i$ at a point $i$ is proportional to the weight $w_i$ of that point.

The weight distribution is secretly the uniform distribution, because each point $i$ secretly consists of $w_i$ points when we look closely! (See the family example).

The first thing to realize is that given the weighting $w$, we can consider our metric space as a “park” in which someone can take a random walk. If you are at point $i$, then the probability that you go to point $j$ in the next step is given by

(2)$P_{ij} = w_j e^{-d_{ij}}.$

When people talk about random walks, they appear to talk about the concept of a ‘limiting distribution’ — some kind of notion about the limiting behaviour of the paths as the number of steps go to infinity. Is this “limiting distribution” the weight distribution?

Posted by: Bruce Bartlett on March 15, 2009 5:50 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

Again this makes me think of Johnson’s book. A major theme of the book is that the Central Limit Theorem can be understood as a result about maximum entropy.

First ‘recall’ the usual statement of the Central Limit Theorem. Suppose we roll a fair die once. Then the probability distribution is uniform: Now suppose we roll it twice and take the total. The distribution is no longer uniform; it’s like one of the pyramids of Egypt: Now suppose we roll it three times and take the total. The distribution now becomes more curvaceous: And you probably know what’s coming: as the number of rolls increases, the shape converges towards the normal distribution.

The Central Limit Theorem says that this happens whatever probability distribution you start with — it doesn’t have to be uniform. (It’s really meant to be a continuous distribution, too, not a discrete one like my example.)

So, take a probability distribution $f$ on $\mathbb{R}$. Let $s_n$ be the probability distribution corresponding to the sum of $n$ numbers chosen at random according to $f$. We can’t quite say that $(s_n)$ converges to a normal distribution, because its mean and standard deviation will change with $n$. But look at the graphs above, and you’ll see that I sneakily rescaled them to make them look comparable. Similarly, let’s rescale the probability distribution $s_n$ linearly so that it has the same mean and standard deviation as $f$; call the result $f_n$.

Theorem:  The sequence $(f_n)$ of distributions converges to the normal distribution with the same mean and standard deviation as $f$.

Now here’s the entropy version of the Central Limit Theorem. Intuitively, a distribution has high entropy if it’s ‘well spread out’, or close to uniform. For probability distributions on a finite set, the distribution with the maximum possible entropy is indeed the uniform distribution. If we’re interested in distributions on $\mathbb{R}$ — that is, ‘probability distributions’ in the sense above — then there’s no such thing as the uniform distribution. But we can ask the question:

Among all probability distributions on $\mathbb{R}$ with mean $0$ and variance $1$, which has maximum entropy?

And, as we discussed before, the answer is the normal distribution. (And of course the same is true for any specified mean and variance.) That’s the entropy version of the Central Limit Theorem.

So what’s the connection between the two versions? What makes it fair to call them both the ‘Central Limit Theorem’? As I understand it, the story goes something like this.

Given a probability distribution $f$ on $\mathbb{R}$, let $s$ be the probability distribution of the sum of two numbers chosen at random according to $f$, and let $\Phi(f)$ be $s$ rescaled to have the same mean and variance as $f$. (In the notation above, $s = s_2$ and $\Phi(f) = f_2$.) Then $\Phi$ is an endomorphism of the space of all probability distributions. Moreover, you’d expect $\Phi(f)$ to have higher entropy than $f$, because by running two trials we’ve spread things out a bit. So the sequence $f, \Phi(f), \Phi^2(f), \Phi^3(f), \ldots$ of distributions has ever-higher entropy, and with luck it’ll converge towards the probability distribution with maximum entropy. (In the notation above, this sequence is $f, f_2, f_4, f_8, \ldots$.) The classical version of the Central Limit Theorem says that it converges towards the normal distribution. The entropy version says that the distribution with maximum entropy is the normal distribution.

Probably someone here knows to what extent this sketch is accurate. For example, is $\Phi$ in any sense a contraction, so that the Central Limit Theorem is an instance of the Contraction Mapping Theorem?

All of the above is apropos of Bruce’s comment:

When people talk about random walks, they appear to talk about the concept of a ‘limiting distribution’ — some kind of notion about the limiting behaviour of the paths as the number of steps go to infinity. Is this “limiting distribution” the weight distribution?

Your position in a random walk is the sum of the steps you’ve taken so far, so this looks to me something like the Central Limit Theorem. But the distribution of positions is a distribution on whatever ‘park’ you’re walking in: so if the park isn’t $\mathbb{R}$, you can’t use the usual Central Limit Theorem. I guess probabilists have generalizations of the Central Limit Theorem to spaces other than $\mathbb{R}$, though…?

Anyway, Bruce, I agree with you. The weight distribution on a finite metric space is so canonical that it ought to be in some sense a limiting distribution. I hope there’s a Central Limit Theorem for finite metric spaces that makes this precise.

Posted by: Tom Leinster on March 16, 2009 5:24 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

When I wrote ‘as we discussed before’ I tried to link back to our previous discussion of maximum entropy, but I guess I messed up the code.

Posted by: Tom Leinster on March 16, 2009 5:35 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

Speaking of the central limit theorem, I’ve been working with stable distributions lately that come with a generalized central limit theorem. My spidey senses tell me there is a lot of deep stuff going on so I thought I would throw it into the ring to see if it resonates with anything you are thinking about. Does it? How might stable distributions come into the big picture?

Posted by: Eric on March 16, 2009 6:16 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

In answer to Tom’s question about the operator $\Phi$ being a contraction on the space of all distributions (that is $H(f) \leq k H(\Phi(f))$ for some absolute $k$ less than 1:

1. It’s true for $k = 1$.

2. I’m not aware of any such results with $k$ less than 1 (and I suspect that you could cook up examples based on the gamma distribution to show that it isn’t possible).

3. $\Phi$ is a contraction on a smaller space – for example Artstein, Ball, Barthe and Naor (or myself and Barron) show that it is a contraction on the set of probability distributions with bounded Poincare constant (so finite moments of all order). It would be nice to extend the space, to only require finite 4th moment (say), but I don’t know how to do it.

4. Another paper of ABBN makes the parallel with the Second Law of Thermodynamics clearer, by proving the very nice result that the entropy of $f_k$ is monotone in $k$, not just along the powers-of-2 subsequence.

There’s something curious which is that all these theorems tend not to be proved in terms of entropy, but rather using Fisher information. (The $L^2$ structure involved there is much easier to handle than working with logarithms). You can (in a certain sense) integrate up Fisher informations to get entropy.

Now, the corresponding version of the max-entropy property of the normal is that the Fisher information has minimum value $1/\sigma^2$ (where $\sigma^2$ is the variance), achieved by the normal. This is actually an uncertainty principle. So, in some sense I’ve never really understood, we can integrate up uncertainty principles to get a maximum entropy relation!

Posted by: Olly Johnson on March 16, 2009 4:58 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

Thanks for that, Olly.

In what you wrote about $\Phi$ being a contraction, I wonder if there’s a typo, or if we’re talking at cross-purposes? You wrote ‘$H(f) \leq k H(\Phi(f))$’ where $k \le 1$. Did you mean to write something like $d(\Phi(f), \Phi(g)) \leq k d(f, g)$ where $d$ is some kind of distance between distributions?

Or if you meant to write what you wrote, can you explain how it relates to $\Phi$ being a contraction in the sense of the displayed inequality?

Posted by: Tom Leinster on March 16, 2009 8:55 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

You are entirely right, of course - apologies for that!

I was thinking in terms of the relative entropy $d(f,g) = \int f(x) \log (f(x)/g(x))dx$. In particular, in terms of the relative entropy distance from $f$ to $\phi$ (a normal with variance $\sigma^2$, the same variance as $f$). This satisfies$d(f, \phi) = 1/2 \log(2 \pi e \sigma^2) - H(f).$

The fact that $d$ is positive gives us the proof that the normal has the maximum entropy property for fixed variance, by the way.

Most of the results I mentioned above are best expressed in terms of controlling $d(f,\phi)$ (the distance to the fixed point). However, the fact that $d$ isn’t a metric will make your contraction property, controlling $d(\Phi(f),\Phi(g))$ in terms of $d(f,g)$ hard to prove.

Annoyingly, I can’t find a counterexample though! (I now think gamma distributions may not be the place to look, but Pareto laws with shape parameter arbitrarily close to 2 might be).

Posted by: Olly Johnson on March 17, 2009 1:06 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

Late at the party but…

Iterating Φ does not produce the full sequence $(f_n)$, only its subsequence $(f_{2^n})$ hence the proof you propose cannot deal as it is with the full sequence.

As a matter of fact I seem to remember that the monotony argument you outline fails for the full sequences, in general (but this would have to be checked).

Posted by: Didier Piau on September 13, 2013 9:36 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

Maybe a reversible Markov chain approach will shed some light on this. Suppose $p_{ij}$ are the transition probabilities, and $\pi_i$ the invariant distribution of a Markov chain $X$. “$X$ is reversible” means that $X$ satisfies the detailed balance equations $p_{ij} \pi_i = p_{ji} \pi_j$. Now define $d_{ij}$ by $\exp(-d_{ij}) = p_{ji} / \pi_i$. Then the detailed balance equations ensure that $d$ is symmetric.

Necessarily, $\sum_i \pi_i \exp(-d_{ij}) = 1$, so the weights are the invariant distribution.

This suggests that the weights are the invariant distribution of some particle process on the weighted graph $d_{ij}$, but I haven’t worked out what this process is.

Posted by: Tom on March 16, 2009 6:01 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2) Another peanut to throw from the gallery: how are we to think of these negative probabilities? If we open up books on random walks, you’ll never find someone talking about the probability for going from $a$ to $b$ being a negative number — but that’s precisely what occurs in some metric spaces.

I am convinced Feynman’s essay on “negative probability” has something useful to say on this point.

Posted by: Bruce Bartlett on March 15, 2009 5:55 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

If (like me) you got frustrated by Google Books’s tantalizing omission of crucial pages, you might like this link to an unexpurgated copy of Feynman’s essay.

Posted by: Tom Leinster on March 16, 2009 12:22 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

Hey, why not assign complex transition probabilities?—then you’d be doing quantum field theory.

Posted by: Tim Silverman on March 16, 2009 12:49 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

Hmm, I didn’t really get much from Feynman’s essay. It can’t have helped that I skipped all the physics.

All I understood from it was this: don’t be afraid of negative probabilities. When they appear in intermediate stages of a calculation, they’re perfectly valid — just as negative numbers usually are, even if you’re doing something like ‘number of apples’. And when a negative probability appears as a final answer, it doesn’t necessarily mean that something’s wrong; think a bit harder.

Does anyone else have any insight into negative probabilities?

Posted by: Tom Leinster on March 16, 2009 12:52 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

Scott Aaronson has written a bit on the subject.

Posted by: Blake Stacey on March 16, 2009 2:51 AM | Permalink | Reply to this

### Feynman is God; Re: Entropy, Diversity and Cardinality (Part 2)

I did not invent Negative Probabilities for QM. But it was I who first told Feynman (1968-1970) at Caltech. He replied “That’s totally insane. But not obviously wrong…” and he went off and wrote a paper on the subject.

Posted by: Jonathan Vos Post on March 18, 2009 7:32 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 2)

The ordinary quantum mechanics can be restated in terms of so-called Wigner function, instead in terms of wave function. Usual wave function depends only on coordinates, or only momenta, or only some third representation, because uncertainty principle does not let it correspond to a point in classical phase space. Wigner function on the contrary is classical in the sense that it attaches to points in phase space. The price to pay is that the probabilities in terms of Wigner function can be negative! But in many examples Wigner function shows behaviour closer to classical mechanics than the wave function description. Wigner’s article is 1932 and the whole fact is fact about Fourier expansions, the integral kernel corresponds to a choice of the ordering. Wigner’s function is related to Weyl ordering (now we talk 1927) and Moyal star product. The time independent version satisfies a “star-genvalue equation”.

Now after I said that, I see there is a reasonable wikipedia article though if you like physics I recommend papers by Cutright, Fairlie and Zachos on the topic and their overview in the book Quantum mechanics in phase space.

Posted by: Zoran Skoda on March 19, 2009 4:35 AM | Permalink | Reply to this

### Maximizing diversity

I think I’ve solved the maximum diversity problem.

The maximum diversity problem is this:

Given a metric space and $\alpha   >   0$, which distribution(s) $\mathbf{p}$ on the space maximize the $\alpha$-diversity? What is the maximum value?

Since $\alpha$-entropy, $\alpha$-diversity and $\alpha$-cardinality are all related by order-preserving, invertible transformations, it is equivalent to pose the problem in terms of entropy or cardinality.

The big surprise is that

the answers are independent of the value of $\alpha$.

So although different diversities $D_\alpha$ tend to make different judgements on which distributions are more diverse than which others, there is a single distribution that is maximally diverse for all of them simultaneously. And more generally, if a distribution maximizes $\alpha$-diversity for some $\alpha > 0$, then it maximizes $\alpha$-diversity for all $\alpha > 0$. It is unambiguously maximizing.

To make it true that the maximum value is independent of $\alpha$, we have to switch from diversity to cardinality. So, the result is that the maximal $\alpha$-cardinality is independent of $\alpha$.

Beware that I’ve shifted terminology: $\begin{matrix}   \\ This post (including this comment)&& The paper \\ \alpha && q \\ cardinality | |_\alpha&& diversity D_q \\ diversity D_\alpha && entropy H_q \\ entropy H_\alpha && —\\ \end{matrix}$ (The reason is that after I’d written this post, I had a conversation with a biologist, and did some more reading, and came away convinced that I’d used words in a bad way. Sorry for the confusion, but it needed to be done.)

Posted by: Tom Leinster on October 7, 2009 5:34 PM | Permalink | Reply to this

I linked to your wonderful arXiv paper from my facebook page (since I have over 400 “friends” there) stating:

Ecology and Math, together again. “The entropies that we consider can be used as measures of biodiversity. In that context, the question is: for a given collection of species, which frequency distribution(s) maximize the diversity? The theorem provides the answer. The chief surprise is that although we are dealing not just with a single entropy, but a one-parameter family of entropies, there is a single distribution maximizing all of them simultaneously.”

Posted by: Jonathan Vos Post on October 7, 2009 6:54 PM | Permalink | Reply to this

### Re: Maximizing diversity

Doh! How did I let you escape from Sheffield without explaining the proof to me first?

Posted by: Simon Willerton on October 7, 2009 10:34 PM | Permalink | Reply to this

### Re: Maximizing diversity

I left early in the morning, when no one was expecting it.

Here are what I see as the main points of the proof. I’ll use the terminology of the paper, as opposed to that of the original post.

1. Figure out how to maximize $D_0$, diversity of order zero. This is harder than it sounds, because $D_0$ isn’t continuous on the space $\Delta_n$ of probability distributions. It is continuous on the interior of $\Delta_n$, and we can use calculus to help us there; but things get tricky around the boundary.
2. Show that diversity decreases with its order. That is, for a given similarity matrix $Z$ and distribution $\mathbf{p}$, the diversity $D_q^Z(\mathbf{p})$ of order $q$ decreases with $q$.
3. Classify the ‘invariant’ distributions, as in Fact 2 of the post above. (This is where the links to weightings and cardinality are made.) It turns out that maximizing distributions are always invariant.

Point 2 somehow lets us avoid ever working with the formula for $D_q$ except in the cases $q = 0$ and $q = \infty$.

There are various complicating factors in the proof. I hope they’ll be handled more smoothly in some future version.

Posted by: Tom Leinster on October 7, 2009 11:33 PM | Permalink | Reply to this

### Re: Maximizing diversity

Question from the peanut gallery…

Do you ever let yourself get distracted by wondering about the “meaning” of this result? I didn’t follow completely, but it struck a chord and my mind started racing to places that I’ll avoid enunciating here for the moment :)

Do you ever think about possible philosophical, natural, universal meanings of this? Or is it just a curious math problem? Is there a bigger picture lurking behind the scenes?

The concept of diversity obviously has applications in biology, but I can’t help think that it goes even deeper than that.

Posted by: Eric Forgy on October 7, 2009 11:32 PM | Permalink | Reply to this

### Re: Maximizing diversity

I sense you have something in mind… I’d be interested to know what it is!

At the moment I don’t think I understand the implications of this result — or more specifically, the ‘surprising’ fact that the maximizing distribution doesn’t depend on the parameter $q$ (a.k.a. $\alpha$) that you’ve chosen. I wish I could tell some story that made it seem plausible, but I can’t.

Certainly the concept of diversity isn’t specific to ecology. Perhaps it would be helpful to start thinking about it in some other settings, in the hope of finding an angle from which the ‘surprise’ seems unsurprising.

Posted by: Tom Leinster on October 8, 2009 2:21 AM | Permalink | Reply to this

### Maximizing communication efficiency

Hi Tom,

one can’t but notice that evidently you are seriously interested in exposing substantial technical material to an online public here.

As one of your readers who is busy with plenty of other things but very interested in following with at least half an eye what you are developing here, I can give you the empirical observation that it is hard to follow the details.

Don’t misunderstand me: I am sure if I sat down and went carefully through your various posts and comments, I could bring myeself back to the point where I know precisely what it is you are talking about.

But, being busy myself with lots of other things, I don’t do that. I expect many other potential users of your stuff will feel similarly. Scattered blog posts/comments (and they are inevitably scattered) just don’t happen to be a good source for coherently taking in nontrivial chunks of information. They are good for carrying pointers to other sources that do.

And now comes the inevitable line that you are all grinningly awaiting me to say, but I’ll say it anyway:

we created the $n$Lab for precisely situations like this. For such developing and exposing of ideas and work just as you are doing here now.

I, as one of your potential readers, wish at this point that the definitions, lemmas and proofs that you are talking about here were collected and developed on some $n$Lab page. Because that way I would know exactly where to go to find collected all the information I need to remind myself to bring me up to speed.

I wouldn’t say that if I had the impression that you’d rather not expose unpublished research material to the public. But since you are anyway going through the considerable trouble of typing all this information here into blog comments, I see that you have lots of energy and intention to do just that, but then I find it unfortunate that it seems that with exactly the same energy invested you could have a much more efficient result, as far as reader-response goes, if you typed all this into an $n$Lab entry instead. Of course this could be done in parallel to the announcements here, so that we have both: update alerts and a place where the material develops stably and coherently.

Have you thought about that? If you did and have reasons for not doing it, I’d be seriously interested in hearing them. Maybe we can do something about it!

Posted by: Urs Schreiber on October 8, 2009 2:02 PM | Permalink | Reply to this

### Re: Maximizing communication efficiency

Posted by: Tom Leinster on October 12, 2009 12:12 PM | Permalink | Reply to this

### Re: Maximizing diversity

Looks good Tom! I wonder if there’s any way of understanding the normal distribution $N(0, 1)$ on $\mathbb{R}$ as a maximum entropy distribution.

I know your idea of taking a sequence of finite sets of a metric space, whose limit is dense, only should work for compact sets. Might there be a way of putting a distance on $[-n, n]$ such that the maximally diverse distribution is close to a normal distribution, and such that in the limit the normal distribution appears.

There is that idea of – what was it? – the normal distribution arising from the limit of projections of the uniform distribution over the $n$-sphere of radius $\sqrt n$ onto 1 dimension.

Then perhaps you could see the normal distribution as ‘uniform’ for a certain space.

Posted by: David Corfield on October 8, 2009 5:30 PM | Permalink | Reply to this

### Re: Maximizing diversity

Yeah. The inverse problem is interesting too. Since we have discrete species, it might be better to ask about binomial distributions (which go to the normal distribution in the continuum limit).

For what kind of species would a binomial distribution be the maximum diversifying?

Posted by: Eric Forgy on October 8, 2009 6:56 PM | Permalink | Reply to this

### Re: Maximizing diversity

Then perhaps you could see the normal distribution as ‘uniform’ for a certain space.

I forget the details (and don’t feel like working them out right now - differential geometry is not my strong suit), but there’s some pretty straightforward Riemannian metric on $\mathbb{R}^n$ such that the induced volume is Gaussian measure. I guess it must be induced by some natural homeomorphism between $\mathbb{R}^n$ and $S^{n-1}\setminus \{*\}$. So this suggests the “certain space” should just be the circle.

Posted by: Mark Meckes on October 8, 2009 6:58 PM | Permalink | Reply to this

### Re: Maximizing diversity

Coincidentally, I am listening to a live (free) online conference sponsored by Matlab.

The current (live) talk, “Managing Diversification”, is discussing entropy and diversification. Very interesting. Have a look :)

Note: This is NOT what I had in mind, but the relations between physics and finance help explain my career change.

Posted by: Eric Forgy on October 8, 2009 5:37 PM | Permalink | Reply to this

### Re: Maximizing diversity

In case anyone is interested, here is a link to the paper:

Managing Diversification

Attilio Meucci Bloomberg ALPHA, Portfolio Analytics and Risk

March 12, 2009

Abstract: We propose a unified, fully general methodology to analyze and act on diversification in any environment, including long-short trades in highly correlated markets. First, we build the diversification distribution, i.e. the distribution of the uncorrelated bets in the portfolio that are consistent with the portfolio constraints. Next, we summarize the wealth of information provided by the diversification distribution into one single diversification index, the effective number of bets, based on the entropy of the diversification distribution. Then, we introduce the mean-diversification efficient frontier, a diversification approach to portfolio optimization. Finally, we describe how to perform mean-diversification optimization in practice in the presence of transaction and market impact costs, by only trading a few optimally chosen securities. Fully documented code illustrating our approach can be downloaded from MATLAB Central File Exchange

Keywords: entropy, mean-diversification frontier, transaction costs, market impact, selection heuristics, systematic risk, idiosyncratic risk, principal component analysis, principal portfolios, r-square, risk contributions, random matrix theory

JEL Classifications: C1, G11 Working Paper Series

Posted by: Eric Forgy on October 8, 2009 5:51 PM | Permalink | Reply to this

### Re: Maximizing diversity

By the way, the relationship to Tom’s work (in my opinion) is that various types of financial risk can be thought of as various “species”. To diversify a portfolio, you want as many distinct “financial species” as possible.

The author also talks about “effective number of bets” which reminds me of Tom’s stuff on cardinality. Even if you have made many different investments, those may all be the same “financial species” so you are not actually diversified and the effective number of bets (cardinality) is “1”. If all your $N$ investments are distinct “financial species” the effective number of bets is “$N$”.

Posted by: Eric Forgy on October 8, 2009 6:12 PM | Permalink | Reply to this

### Re: Maximizing diversity

Oh, that’s interesting. (I somehow missed your comments at the time. It was my birthday.) Thanks.

There’s another ‘effective number’ that might be related: effective number of samples in statistics. The idea is that if, say, you want to find the average height of men in Borneo, and you only have the resources to take 10 samples, then you’d be very unwise to take all your samples from the same extended family. Instead, of course, you should choose your 10 men in as independent a way as possible.

In the case of taking them all from the same family, your ‘effective’ number of samples would be much less than 10. If you choose the 10 men in a very independent way then your effective sample size is close to 10.

Formally, what this amounts to is replacing the similarity matrix $Z$ with the matrix of correlation coefficients. This matrix is always positive semidefinite (which is convenient) and its entries lie in $[0, 1]$. The new thing, which we don’t see when measuring the cardinality of a metric space or category, is that some entries might be negative.

Posted by: Tom Leinster on October 22, 2009 7:09 PM | Permalink | Reply to this

### Re: Maximizing diversity

Well happy belated birthday! :)

Are you sure about the range? Correlation coefficients are generally in the range $[-1,1]$.

By the way, I was pretty happy with a recent “phynance” post of mine I’ve contemplated writing up and submitting to a journal for a few years now, but decided to go ahead and blog it.

Einstein meets Markowitz: Relativity Theory of Risk-Return

There, the covariance matrix (correlation is derived from covariance) plays the role of the metric on a risk manifold.

I am becoming more suspicious that your weights are related to a metric. Here are some crude notes I’m working on:

Cardinality

My gut tells me all these concepts are closely related.

Posted by: Eric Forgy on October 22, 2009 7:37 PM | Permalink | Reply to this

### Re: Maximizing diversity

Yes, sorry: when I wrote ‘$[0, 1]$’ I meant ‘$[-1, 1]$’. Hence the reference to negativity in the next sentence.

Posted by: Tom Leinster on October 22, 2009 8:52 PM | Permalink | Reply to this

### Re: Maximizing diversity

When the similarity matrix is positive definite with semi-positive entries, the inverse matrix contains negative entries. I wonder if the inverse could be interpreted as a covariance matrix or Gram matrix? Sorry. Just thinking out loud…

Posted by: Eric Forgy on October 22, 2009 10:40 PM | Permalink | Reply to this
Read the post The 1000th Post on the n-Category Café
Weblog: The n-Category Café
Excerpt: Meet our new hosts: Alex Hoffnung, Tom Leinster, Mike Shulman and Simon Willerton!
Tracked: November 13, 2009 3:32 AM
Read the post Magnitude of Metric Spaces: A Roundup
Weblog: The n-Category Café
Excerpt: Resources on magnitude of metric spaces.
Tracked: January 12, 2011 1:21 PM

Post a New Comment