## October 16, 2008

### Entropy, Diversity and Cardinality (Part 1)

#### Posted by David Corfield

Guest post by Tom Leinster

This is the first of two posts about a connection between the cardinality of metric spaces and the measurement of biodiversity.

The connection is provided by that important and subtle notion, entropy.

The ideas I’ll present depend crucially on the insights of two people. First, André Joyal explained to me the connection between cardinality and entropy. Then, Christina Cobbold told me about the connection between entropy and biodiversity, and suggested that there might be a direct link between measures of biodiversity and the cardinality of a metric space. She was more right than she knew: it turns out that the cardinality of a metric space, which I’d believed to be a new concept coming from enriched category theory, was discovered 15 years ago by ecologists!

Outline

Suppose you’re a scientist investigating the impact of human activities on the biodiversity of some particular ecosystem: say, the forests of Indonesia. To do this rigorously, you’ll need some way of quantifying biodiversity. In other words, you’ll need a way of taking a raw mass of ecological data and turning it into a single number.

Being an open-minded and well-educated scientist, you’d be happy in principle if your ‘number’ lay in some number system of a non-traditional kind (e.g. an abstract rig), but you see that there are certain advantages to sticking with the reals. For example, you can meaningfully say things like ‘the biodiversity of the Indonesian forests has fallen by 23% in 10 years’. We’ll stick to real values here.

It turns out that there are many sensible measures of biodiversity — ecologists have been debating their relative merits for years. Two of the most important aspects of an ecosystem that you might want a diversity measure to reflect are:

1. Abundance:  the proportions in which the species occur (e.g. 50% grass, 30% clover, 20% daisies)
2. Similarity:  the extent to which the species are related (e.g. an ecosystem made up of 50% snails and 50% slugs should probably be regarded as less diverse than one made up of 50% snails and 50% bats).

This first post will be about (1) only. We’ll look at a family of diversity measures taking only abundance into account, and use it to explore notions of entropy and cardinality.

The second post will be about (1) and (2) together. There, metric spaces will make their entrance.

Diversity

Diversity is a widely applicable concept: just as you can talk about diversity of species in an ecosystem, you could also talk about diversity of types of rock on a mountain, words in a novelist’s work, etc. Nevertheless, I’ll continue to discuss it in the ecological setting. This is partly because it’s what I’ve thought about most, partly because biodiversity is important, and partly because it lends itself well to vivid imagery.

So let’s imagine an ecosystem in which $n$ species occur, in proportions $p_1, \ldots, p_n$ respectively. Thus, $p_1 + \cdots + p_n = 1$. I’ll refer to a finite family of non-negative reals summing to $1$ as a finite probability space. (This is a slight abuse of terminology, but never mind.) In the simple example above where the ecosystem consisted of just grass, clover and daisies, we had $n = 3$ and $(p_1, p_2, p_3) = (0.5, 0.3, 0.2)$.

Some decision has to be made about how exactly the proportions, or ‘relative abundances’, $p_i$ are measured. It could be done according to the number of individuals of each species, or the total mass of each species (so that an ant counts for less than an antelope), or any other measure thought to be helpful. I’ll assume that this decision has been made.

Our task now is to turn the probability space $(p_1, \ldots, p_n)$ into a single real number, representing the ‘diversity’ of the system. Here are the three ways of doing it most popular among ecologists.

0. Species richness  Just count the number of species present. It’s best to then subtract $1$, since an ecosystem containing only one species (e.g. a field containing only wheat) is usefully thought of as having zero diversity. (Ecosystems containing no species at all are off the scale; recall the axiom that $\sum p_i = 1$.) So with the notation above, the species richness is defined to be $n - 1$.

This measure is not only crude, but also, statistically, very sensitive to sample size. In an ecosystem where most species are rare, even a large sample may fail to detect many species and so drastically underestimate the species richness.

1. Shannon entropy  Its relevance to ecology has been described as ‘tenuous’; nevertheless, it is one of the most widely-used measures of biodiversity. The Shannon entropy, or information entropy, or information diversity, of the probability space $(p_1, \ldots, p_n)$ is $-\sum_{i = 1}^n p_i \log p_i.$ We use the convention that $x \log x = 0$ when $x = 0$, since $\lim_{x \rightarrow 0} x \log x = 0$. (Alternatively, as every category theorist knows, $0^0 = 1$, so $0\,\log\,0 = \log\, 0^0 = \log\, 1 = 0$.)
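In code, the definition (with the $0 \log 0 = 0$ convention) is a one-liner. Here is an illustrative sketch in Python, using natural logarithms:

```python
from math import log

def shannon_entropy(p):
    """Shannon entropy -sum p_i log p_i (natural log), with 0 log 0 = 0."""
    return -sum(x * log(x) for x in p if x > 0)
```

For the grass/clover/daisies example $(0.5, 0.3, 0.2)$, this gives roughly $1.03$.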

I’ll have much more to say about entropy later. For now, let’s just record some of its basic properties.

First, Shannon entropy is always non-negative. Second, it’s zero if and only if some $p_i$ is $1$ and the rest are $0$. In other words, when the ecosystem is made up entirely of one species, diversity is zero. Third, entropy/diversity is maximized (for a fixed $n$) when all the species occur in equal abundance: $p_1 = \cdots = p_n = 1/n$. In that case, the entropy is $\log n$.

2. Simpson diversity  Another simple diversity measure born in the 1940s is Simpson diversity, $1 - \sum_{i = 1}^n p_i^2.$ Like Shannon entropy, it’s always non-negative, it’s zero if and only if some $p_i$ is $1$, and it’s maximized (for fixed $n$) when $p_1 = \cdots = p_n = 1/n$. In that case, its value is $1 - 1/n$; and as we would expect of a measure of diversity, this is an increasing function of $n$.

Simpson diversity has the advantage of being quadratic, which makes it amenable to methods of multilinear algebra. It also has certain statistical advantages (such as the existence of an unbiased estimator). And as these notes point out, it’s the probability that two randomly-chosen individuals are of different species.
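That probabilistic interpretation is easy to check directly: $1 - \sum_i p_i^2 = \sum_{i \neq j} p_i p_j$, since the $p_i$ sum to $1$. A quick Python sketch:

```python
def simpson(p):
    """Simpson diversity: the probability that two randomly-chosen
    individuals belong to different species."""
    return 1 - sum(x**2 for x in p)

p = [0.5, 0.3, 0.2]
# Directly sum the probability p_i * p_j over all ordered pairs of
# distinct species; this agrees with 1 - sum of squares.
pairwise = sum(pi * pj for i, pi in enumerate(p)
                       for j, pj in enumerate(p) if i != j)
assert abs(simpson(p) - pairwise) < 1e-12
```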

I learned about diversity measures from

Carlo Ricotta, Laszlo Szeidl, Towards a unifying approach to diversity measures: Bridging the gap between the Shannon entropy and Rao’s quadratic index, Theoretical Population Biology 70 (2006), 237–243

and Chapter 7 of

Russell Lande, Steinar Engen, Bernt-Erik Sæther, Stochastic Population Dynamics in Ecology and Conservation, Oxford University Press, 2003

(both courtesy of Christina Cobbold). Unfortunately there seems to be no free copy of either online, but the Wikipedia article on diversity measures gives some basic information.

Bringing them all together  The crucial observation now is that all three of these useful diversity measures are members of a continuous, one-parameter family of measures.

To understand this, it can help to think in terms of ‘surprise’. How surprised would you be if you found an ant on an ant farm? Not surprised at all. How surprised would you be if you found a Yangtze river dolphin in the Yangtze? Sadly, you should be extremely surprised. A surprise function $\sigma$ assigns to each probability $p \in [0, 1]$ a degree of surprise $\sigma(p) \in [0, \infty]$; we require $\sigma$ to be decreasing (so that more probable events are less surprising) and satisfy $\sigma(1) = 0$ (so that the occurrence of an event of probability $1$ is no surprise at all).

From any surprise function, you can obtain a measure of diversity. If the surprise function is called $\sigma$, the diversity measure assigns to a probability space $(p_1, \ldots, p_n)$ the quantity $\sum_{i = 1}^n p_i \sigma(p_i)$ — the expected surprise. ‘Expected surprise’ might sound paradoxical, but it’s not. How surprised do you expect to be tomorrow? Personally, I expect to be mildly surprised but not astonished: that’s what most of my days are like. In an ecosystem containing only one species, you’ll never be at all surprised at what you find, so your expected surprise is $0$; correspondingly, $\sum p_i \sigma(p_i) = 0$. On the other hand, imagine picking individuals at random from an ecosystem containing $10$ species in equal proportions: then you’ll always be a bit surprised at what you find. (Sometimes, as here, the ‘surprise’ metaphor seems a bit strained. You can think instead of related concepts such as unpredictability, rarity, or information content.)

Now let’s define that one-parameter family of diversity measures. First define, for each $\alpha \geq 0$, a surprise function $\sigma_\alpha: [0, 1] \to [0, \infty]$ by $\sigma_\alpha(p) = \begin{cases} \frac{1}{\alpha - 1} (1 - p^{\alpha - 1}) & \text{if } \alpha \neq 1 \\ - \log p & \text{if } \alpha = 1. \end{cases}$ The reason for the second clause is that the first doesn’t make sense when $\alpha = 1$, but $\lim_{\alpha \to 1} \frac{1}{\alpha - 1} (1 - p^{\alpha - 1}) = - \log p$ (by l’Hôpital’s rule, or by evaluating $\lim_{r \to -1} \int_1^x t^r\, d t$ in two different ways). The resulting diversity measure $D_\alpha$ is given by $D_\alpha (p_1, \ldots, p_n) = \sum_{i = 1}^n p_i \sigma_\alpha(p_i) = \begin{cases} \frac{1}{\alpha - 1} \left(1 - \sum p_i^\alpha\right) & \text{if } \alpha \neq 1 \\ - \sum p_i \log p_i & \text{if } \alpha = 1. \end{cases}$ In particular,

• $D_0(p_1, \ldots, p_n) = n - 1$, the species richness
• $D_1(p_1, \ldots, p_n) = - \sum p_i \log p_i$, the Shannon entropy
• $D_2(p_1, \ldots, p_n) = 1 - \sum p_i^2$, the Simpson diversity.
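These three specializations can be checked against a direct transcription of $D_\alpha$ (a Python sketch, handling $\alpha = 1$ separately as in the definition, and assuming all $p_i > 0$):

```python
from math import log

def D(p, alpha):
    """The diversity measure D_alpha of a finite probability space p.
    Assumes all entries of p are strictly positive."""
    if alpha == 1:
        return -sum(x * log(x) for x in p)  # Shannon entropy
    return (1 - sum(x**alpha for x in p)) / (alpha - 1)

p = [0.5, 0.3, 0.2]
assert abs(D(p, 0) - (len(p) - 1)) < 1e-12          # species richness n - 1
assert abs(D(p, 2) - (1 - sum(x**2 for x in p))) < 1e-12  # Simpson diversity
# D_alpha varies continuously through alpha = 1:
assert abs(D(p, 1 + 1e-8) - D(p, 1)) < 1e-4
```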

The measures $D_\alpha$ have good basic properties, at least for $\alpha > 0$. We always have $D_\alpha(p_1, \ldots, p_n) \geq 0$, with equality if and only if some $p_i$ is $1$. For fixed $n$, $D_\alpha(p_1, \ldots, p_n)$ is maximized when $p_1 = \cdots = p_n = 1/n$, and in that case its value is $\sigma_\alpha(1/n) = \begin{cases} \frac{1}{\alpha - 1} \left(1 - n^{1 - \alpha}\right) & \text{if } \alpha \neq 1 \\ \log n & \text{if } \alpha = 1. \end{cases}$ These expressions aren’t very convenient. We’ll fix that later.

This one-parameter family of measures appears to have been discovered several times over. According to the paper of Ricotta and Szeidl cited above, it was discovered in information theory —

J. Aczél, Z. Daróczy, On Measures of Information and their Characterizations, Academic Press (1975)

— then independently in ecology —

G.P. Patil, C. Taillie, Diversity as a concept and its measurement, Journal of the American Statistical Association 77 (1982), 548–567

— and then again independently in physics —

Constantino Tsallis, Possible generalization of Boltzmann–Gibbs statistics, Journal of Statistical Physics 52 (1988), 479–487.

(I haven’t looked up these sources.) Accordingly, people in different disciplines attribute it differently: physicists, for example, seem to call it Tsallis entropy.

Sometimes these diversity measures $D_\alpha$ are referred to as ‘entropy’ of degree $\alpha$. But I want to reserve the term ‘entropy’, as I’ll explain in a moment.

Entropy

There are many related quantities called ‘entropy’. The notion appears in physics, communications engineering, statistics, linguistics, dynamical systems, …, as well as ecology; it can be thought of as measuring disorder, information content, uncertainty, uniformity, diversity, ….

Stick out your arm in the $n$-Category Café and you’ll knock over the coffee of someone who knows more about entropy than I do. Witness, for instance, this learned conversation of Ben Allen and Chris Hillman; David, in his machine learning days, used exotic-sounding related concepts such as Kullback–Leibler divergence; John and Urs doubtless know all about entropy in physics. But for now, I’ll stick humbly to the example above: the Shannon entropy $-\sum p_i \log p_i$ of a finite probability space $(p_1, \ldots, p_n)$.

An important property of Shannon entropy is that it is log-like. In other words, let $A = (p_1, \ldots, p_n)$ and $B = (q_1, \ldots, q_m)$ be finite probability spaces. There is an obvious ‘product’ space, $A \times B = (p_1 q_1, \ldots, p_1 q_m, \ldots, p_n q_1, \ldots, p_n q_m),$ and the log-like property is this: writing $H$ for Shannon entropy, $H(A \times B) = H(A) + H(B).$ This is one of the things that makes $H = D_1$ more convenient than the other diversity measures $D_\alpha$ defined above: it is only for $\alpha = 1$ that this property holds. I’ll reserve the word ‘entropy’ for log-like measures.

(The proof that $H$ is log-like is a harmless exercise, depending on two things. One is that these are probability spaces: $\sum p_i = 1$. The other is that the function $f: x \mapsto -x\log x$ is a derivation: $f(x y) = f(x)y + x f(y)$.)
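The log-like property is also easy to verify numerically. A sketch in Python, with made-up three- and two-species ecosystems:

```python
from math import log
from itertools import product

def H(p):
    """Shannon entropy with the 0 log 0 = 0 convention."""
    return -sum(x * log(x) for x in p if x > 0)

A = [0.5, 0.3, 0.2]
B = [0.6, 0.4]
AB = [a * b for a, b in product(A, B)]  # the product probability space
# log-like: H(A x B) = H(A) + H(B)
assert abs(H(AB) - (H(A) + H(B))) < 1e-12
```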

Aside: a question  Here’s something I’d like to understand. Given probability spaces $A$ and $B$ as above, and given $\lambda, \mu \geq 0$ such that $\lambda + \mu = 1$, there’s a new probability space $\lambda A + \mu B = (\lambda p_1, \ldots, \lambda p_n, \mu q_1, \ldots, \mu q_m)$ and we have $H(\lambda A + \mu B) = -(\lambda \log\lambda + \mu \log\mu) + \lambda H(A) + \mu H(B).$ More generally, there is a symmetric operad $C$ given by $C_n = \Delta^{n - 1} = \{ (p_1, \ldots, p_n) \in \mathbb{R}^n : p_i \geq 0, \sum p_i = 1 \}$ with composition as follows: if $A = (p_1, \ldots, p_n),   A_1 = (p_{1, 1}, \ldots, p_{1, k_1}),   \ldots,   A_n = (p_{n, 1}, \ldots, p_{n, k_n})$ then $A \circ (A_1, \ldots, A_n) = (p_1 p_{1, 1}, \ldots, p_1 p_{1, k_1}, \ldots, p_n p_{n, 1}, \ldots, p_n p_{n, k_n}).$ (A $C$-algebra might be called a ‘convex algebra’; for instance, any convex subset of a real vector space is naturally one.) We have $H(A \circ (A_1, \ldots, A_n)) = H(A) + \sum_{i = 1}^n p_i H(A_i).$ The question is: what’s going on here, abstractly? I’m imagining restating this equation in such a way that there is no mention of the $p_i$s, and thus, perhaps, finding a good characterization of entropy in operadic terms.
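The displayed identity $H(A \circ (A_1, \ldots, A_n)) = H(A) + \sum_i p_i H(A_i)$ can likewise be spot-checked numerically; here is a sketch with arbitrary made-up distributions:

```python
from math import log

def H(p):
    """Shannon entropy with the 0 log 0 = 0 convention."""
    return -sum(x * log(x) for x in p if x > 0)

A = [0.5, 0.5]
parts = [[0.7, 0.3], [0.2, 0.3, 0.5]]
# Operadic composition: scale each A_i by the corresponding p_i and concatenate.
composite = [p * q for p, Ai in zip(A, parts) for q in Ai]
lhs = H(composite)
rhs = H(A) + sum(p * H(Ai) for p, Ai in zip(A, parts))
assert abs(lhs - rhs) < 1e-12
```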

Cardinality

André Joyal explained to me that he likes to think of entropy as something like cardinality. More precisely, the exponential of entropy is like cardinality. To test out this point of view, let’s write $|A| = e^{H(A)} = \prod_{i = 1}^n p_i^{-p_i}$ for any finite probability space $A = (p_1, \ldots, p_n)$, and call it the cardinality of $A$.

(There are slightly different conventions on Shannon entropy. Because of its applications to digital communication, some people like to take their logarithms to base $2$; others use base $e$, as I’m doing here. Cardinality is independent of this choice.)

Now let’s translate the basic properties of entropy into properties of cardinality, to see if cardinality deserves its name. Since entropy is always non-negative, we always have $|A| \geq 1$. We have $|A| = 1$ if and only if some $p_i$ is $1$ and the rest are $0$, a situation that can be interpreted as

there is effectively only one species.

For a fixed $n$, the cardinality is maximized when $p_1 = \cdots = p_n = 1/n$, and in that case $|A| = n$; this situation can be interpreted as

all $n$ species are fully present.

Finally, the log-like property of entropy translates as $|A \times B| = |A| \times |B|$ — cardinality is multiplicative.

This is all very satisfactory. Our ‘cardinality’ has the properties that one might intuitively hope for. Furthermore, it corresponds very closely to a well-known and useful measure of diversity, the Shannon entropy $H = D_1$. But this is not the only useful measure of diversity: there are, at least, all the other measures $D_\alpha$ ($\alpha \geq 0$). Is there a corresponding notion of ‘$\alpha$-cardinality’ for every $\alpha$, with the same good properties?

The answer is yes, but that’s not quite obvious. It’s no use defining the $\alpha$-cardinality of $A$ to be $e^{D_\alpha(A)}$: since $D_\alpha$ is not log-like except when $\alpha = 1$, this $\alpha$-cardinality would not be multiplicative. So that wouldn’t be a useful definition.

Let’s take stock. We’ve fixed an $\alpha \geq 0$, and we’re trying to define, for each finite probability space $A = (p_1, \ldots, p_n)$, its ‘$\alpha$-cardinality’ $|A|_\alpha$. We want to do this in such a way that:

1. $|A|_\alpha$ is a function (preferably invertible) of $D_\alpha(A)$
2. $1 \leq |A|_\alpha \leq n$
3. $|A|_\alpha = 1$ if some $p_i$ is $1$
4. $|A|_\alpha = n$ if $p_1 = \cdots = p_n = 1/n$
5. $|A \times B|_\alpha = |A|_\alpha \times |B|_\alpha$.

I’ll skip some elementary steps here, but it’s not hard to see that these requirements (in fact, (1) and (4) alone) pretty much force the answer on us. It turns out that we need $D_\alpha$ and $|  |_\alpha$ to be related by the equation $D_\alpha(A) = \begin{cases} \frac{1}{\alpha - 1} \left(1 - |A|_\alpha^{1 - \alpha}\right) & \text{if } \alpha \neq 1 \\ \log |A|_1 & \text{if } \alpha = 1, \end{cases}$ and a small amount of elementary algebra then gives us the definition: $|A|_\alpha = \begin{cases} \left( \sum_{i = 1}^n p_i^\alpha \right)^{\frac{1}{1 - \alpha}} & \text{if } \alpha \neq 1 \\ \prod_{i = 1}^n p_i^{-p_i} & \text{if } \alpha = 1, \end{cases}$ the $\alpha$-cardinality of the finite probability space $A = (p_1, \ldots, p_n)$.

(The 1-cardinality is just the cardinality. Here at the $n$-Category Café, we’re used to the convention that ‘1-widget’ means the same as ‘widget’.)

It’s easy to confirm that properties (1)–(5) do indeed hold for $\alpha$-cardinality, for every $\alpha \geq 0$.
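In particular, multiplicativity (property (5)) comes down to $\sum_{i,j}(p_i q_j)^\alpha = \left(\sum_i p_i^\alpha\right)\left(\sum_j q_j^\alpha\right)$, and both it and the bounds in property (2) are easy to confirm numerically. A Python sketch (assuming all $p_i > 0$):

```python
from math import prod
from itertools import product

def card(p, alpha):
    """The alpha-cardinality |A|_alpha of a finite probability space.
    Assumes all entries of p are strictly positive."""
    if alpha == 1:
        return prod(x ** -x for x in p)
    return sum(x**alpha for x in p) ** (1 / (1 - alpha))

A = [0.5, 0.3, 0.2]
B = [0.6, 0.4]
AB = [a * b for a, b in product(A, B)]
for alpha in [0, 0.5, 1, 2, 3]:
    # property (5): cardinality is multiplicative
    assert abs(card(AB, alpha) - card(A, alpha) * card(B, alpha)) < 1e-9
    # property (2): 1 <= |A|_alpha <= n
    assert 1 <= card(A, alpha) <= len(A) + 1e-9
```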

Example: $\alpha = 0$  The diversity measure $D_0$ is very simple: $D_0(A) = n - 1$. (Recall that the motivation for subtracting $1$ was to make a one-species ecosystem have diversity zero.) The $0$-cardinality of $A$ is just $n$, the number of species — obviously a useful quantity too!

Example: $\alpha = 1$  This is the motivating example: $D_1$ is the Shannon entropy, and $1$-cardinality is cardinality.

Example: $\alpha = 2$  We saw that $D_2$ is Simpson diversity: $D_2(A) = 1 - \sum p_i^2$. The $2$-cardinality of $A$ is $|A|_2 = 1/\sum p_i^2,$ which is also often used as a measurement of diversity (and also sometimes called Simpson diversity).

Example: $\alpha \to \infty$  It’s easy to show that for any probability space $A$, we have $\lim_{\alpha \to \infty} D_\alpha(A) = 0$. However, if we work with cardinality rather than diversity, something more interesting happens: $\lim_{\alpha\to\infty} |A|_\alpha = 1/\max_{1 \leq i \leq n} p_i = \min_{1 \leq i \leq n} (1/p_i).$ This (or its reciprocal) is sometimes called the Berger–Parker index, and might as well be written $|A|_\infty$. It has all the good properties (2)–(5) of cardinalities.
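Numerically, the convergence to $1/\max_i p_i$ is visible already at moderate $\alpha$: as $\alpha$ grows, the largest $p_i$ dominates the sum $\sum p_i^\alpha$. A Python sketch:

```python
def card(p, alpha):
    """The alpha-cardinality for alpha != 1."""
    return sum(x**alpha for x in p) ** (1 / (1 - alpha))

A = [0.5, 0.3, 0.2]
# The largest p_i is 0.5, so |A|_alpha should approach 1/0.5 = 2.
for alpha in [2, 10, 100, 1000]:
    print(alpha, card(A, alpha))
assert abs(card(A, 1000) - 1 / max(A)) < 0.01
```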

Back to entropy

For some purposes it’s preferable to use a measure that’s log-like (as entropy is) rather than multiplicative (as cardinality is). For example, in information theory it’s natural to count how many bits are needed to encode a message, and that’s a log-like measure.

So, for any $\alpha \geq 0$, let’s define the $\alpha$-entropy of a finite probability space $A = (p_1, \ldots, p_n)$ to be $H_\alpha(A) = \log |A|_\alpha = \begin{cases} \frac{1}{1 - \alpha} \log \left( \sum p_i^\alpha \right) & \text{if } \alpha \neq 1 \\ - \sum p_i \log p_i & \text{if } \alpha = 1. \end{cases}$ The $\alpha$-entropy $H_\alpha$ is usually called the Rényi entropy of order $\alpha$. By construction, each $H_\alpha$ is log-like, takes its minimal value $0$ when some $p_i$ is $1$, and, for a fixed $n$, takes its maximum value $\log n$ when $p_1 = \cdots = p_n = 1/n$. The $1$-entropy $H_1$ is just the Shannon entropy $H$.
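A direct transcription, together with a numerical check that each $H_\alpha$ is log-like (Python sketch, assuming all $p_i > 0$):

```python
from math import log
from itertools import product

def renyi(p, alpha):
    """Renyi entropy of order alpha (natural log); assumes all p_i > 0."""
    if alpha == 1:
        return -sum(x * log(x) for x in p)
    return log(sum(x**alpha for x in p)) / (1 - alpha)

A = [0.5, 0.3, 0.2]
B = [0.6, 0.4]
AB = [a * b for a, b in product(A, B)]
for alpha in [0, 0.5, 1, 2, 3]:
    # log-like: H_alpha(A x B) = H_alpha(A) + H_alpha(B)
    assert abs(renyi(AB, alpha) - (renyi(A, alpha) + renyi(B, alpha))) < 1e-12
```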

Summary, and preview of Part 2

We’ve been discussing ecosystems, which for the purposes of this first post are simply finite sequences $(p_1, \ldots, p_n)$ of non-negative numbers summing to $1$. We’ve seen three families of measures of ecosystems, each indexed over non-negative real numbers $\alpha$:

• The diversity measures $D_\alpha$.  The values $\alpha = 0, 1, 2$ correspond to the most popular diversity measures in ecology. Generally, the measure $D_\alpha$ can be interpreted as ‘expected surprise’.
• The cardinalities $|  |_\alpha$.  These have excellent mathematical properties, e.g. the $\alpha$-cardinality of an $n$-species ecosystem is always between $1$ and $n$, and $\alpha$-cardinality is multiplicative.
• The Rényi entropies $H_\alpha$.  Again, these have excellent mathematical properties, and like Shannon entropy (the case $\alpha = 1$), they are all log-like.

The three ways of measuring are completely interchangeable: for a fixed $\alpha$, any of the three numbers $D_\alpha(A)$, $|A|_\alpha$ and $H_\alpha(A)$ can be derived from any of the others. The formulas above tell you how.

What’s next?

Well, a weak point in everything so far is the extremely crude modelling. We’ve taken the species in our ecosystem to form a mere set: two species are either equal or not, and that’s that. But when you think about biodiversity, about the variety of life, you instinctively grasp that there’s more to it: some species are quite similar, some very different. This should influence the measurement of diversity.

In the second post I’ll explain a way of building this into the model. Specifically, the collection of species in an ecosystem will be modelled as a metric space instead of a mere set. (A set can be regarded as a metric space in which every point is distance $\infty$ from every other point.) This will lead us into connections between biodiversity, entropy and the cardinality of metric spaces.

Posted at October 16, 2008 9:07 AM UTC


### Re: Entropy, Diversity and Cardinality (Part 1)

How surprised do you expect to be tomorrow? Personally, I expect to be mildly surprised but not astonished: that’s what most of my days are like.

Do you choose your surprise function so as to arrive at this expectation of mild surprise, or is it chosen independently? Imagine someone who finds life extraordinarily unsurprising or extraordinarily surprising. These would be, I think, unpleasant situations to be in, but would we not say that their surprise functions were not well-tuned?

Or perhaps we organise our lives so that for our chosen surprise function, life is as surprising as we desire. We opt for a quiet life or a hectic one, etc.

I would suspect that a bit of both occurs.

Another ‘pathology’ we might observe is someone taking an event of a certain kind to be much less probable than we consider it to be. Perhaps being amazed to find two people in a group of 26 having the same birthday.

Posted by: David Corfield on October 16, 2008 11:10 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

Now that you’ve brought this passage to everyone’s attention, I fully expect patrons of the Café to be jumping out from behind lamp-posts in an effort to astonish me.

Note: I fully expect it. It won’t work.

Posted by: Tom Leinster on October 16, 2008 6:36 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

Could you act at least mildly surprised then?

Posted by: Todd Trimble on October 17, 2008 1:10 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

Tom,
Lucid indeed. In fact, ready for prime time or at least posting to the arXiv.
Anyone else agree?

jim

Posted by: jim stasheff on October 17, 2008 2:24 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

I thought the $n$-Category Café was prime time.

But yeah: I hope Tom will eventually publish a paper touching on these ideas.

Posted by: John Baez on October 17, 2008 9:59 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

Something linked to what I said about surprise and having a probability distribution different from the world’s (assuming you’re not a follower of de Finetti – ‘probability does not exist’ – but even then there’s a way around) is the Kullback-Leibler divergence.

If the world is distributed according to $P$ and your surprise function is $\sigma(p) = -\log p$, then to minimise your expected surprise you should match your probability distribution to the world’s.

A measure of the distance of one distribution to another is how badly you do with respect to this minimum.

So, the Kullback-Leibler divergence from distribution $P$ to $Q$ over a finite set is

$KL(P \| Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$

This can be seen as your expected extra surprise due to your choosing $Q$ when the world is $P$. It’s a non-symmetric metric, but of course after Lawvere that doesn’t scare us.
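In code, the divergence is a short Python sketch (natural logs; assumes $Q(i) > 0$ wherever $P(i) > 0$):

```python
from math import log

def kl(p, q):
    """Kullback-Leibler divergence KL(P || Q), with the 0 log 0 = 0 convention."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.5]
Q = [0.9, 0.1]
assert kl(P, P) == 0.0                 # matching the world: no extra surprise
assert kl(P, Q) > 0                    # any mismatch costs expected surprise
assert abs(kl(P, Q) - kl(Q, P)) > 0.1  # and the divergence is not symmetric
```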

Posted by: David Corfield on October 17, 2008 8:58 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

I’ve been trying to understand Kullback–Leibler divergence, and wanted to interpret it as the ‘distance’ between two probability distributions on the space, as you suggest. It doesn’t bother me that it’s non-symmetric… but it does bother me that, according to the ‘Motivation, properties and terminology’ section of the wikipedia page, it doesn’t satisfy the triangle inequality.

Any thoughts?

Posted by: Tom Leinster on October 19, 2008 3:11 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

The KL divergence is in some ways more like the square of a distance with Pythagorean properties. This is the perspective of information geometry. (I’ve blogged about it, but most at my old blog, now unhosted.)

So imagine yourself believing distribution $Q$. Now you’re told some moment of the real distribution $P$, e.g., the expectation value. Now a so-called ‘exponential family’ passes through $Q$ ‘orthogonal’ to the family of distributions, containing $P$, of fixed expectation.

The two families are of complementary dimension, so meet in a single point, $Q^{'}$. I can think of updating my probability to this point of intersection, and I know wherever $P$ is in its family, I will have reduced my surprise by a fixed amount.

For any $P$ in that family of fixed expectation, the KL-divergence to the projection $Q^{'}$ added to the KL-divergence from $Q^{'}$ to $Q$ is equal to the KL-divergence on the ‘hypotenuse’, i.e., from $P$ to $Q$ directly.

Better then to think of the divergence as a squared distance. There must be a heap of references for this point of view, e.g., Nonextensive Pythagoras’ Theorem.

Posted by: David Corfield on October 19, 2008 4:30 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

Ah, OK. But is it actually the case that the square root of the KL divergence is a metric? Or is the intuition less precise than that?

(I tried to find out using the links you gave, and a little googling, but to no avail.)

Posted by: Tom Leinster on October 20, 2008 3:34 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

I’ve never seen it said anywhere that the square root of the KL-divergence satisfies the triangle inequality, but then I’ve never seen it said that it fails.

There’s some interesting material in a paper by Carlos Rodriguez where he writes:

Special cases of $\delta$-information-deviations have been discovered and re-discovered more, perhaps, than any other concept in the history of statistics. With the sole exception of $\delta = 1/2$, the $\delta$-deviations are not symmetric, and do not satisfy the triangular inequality so they don’t define distances, in the usual way. However, the above properties make them to be the only statistically meaningful way of measuring separation in the extended space of unnormalized probability distributions. The $\delta$-information-deviations are the one and only one measures that are positive definite, and preserve all the fundamental symmetries of statistical inference, i.e. invariant under coordinate transformations of both the data and the parameter space and invariant under sufficient reductions of the data. (p. 5)

The $\delta = 1/2$ deviation is

twice the square of the Hellinger distance, i.e. the familiar $L_2$ distance between wave functions.

Looking at the properties that the $\delta$-deviations satisfy (pp. 4-5), perhaps only ‘1. homogeneity’ suggests that it’s the divergence rather than square root which is more like a distance.

Posted by: David Corfield on October 20, 2008 9:32 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

There’s work in the Japanese school that argues that the KL-distance (and other information distances) should actually be thought of as “energy” or squared-distance rather than as distance. Some points in this regard:

* The KL distance satisfies a kind of Pythagoras theorem. More importantly, the Bregman distances (of which this is one) satisfy a generalized cosine rule. Recall that the standard cosine rule on Euclidean distances says that for a triangle with sides $a, b, c$, and angle $\theta$ between the sides of length $a$ and $b$, then

$c^2 = a^2 + b^2 - 2 a b \cos\theta.$

This formula generalizes to Bregman distances, as long as you’re willing to replace the last term (an inner product) by a “dualized” version of it.

This suggests that the correct way to think about KL is as a squared distance, in which case the lack of triangle inequality is not surprising.

* But what about the square root? Here’s what *is* known: if you construct the symmetrized version of the KL distance, termed the Jensen–Shannon distance ($JS(p, q) = \frac{1}{2}(KL(p, m) + KL(q, m))$, where $m = \frac{1}{2}(p + q)$), then this is also a Bregman divergence, AND its square root is a metric.

* Going back to the energy reference, there’s a way of interpreting information distances (such as the KL, and alpha-divergences in general) in the following manner: take a path in the manifold of distributions, and imagine you’re tracking a point that’s connected to a stiff spring in the “dual” manifold. Now compute the energy needed to move this point along the path in the primal manifold, while measuring spring stretch in the dual. That yields the information distance.
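The Jensen–Shannon construction is easy to sketch in Python, and the triangle inequality for its square root (a known theorem, as noted above) can be spot-checked numerically:

```python
from math import log, sqrt

def kl(p, q):
    """Kullback-Leibler divergence, with the 0 log 0 = 0 convention."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: average KL to the midpoint m = (p + q)/2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return (kl(p, m) + kl(q, m)) / 2

# Spot-check the triangle inequality for sqrt(JS) on three distributions:
P, Q, R = [0.9, 0.1], [0.5, 0.5], [0.1, 0.9]
assert sqrt(js(P, R)) <= sqrt(js(P, Q)) + sqrt(js(Q, R)) + 1e-12
```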

Posted by: Suresh Venkat on November 11, 2008 8:06 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

Stumbling across this conversation 14 years later, I’ll take the opportunity to answer the question David and I were discussing. The square root of KL-divergence (relative entropy) does not satisfy the triangle inequality. A simple counterexample can be found on page 77 of my book.
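For concreteness, here is a quick numerical illustration of the failure (not necessarily the book’s example): take $P = (1, 0)$, $Q = (\frac{1}{2}, \frac{1}{2})$ and $R = (10^{-5}, 1 - 10^{-5})$.

```python
from math import log, sqrt

def kl(p, q):
    """Kullback-Leibler divergence, skipping terms with p_i = 0."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [1.0, 0.0]
Q = [0.5, 0.5]
R = [1e-5, 1 - 1e-5]
# sqrt(KL) violates the triangle inequality: the direct 'distance' from P
# to R exceeds the sum of the two legs P -> Q and Q -> R.
assert sqrt(kl(P, R)) > sqrt(kl(P, Q)) + sqrt(kl(Q, R))
```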

Posted by: Tom Leinster on October 31, 2022 2:34 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

The symmetrized KL-divergence is a metric for what it’s worth.

Posted by: AB on November 2, 2022 4:58 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

I’ve learned to accept that we don’t get a metric from the KL divergence that way (the ‘right’ way being the Fisher metric). But it’s definitely good to put these facts in places where people might see them!

Wikipedia would be even better….

Posted by: John Baez on November 2, 2022 12:31 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

Did you know that surprise is measured in wows? For an account of why ordinary television provides more wows per second than white snow, take a look at this.

Posted by: David Corfield on October 17, 2008 9:24 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

Posted by: David Corfield on October 16, 2008 11:16 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

This is a fascinating and lucid post as always. I look forward to incorporating the word “biodiversity” into my mathematical vocabulary, e.g. “Sir, my main concern with the three-dimensional Riemannian manifold you have described is that it lacks biodiversity.”

Posted by: Bruce Bartlett on October 17, 2008 12:28 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

Thanks!

If you read the bit about Simpson diversity, you can also claim that you know something about panmictic heterozygosity.

Posted by: Tom Leinster on October 17, 2008 1:53 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

One of these days I shall get round to finishing the grant proposal for the project:

“Computing the genus of the Great Barrier Reef”

Anyone want to be a joint PI on this one?

Posted by: Andrew Stacey on October 17, 2008 8:20 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

I’m in. Does the funding include provision for Open Water Ones for the PI’s?

By the way, how is your Gimungous Genealogy coming along for Atiyah80? A friend of mine asks if you’re going to put pictures on the chart. Also, since he’s likely to be there (unlike me), I can venture the name of “Daniel Stevenson” as a student of Michael Murray in the bottom right of the diagram.

Posted by: Bruce Bartlett on October 20, 2008 6:24 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

Thought you might want to know that that map is appearing as a busted png in Firefox 3.0.3 on Windows XP for me.

Posted by: Josh W on November 12, 2008 12:56 AM | Permalink | Reply to this

### panmictic heterozygosity

Tom Leinster wrote,

If you read the bit about Simpson diversity, you can also claim that you know something about panmictic heterozygosity.

Of course, lots of interesting things happen when a population is not panmictic. For example, if your cicadas or your bacteria are spread out spatially across a region, then what happens at $\vec{x}$ doesn’t directly influence what happens at $\vec{y}$, and interesting kinds of heterogeneity can arise. A physicist might call this “the breakdown of the mean-field assumption”, language which has been embraced by at least some ecologists. See, e.g., Dieckmann, Law and Metz’s The Geometry of Ecological Interactions (2000), or Lion and van Baalen’s “Self-structuring in spatial evolutionary ecology” (2007).

This is the sort of research which lends itself readily to simulations and makes for pretty screensavers: ooh, look, the red dots are eating the green dots! Some people do predator/prey or pathogen/host models, while others implement “altruism” and “selfishness” as different strategies in a Prisoner’s Dilemma game (the winner when neighbours play each other gets to propagate). Attempts to derive useful analytic expressions involve tricks like summarizing the state of the ecosystem with probabilities, say $p_a$ and $p_s$ for the fractions of grid sites occupied by altruists and selfish individuals respectively. A variable like $q_{a|s}$ denotes the probability that a randomly chosen neighbour site of a randomly chosen selfish individual is occupied by an altruist. Then one writes horrendous differential equations in the $p$s and the $q$s, linearizes around a fixed point and investigates the stability of that fixed point under perturbations.
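To make the bookkeeping concrete, here is a rough Python sketch of how one might estimate $p_a$, $p_s$ and $q_{a|s}$ from a snapshot of a grid. The grid size, the random initial state, and all names are made up for illustration; a real model would run game-theoretic update rules on the grid first.

```python
import random

random.seed(0)          # reproducible toy example
N = 50                  # hypothetical grid size
EMPTY, ALTRUIST, SELFISH = 0, 'a', 's'

# Random initial grid: each site empty, altruist, or selfish
grid = [[random.choice([EMPTY, ALTRUIST, SELFISH]) for _ in range(N)]
        for _ in range(N)]

def neighbours(i, j):
    """Von Neumann neighbourhood with periodic (toroidal) boundaries."""
    return [((i - 1) % N, j), ((i + 1) % N, j),
            (i, (j - 1) % N), (i, (j + 1) % N)]

sites = N * N
p_a = sum(row.count(ALTRUIST) for row in grid) / sites
p_s = sum(row.count(SELFISH) for row in grid) / sites

# q_{a|s}: chance a random neighbour of a random selfish site is an altruist
pairs_sa = pairs_s = 0
for i in range(N):
    for j in range(N):
        if grid[i][j] == SELFISH:
            for (x, y) in neighbours(i, j):
                pairs_s += 1
                pairs_sa += (grid[x][y] == ALTRUIST)
q_a_given_s = pairs_sa / pairs_s
```

For an uncorrelated random grid like this one, $q_{a|s} \approx p_a$; the interesting spatial structure shows up precisely when the dynamics drive the two apart.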

I don’t really know much about diversity indices used in this kind of work.

Posted by: Blake Stacey on October 17, 2008 6:06 PM | Permalink | Reply to this

### Re: panmictic heterozygosity

Thanks, Blake. That’s interesting.

One thing I should perhaps have emphasized in my post is that (bio)diversity is a local or intensive property. For example, the diversity of species in a field is something that you should be able to measure by taking samples here and there (e.g. by throwing down quadrats, those metre-square frames) — it has nothing to do with how big the field is.

(Of course, you have to take enough samples that it’s representative. I hardly touched on statistical questions such as this.)

The local property of diversity was (very) implicit in the mathematical formalization that I used. We modelled an ecosystem as a finite sequence $(p_1, \ldots, p_n)$ of non-negative reals, representing the abundance of each species, adding up to $1$. In other words, we normalized so that the size of the ecosystem became irrelevant.
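As a tiny illustration of that normalization (a sketch, taking Shannon entropy as the diversity measure):

```python
from math import log

def shannon_entropy(counts):
    """Shannon entropy of an abundance vector, normalising counts
    to the proportions (p_1, ..., p_n) before applying the formula."""
    total = sum(counts)
    return -sum((c / total) * log(c / total) for c in counts if c > 0)

# Scaling every abundance by 10 changes the size of the ecosystem
# but not its diversity: only the proportions p_i matter.
h_small = shannon_entropy([50, 30, 20])     # about 1.0297
h_large = shannon_entropy([500, 300, 200])  # same value
```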

I only met the word panmictic very recently. It’s an adjective that can be applied to populations; the noun is panmixia, and it seems to have a couple of related meanings along the lines of ‘everything’s well-mixed’. The first meaning that I found was ‘everyone can mate with everyone else’. The second was ‘individuals can move around freely’, as if distance were no object. For many species, proximity is a necessary condition for mating, so the two meanings are connected. Am I understanding right, Blake?

I suppose there is some kind of assumption of panmixia, or homogeneity, behind everything I wrote in my post. After all, I wrote about ‘the diversity of a population’, and if diversity is a locally-determined quantity, this is like talking about ‘the temperature in Asia’.

Posted by: Tom Leinster on October 19, 2008 3:47 PM | Permalink | Reply to this

### Re: panmictic heterozygosity

Tom Leinster wrote,

The first meaning that I found was ‘everyone can mate with everyone else’. The second was ‘individuals can move around freely’, as if distance were no object. For many species, proximity is a necessary condition for mating, so the two meanings are connected. Am I understanding right, Blake?

That sounds right to me.

Posted by: Blake Stacey on October 19, 2008 5:37 PM | Permalink | Reply to this

### Panmixia; Re: panmictic heterozygosity

Panmixia is a theoretical extreme in the same way that porn is science fiction describing an alternate universe where creatures who appear quite human (or slightly exaggerated and lacking refractory period) have utterly different motivation, such that any sexual suggestion from anyone is accepted passionately by anyone else. In physiology, a refractory period is a period of time during which an organ or cell is incapable of repeating a particular action, or (more precisely) the amount of time it takes for an excitable membrane to be ready for a second stimulus once it returns to its resting state following an excitation.

Now that I have your attention, the place (historically) to start is:
Alfred Russel Wallace : Alfred Wallace : A. R. Wallace :
Russel Wallace : Alfred Russell Wallace (sic)

Panmixia and Natural Selection (S499: 1894)

Editor Charles H. Smith’s Note: A letter to the Editor printed in the 28 June 1894 number of Nature. Original pagination indicated within double brackets. To link directly to this page, connect with: http://www.wku.edu/~smithch/wallace/S499.htm

pan·mix·i·a
n.
Random mating within a breeding population.

ETYMOLOGY: New Latin : pan + Greek mixis, act of mingling (from mignunai, to mix).

panmixy, panmixia
n. interbreeding without limitation. panmictic, a.
Helicon Publishing LTD 2008.

Example in action:

Panmixia: an example from Dawson’s burrowing bee (Amegilla dawsoni) (Hymenoptera: Anthophorini)

Authors: BEVERIDGE, MAXINE; SIMMONS, LEIGH W.

Source: Molecular Ecology, Volume 15, Number 4, April 2006 , pp. 951-957(7)

Publisher: Blackwell Publishing

Abstract:
Dawson’s burrowing bee is a large, fast-flying solitary nesting bee endemic to the arid zone of Western Australia. In this study the population structure of the species was examined with molecular markers. Using eight microsatellite loci, we genotyped 531 adult female bees collected from 13 populations of Dawson’s burrowing bee, Amegilla dawsoni, across the species range. The mean number of alleles per locus ranged from 4 to 38 and expected heterozygosity was uniformly high with a mean of 0.602. Pairwise comparisons of FST among all 13 populations ranged from 0.0071 to 0.0122 with only one significant estimate and an overall FST of 0.001. The entire sample collection was in Hardy–Weinberg equilibrium and there was no evidence of inbreeding with a mean FIS of 0.010. The mating and nesting behaviour of this bee suggests that gene flow would be limited by monandry and the fact that almost 90% of females mate immediately on emergence. Nevertheless there is obviously sufficient gene flow to maintain panmixia, and we suggest that this results from infrequent and unreliable rainfall in the species range, which causes the bees to congregate at limited food resources, allowing a small number of unmated females from one emergence site to come into contact with males from another population. In addition, when drought eliminates food resources near an emergence site, the whole population may move elsewhere, increasing gene flow across the species range.

Keywords: Amegilla dawsoni; Dawson’s burrowing bee; Hymenoptera; microsatellite; panmixia; population genetics

Document Type: Research article

DOI: 10.1111/j.1365-294X.2006.02846.x

Posted by: Jonathan Vos Post on October 21, 2008 7:45 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

Tom wrote:

More generally, there is a symmetric operad $C$ given by $C_n = \Delta^{n - 1} = \{ (p_1, \ldots, p_n) \in \mathbb{R}^n : p_i \geq 0, \sum p_i = 1 \}$ with composition as follows: if $A = (p_1, \ldots, p_n),   A_1 = (p_{1, 1}, \ldots, p_{1, k_1}),   \ldots,   A_n = (p_{n, 1}, \ldots, p_{n, k_n})$ then $A \circ (A_1, \ldots, A_n) = (p_1 p_{1, 1}, \ldots, p_1 p_{1, k_1}, \ldots, p_n p_{n, 1}, \ldots, p_n p_{n, k_n}).$ (A $C$-algebra might be called a ‘convex algebra’; for instance, any convex subset of a real vector space is naturally one.)

The equations you just wrote down will be terrifying to mere mortals, so I think it’s good to add: ‘a $C$-algebra is just a set where you can take convex linear combinations of elements, and the obvious rule applies when you take a convex linear combination of convex linear combinations.’

James Dolan has thought a lot about these gadgets, which he calls merely ‘convex spaces’. They’re close relatives of ‘affine spaces’, where you drop the condition that $p_i \ge 0$. And these, in turn, are close relatives of ‘vector spaces’, where you also drop the condition that $\sum_i p_i = 1$.

In particular, every vector space has an underlying affine space, which has an underlying convex space, which has an underlying set. Moreover, these forgetful functors have interesting left adjoints. In particular, the free convex space on a set is the set of probability measures on that set.

Technical note: here ‘measures’ are defined in terms of the $\sigma$-algebra generated by finite subsets. This is no big deal if we’re talking about probability measures on finite sets, but real-world probability theorists often use probability measures on infinite sets… and then this choice of $\sigma$-algebra is a bit annoying.

Of course, you can also define convex spaces using a monad instead of an operad! This lets us generalize things in a nice way described by David. Namely: we consider a monad not on the category of sets, but on the category of Polish spaces. These are certain well-behaved measurable spaces, which I used to fear, but have come to love.

The question is: what’s going on here, abstractly? I’m imagining restating this equation in such a way that there is no mention of the $p_i$s, and thus, perhaps, finding a good characterization of entropy in operadic terms.

Maybe I’ll ask Jim if he’s thought about abstract characterizations of Shannon entropy in this language. There are lots of ways of deriving the usual formula for Shannon entropy from a list of axioms — all of which I forget. David knows about these. They might help.

Posted by: John Baez on October 19, 2008 5:40 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

John wrote:

The equations you just wrote down will be terrifying to mere mortals, so I think it’s good to add: ‘a $C$-algebra is just a set where you can take convex linear combinations of elements, and the obvious rule applies when you take a convex linear combination of convex linear combinations.’

I agree that this is a good informal description. But I think you’re missing a subtlety in some of the other things you wrote.

What we both have in mind is this. Let $X$ be a convex subset of $\mathbb{R}^n$ (or any other real vector space). Then for every $n$-tuple $(p_1, \ldots, p_n)$ of non-negative reals summing to $1$, there’s an induced operation $h_{p_1, \ldots, p_n}: X^n \to X,    (x_1, \ldots, x_n) \mapsto p_1 x_1 + \cdots + p_n x_n.$ These operations will satisfy various axioms, such as the one you referred to about convex linear combinations of convex linear combinations. You might define a convex space to be a set $X$ equipped with, for each such $n$-tuple $(p_1, \ldots, p_n)$, a map $h_{p_1, \ldots, p_n}: X^n \to X$ satisfying all the axioms that are satisfied when $X$ is a convex subset of $\mathbb{R}^n$.

The subtlety is that a convex space is not then the same thing as a $C$-algebra. That’s because in a convex subset of $\mathbb{R}^n$, we have equations such as $\frac{1}{2} x + \frac{1}{2} x = x,$ and so in a convex space, we have equations such as $h_{\frac{1}{2}, \frac{1}{2}} (x, x) = x.$ However, this equation is not satisfied in a $C$-algebra, in general. (The repeated variable $x$ gives a strong hint that there’s no operad in the world whose algebras are exactly convex spaces.)

If I’m understanding things correctly, you’re using the term ‘convex space’ in two conflicting ways. On the one hand, you wrote:

James Dolan has thought a lot about these gadgets [$C$-algebras], which he calls merely ‘convex spaces’.

On the other, you wrote:

the free convex space on a set is the set of probability measures on that set

and go on to clarify that you’re talking about something like probability measures ‘with finite support’ — I think I know what you mean, anyway. But in order for that to be true, a ‘convex space’ in your/Jim’s usage had better be what I called a ‘convex space’ just now — and that’s not the same thing as a $C$-algebra.

I’m guessing that you do want equations such as $\frac{1}{2} x + \frac{1}{2} x = x$, in your definition of convex space. It certainly looks as if that would provide a better fit with ideas on probability monads.

Posted by: Tom Leinster on October 19, 2008 10:21 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

Is the difference something to do with the fact that the Giry monad is used to compose conditional distributions? So I have 3 spaces $A$, $B$, and $C$, and I want a way to compose $P(A|B)$ with $P(B|C)$, and can do so in the Kleisli category.

Tom’s case seemed to concern a space $A$, and for each $a \in A$, a space $B_a$. Then we’re looking to produce a distribution over the union of $B_a$, from distributions over $A$ and each $B_a$. Rather as though we had classified the world into genera, then find that each genus divides further into sub-genera.

On the other hand, a distribution over $A$ is a conditional probability $P(A|1)$, and we could bundle the distributions over the $B_a$ as a single conditional probability $P(\cup_a B_a | A)$. Then compose as above.

Posted by: David Corfield on October 20, 2008 10:46 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

Continuing this line of thought, the conditional entropy is defined for two random variables, $X$ and $Y$ as

$H(Y | X) = \sum_x p(x) H(Y | X = x).$

In a sense, then, to a conditional distribution, $P(Y | X)$, or if you like an arrow in the Kleisli category of the Giry monad, there is associated a map from distributions over the domain of $X$ to the reals.
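Concretely, in Python (with a made-up joint distribution, stored as $p[x][y]$):

```python
from math import log

def H(ps):
    """Shannon entropy of a finite distribution."""
    return -sum(p * log(p) for p in ps if p > 0)

# Hypothetical joint distribution p(x, y)
joint = {'x1': {'y1': 0.2, 'y2': 0.2},
         'x2': {'y1': 0.5, 'y2': 0.1}}

def conditional_entropy(joint):
    """H(Y|X) = sum_x p(x) H(Y | X = x)."""
    total = 0.0
    for x, row in joint.items():
        p_x = sum(row.values())                         # marginal p(x)
        total += p_x * H([v / p_x for v in row.values()])
    return total
```

Here the marginal distribution of $X$ is baked into `joint`; splitting it out as a separate argument would give exactly the map from distributions over the domain of $X$ to the reals described above.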

Posted by: David Corfield on October 22, 2008 9:10 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

Here’s a point of view on this. It involves the relationship between Lawvere theories and operads. Both are ways of describing structures borne by sets. Lawvere theories are based on categories with finite products, and operads on symmetric monoidal categories. Every category with finite products is also a symmetric monoidal category, so Lawvere theories give rise to operads, and in a simple way: given a theory $T$, the underlying operad $UT$ has $UT_n=T(n,1)$, the set of morphisms in $T$ from $n$ to $1$ (which in turn is the set of $n$-ary operations).

This defines a functor $U:LawvereTheories\to Operads$ which has a left adjoint $F$. The left adjoint is slightly harder to describe, so I won’t (but it’s a piece of cake if you’ve learned to love your coends) but it’s actually better behaved with respect to algebras. For an operad $P$, the $P$-algebras are the same as the $FP$-models. But for a theory $T$, the $T$-models are generally NOT the same as the $UT$-algebras, and this will be the main point.

There’s a Lawvere theory $T$ whose models are the “convex spaces”. Like any Lawvere theory, its objects are the natural numbers, and we take the set of morphisms from $m$ to $n$ to be the set $T(m,n)=C^n_m$ where $C_m$ comes from Tom’s operad. More explicitly, a morphism from $m$ to $n$ is an $n\times m$ matrix of non-negative reals, in which the sum of the entries of any row is 1. Composition is matrix multiplication. Tom’s operad $C$ is just $UT$.

Now a $T$-model is a convex space, but as I said before, this is not the same thing as a $UT$-algebra.

For an example of a $C$-algebra which is not a $T$-model (a convex space), take the algebra structure on $R$ I described here. In it, the thing one might be tempted to call $\tfrac{1}{2}x+\tfrac{1}{2}x$ is actually $x+\ln 2$.

Posted by: Steve Lack on October 21, 2008 8:33 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

An alternative way to think about what Steve wrote is in terms of monads.

Every operad $P$ gives rise to a monad $T_P$ on $Set$, whose algebras are exactly the $P$-algebras.

Every monad $T$ on $Set$ gives rise to an operad $P_T$, where $P_T(n) = T(\{x_1, \ldots, x_n\})$, the set of $n$-ary operations, or ‘words’ in $x_1, \ldots, x_n$. (Here the $x_i$ are ‘formal symbols’; the only important thing is that $\{x_1, \ldots, x_n\}$ is an $n$-element set.) Composition is by substitution of words; you can write that down in terms of the monad structure of $T$.

This defines functors $F: Operads \to Monad(Set)$ and $U: Monad(Set) \to Operads$ where $F(P) = T_P$ and $U(T) = P_T$; and then $F$ is left adjoint to $U$.

I like to think of operads of the form $P_T$ as ‘word operads’. A good example comes from taking $T$ to be the monad for commutative rings. Then $P_T(n) = \mathbb{Z}[x_1, \ldots, x_n]$ and composition is by substitution. An unexpected (?) example comes from taking $T$ to be the identity monad; then $P_T(n) = n$ (an $n$-element set), and I’ll leave you to figure out how composition works. I say ‘unexpected’ because this operad is rather simple, but I never heard about it from anyone else. There’s a bit more about ‘word operads’ in Example 2.2.11 of my book.

All of the above remains true if you replace ‘monad’ by ‘finitary monad’ throughout. Under the equivalence between finitary monads and Lawvere theories, the story then becomes the same as what Steve said.

Posted by: Tom Leinster on October 21, 2008 11:27 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

I learned to love my coends; please write down the left adjoint to $U$ :)

Posted by: Fosco Loregian on September 30, 2016 5:14 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

It’s tough to keep up with the pace of this conversation, but I’ll try.

The left adjoint to $U$ is, as it says here, the functor from operads to monads given by $P \mapsto T_P$. Here $T_P$ is the monad on $Set$ given, for $X \in Set$, by

$T_P(X) = \int^n P(n) \times X^n.$

It was symmetric operads we were talking about, right? If so, this coend is over the category of finite sets and bijections (or a skeleton thereof). If we were dealing with non-symmetric operads, then the coend would be over the discrete category of natural numbers—which means that it’s just a coproduct.

Posted by: Tom Leinster on September 30, 2016 5:55 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

In the middle of the night after posting my remarks I realized that ‘convex spaces’, ‘affine spaces’ and (most famously) ‘vector spaces’ cannot be defined as algebras of an operad, but rather as algebras of a Lawvere theory.

So, throughout my remarks please cross out ‘operad’ and insert ‘Lawvere theory’. And, let’s not use ‘convex space’ as a synonym for Tom’s notion of ‘$C$-algebra’.

By now you guys have straightened it all out, but anyway: I agree.

And just to make the obvious more so: there is no algebraic theory for Giry’s generalization of convex spaces: if you tried to create such an algebraic theory, you’d find yourself needing infinitary operations. Namely, integrals instead of sums.

(Right now I’m working on categories that have ‘direct integrals’ instead of just ‘direct sums’, so this flavor of infinitary operation is very much on my mind.)

Posted by: John Baez on October 21, 2008 10:00 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

G’day Tom. This is very interesting. I’m looking forward to the next post.

Regarding the operad question, here’s a possible abstract account.

First of all, there is a structure of C-algebra on the set $R$ of real numbers. (You may restrict to non-negative reals if you like.) This is an action $C\circ R\to R$ which, given an $A\in C_n$ and an n-tuple $(x_1,\ldots,x_n)$ of reals, outputs the real $H(A)+\sum^n_{i=1}p_i x_i.$

We now need to beef this up just slightly. An operad, as you know, is a monoid in the monoidal category [P,Set] with respect to the substitution product $\circ$. Any set $X$ can be regarded as a constant functor $X:P\to Set$. An algebra structure on $X$ for the operad $C$ gives an action of the monoid $C$ on the object $X$ of [P,Set]. In particular, we have the object $R:P\to Set$ of [P,Set], which is constant at the set of reals, and it has an action $C\circ R\to R.$ Your entropy function is a map $H:C\to R$ in [P,Set], and the equation $H(A(A_1,\ldots,A_n))=H(A)+\sum^n_{i=1}p_i H(A_i)$ amounts to the commutative diagram which asserts that $H$ respects the $C$-actions on $C$ and $R$. (Maybe someone more experienced with iTeX than I am can draw this. It’s a commutative square, with the two paths going from $C\circ C$ to $R$, and with the other two vertices being $C\circ R$ and $C$.)

Because C is free as a C-algebra, constructing a C-homomorphism from C to R is very easy, once we know that R is a C-algebra. So from this point of view the more basic fact is that R is a C-algebra. (On the other hand, this is no easier to check than your equation; in fact slightly harder.)
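The equation $H(A(A_1,\ldots,A_n))=H(A)+\sum^n_{i=1}p_i H(A_i)$ is easy to spot-check numerically. A throwaway Python verification, with function names and sample distributions of my own choosing:

```python
from math import log

def H(A):
    """Shannon entropy of a finite distribution."""
    return -sum(p * log(p) for p in A if p > 0)

def compose(A, As):
    """Operadic composition in C: substitute the distributions As into A,
    multiplying each inner probability by the corresponding outer weight."""
    return tuple(p * q for p, A_i in zip(A, As) for q in A_i)

A  = (0.2, 0.8)
A1 = (0.5, 0.5)
A2 = (0.1, 0.3, 0.6)

lhs = H(compose(A, [A1, A2]))
rhs = H(A) + sum(p * H(Ai) for p, Ai in zip(A, [A1, A2]))
# lhs equals rhs up to rounding: H respects the C-action on R
```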

Posted by: Steve Lack on October 20, 2008 2:45 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

First of all, I should have mentioned that the expression $H(A)+\sum^n_{i=1}p_ix_i$ involved in the definition of the action $C\circ R\to R$ can also be written as $\sum^n_{i=1}p_i(x_i-\ln p_i)$ so that if we set $x_i=\ln y_i$ then this becomes $-\sum^n_{i=1}p_i\ln\frac{p_i}{y_i}$ which now looks very like the KL-divergence, except that the $y_i$ are arbitrary positive numbers, not required to sum to 1.

And in fact the KL-divergence itself fits in nicely, as follows. Remember that $C$ and $R$ both have actions of the monoid $C$. Now consider the product $C\times C$ (with the product action of $C$). Then KL-divergence is a C-homomorphism $KL:C\times C\to R$.

Posted by: Steve Lack on October 20, 2008 7:18 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

G’day Steve. Thanks for all the interesting thoughts.

There’s something I don’t understand about your comment on Kullback–Leibler divergence. I’ll try to explain…

There are at least two different $C$-algebras on $R$ ($=$ the reals or the non-negative reals, as preferred). First, there’s what I’ll call the simple $C$-algebra structure: for $A = (p_1, \ldots, p_n) \in C_n$ and $x_1, \ldots, x_n \in R$, $A(x_1, \ldots, x_n) = \sum_i p_i x_i$ where the ‘$A$’ on the left-hand side is shorthand for the action of $A$. Second, there’s what I’ll call the subtle $C$-algebra structure: $A(x_1, \ldots, x_n) = H(A) + \sum_i p_i x_i.$ (In fact, both are members of a one-parameter family of $C$-algebra structures on $R$: given a parameter $\lambda \in R$, there’s a $C$-algebra structure on $R$ obtained by changing ‘$H(A)$’ to ‘$\lambda H(A)$’ in the definition of the subtle structure. Then simple is the case $\lambda = 0$ and subtle is the case $\lambda = 1$.)

In your comments you talk about $P$, the skeletal category of finite sets and bijections (also often called $\mathbb{B}$). You use the fact that an operad is exactly a monoid in $[P, Set]$ with respect to the substitution tensor product $\circ$; and $C$, being an operad, therefore acts on itself. It also acts on the constant functor $R: P \to Set$, via the subtle $C$-algebra structure on $R$. You then say:

1. that $H: C \to R$ is a $C$-homomorphism
2. that $KL: C \times C \to R$ is a $C$-homomorphism,

both with respect to the subtle $C$-algebra structure on $R$.

I understand and agree with (1), but not (2). By my calculations, (2) is only true if you use the simple $C$-algebra structure on $R$. Of course, this would be a very nice result too… but I’d like to get it straight.

Here are the gory details. The object $C \times C$ of $[P, Set]$ has, in degree $n$, elements of the form $(B, \tilde{B})$ where $B = (q_1, \ldots, q_n)$ and $\tilde{B} = (\tilde{q}_1, \ldots, \tilde{q}_n)$; then by definition, $KL(B, \tilde{B}) = \sum_i q_i \log (q_i / \tilde{q}_i) \in R.$ The $C$-action on $C \times C$ is as follows. Let $A = (p_1, \ldots, p_n) \in C_n$ and, for $i \in \{1, \ldots, n\}$, let $(B_i, \tilde{B}_i) = ((q_i^1, \ldots, q_i^{k_i}), (\tilde{q}_i^1, \ldots, \tilde{q}_i^{k_i})) \in (C \times C)_{k_i}.$ Then the action is given by $A ((B_1, \tilde{B}_1), \ldots, (B_n, \tilde{B}_n)) = ((p_1 q_1^1, \ldots, p_1 q_1^{k_1}, \ldots, p_n q_n^1, \ldots, p_n q_n^{k_n}), (p_1 \tilde{q}_1^1, \ldots, p_1 \tilde{q}_1^{k_1}, \ldots, p_n \tilde{q}_n^1, \ldots, p_n \tilde{q}_n^{k_n})).$ Hence $KL(A ((B_1, \tilde{B}_1), \ldots, (B_n, \tilde{B}_n))) = \sum_{i = 1}^n \sum_{j = 1}^{k_i} p_i q_i^j \log (p_i q_i^j / p_i \tilde{q}_i^j) = \sum_{i = 1}^n p_i \sum_{j = 1}^{k_i} q_i^j \log (q_i^j / \tilde{q}_i^j) = \sum_{i = 1}^n p_i KL(B_i, \tilde{B}_i)$ …which is equal to $A(KL(B_1, \tilde{B}_1), \ldots, KL(B_n, \tilde{B}_n)),$ but only if we use the simple $C$-algebra structure on $R$.
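The calculation can also be checked numerically. A quick Python sketch (all names and sample distributions mine):

```python
from math import log

def kl(B, Bt):
    """Kullback-Leibler divergence KL(B, B~) for finite distributions."""
    return sum(q * log(q / qt) for q, qt in zip(B, Bt) if q > 0)

def act(A, pairs):
    """C-action of A on a list of pairs (B_i, B~_i), componentwise."""
    B  = tuple(p * q  for p, (Bi, Bti) in zip(A, pairs) for q  in Bi)
    Bt = tuple(p * qt for p, (Bi, Bti) in zip(A, pairs) for qt in Bti)
    return B, Bt

A = (0.3, 0.7)
pairs = [((0.5, 0.5), (0.9, 0.1)),
         ((0.2, 0.8), (0.4, 0.6))]

lhs = kl(*act(A, pairs))
rhs = sum(p * kl(Bi, Bti) for p, (Bi, Bti) in zip(A, pairs))
# lhs equals rhs: KL is a C-homomorphism for the *simple* structure,
# with no H(A) term appearing on the right-hand side
```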

Posted by: Tom Leinster on October 21, 2008 6:26 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

Yes, you’re right. I was careless.

I was also lazy: I usually write $\mathbb{P}$ instead of $P$, as is the custom in my land. I think P is for Permutation. $\mathbb{B}$ would be the braid category.

Posted by: Steve Lack on October 22, 2008 1:37 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

Ah, good! This makes the story of Kullback–Leibler much simpler: it doesn’t involve the ‘subtle’ $C$-algebra structure on $\mathbb{R}$.

I’ll have to think about the implications of this. But since entropy can be recovered from KL, I hope it gives a path towards a categorical/operadic characterization of entropy.

On the point of notation, I think Joyal, in his work on species, used $\mathbb{B}$ (or perhaps B) to denote the category of finite sets and bijections. I picked up this notation from somewhere, and I reckon it was there. Presumably B stands for bijection.

Posted by: Tom Leinster on October 22, 2008 2:09 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

Tom wrote in part:

(The 1-cardinality is just the cardinality. Here at the n-Category Café, we’re used to the convention that ‘1-widget’ means the same as ‘widget’.)

Ah, but if you don’t quote Joyal’s definition of cardinality, then most people will interpret the ‘cardinality’ of a finite probability space as $n$, that is, the $0$-cardinality.

(Ecosystems containing no species at all are off the scale; recall the axiom that $\sum p_i = 1$.)

Let’s not be so hasty! Like the field with one element, the empty probability space may not exist, but it still has some nice properties. For instance, its $\alpha$-cardinality obviously should be zero: $|\emptyset|_\alpha = 0$. This allows you to calculate $H_\alpha(\emptyset) = -\infty$; as far as I can tell, $D_\alpha(\emptyset) = 1/(\alpha - 1)$ for $\alpha \leq 1$ and $D_\alpha(\emptyset) = -\infty$ for $\alpha \geq 1$. These results are consistent with the multiplicative and log-like properties of cardinality and entropy.

It’s easy to make this precise. Let an unnormalised finite probability space be a finite list $(P_1, \ldots, P_n)$ of nonnegative real numbers. Let the norm of such a space $A$ be $\|A\| := \sum_i P_i$. If the norm is positive (in which case $n \geq 1$), then you can normalise it to a finite probability space; let $p_i$ be $P_i/\|A\|$. But you can write down formulas for the cardinalities and entropies (at least) without normalising first. (But sometimes you have to take limits to get sensible formulas for norm 0.)

Why bother with unnormalised probability spaces? For the same reason that we bother with unnormalised state vectors in quantum mechanics. The Hilbert space of unnormalised state vectors is easier to work with than the projective sphere of normalised ones (modulo phase), even though this is the actual space of pure states. Similarly, the category of unnormalised finite probability spaces may be easier to work with than the category of normalised ones.

Posted by: Toby Bartels on October 20, 2008 5:37 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

Thanks, Toby. I was secretly hoping that you would have something helpful to say about the case of the empty space, which had been troubling me. You are, after all, an acknowledged expert on nothing ;-).

For now, I just want to reply to one of your points.

if you don’t quote Joyal’s definition of cardinality, then most people will interpret the ‘cardinality’ of a finite probability space [with $n$ elements] as $n$

Right. This is always a danger with cardinality (understood in the general sense): when you make a statement like ‘the cardinality of $A$ is $x$’, you need to be careful to state exactly what type of structure you’re treating $A$ as. That’s because, in general:

forgetful functors do not preserve cardinality.

For example, the Euler characteristic of a topological space is not the same as the cardinality of its underlying point-set. The cardinality of a metric space is not the same as the Euler characteristic of its underlying topological space. The Euler characteristic of a category is not the same as the Euler characteristic of its underlying directed graph. And in the case that you remarked on, the cardinality of a finite probability space is not the same as the cardinality of its underlying set.

On the other hand, it would seem that:

free functors do preserve cardinality.

For example, the cardinality of a set is the same as the Euler characteristic of the discrete topological space on it. The cardinality of a set is also the same as the cardinality of the discrete metric space on it (in which the distance between distinct points is $\infty$). The Euler characteristic of a directed graph is the same as the Euler characteristic of the free category on it.

Posted by: Tom Leinster on October 21, 2008 6:51 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

Tom wrote in part:

Thanks, Toby. I was secretly hoping that you would have something helpful to say about the case of the empty space, which had been troubling me. You are, after all, an acknowledged expert on nothing ;-).

You’re welcome, Tom, and thank you for your confidence. Of course, that makes it only more annoying that my post is wrong.

I was thinking that you generalise the formulas for cardinality, entropy, and diversity by replacing $p_i$ by $P_i/\|A\|$, then simplify as far as possible by factoring out $\|A\|$. Taken literally, these all require division by $\|A\|$, so I got results for $A := \emptyset$ by fixing the $P_i$ and taking the limit as $\|A\| \to 0$. Unfortunately, that limit is not actually possible, as $\|A\|$ is not independent of the $P_i$. You could fix the $p_i$ instead and take $\|A\| \to 0$, but then there is no consistent result.

I wouldn’t be surprised if the entire space of unnormalised finite probability spaces (including the empty one) still turns out to be useful, but these measures will probably still break down for the empty space. (In quantum mechanics, after all, many things break down for the zero vector too. Try taking the expectation value of an operator, for example.)

Posted by: Toby Bartels on October 22, 2008 8:04 AM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

I’ll just make the observation that the cardinality $|A|_\alpha$ is the ‘number of equiprobable species’ that would give the same expected $\alpha$-surprise or $\alpha$-diversity measure as your finite probability space $A$. Here the ‘number of species’ is not necessarily an integer.

To explain: on the one hand, if you have $n$ equally probable species, i.e. the space $\left(\frac{1}{n},\dots,\frac{1}{n}\right)$, then the expected $\alpha$-surprise (or $\alpha$-diversity measure) is

(1)$D_\alpha\left(\tfrac{1}{n},\dots,\tfrac{1}{n}\right)=\sum_1^n \tfrac{1}{n}\sigma_\alpha\left(\tfrac{1}{n}\right)= \sigma_\alpha({n}^{-1}).$

On the other hand, you effectively, though not explicitly, defined the cardinality $|A|_\alpha$ of the space $A$ by

(2)$D_\alpha(A)=\sigma_\alpha({|A|_\alpha}^{-1}).$

Comparing the two hands (!) gives the observation.
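This observation is easy to check numerically. The sketch below is not part of the original comment: it assumes the explicit Hill-number formula $|A|_\alpha = \left(\sum_i p_i^\alpha\right)^{1/(1-\alpha)}$ for $\alpha \neq 1$, which is one standard closed form for this cardinality. For $n$ equiprobable species it returns exactly $n$, for any such $\alpha$; a skewed community counts as fewer 'effective' species:

```python
import numpy as np

def effective_species(p, alpha):
    """Hill number: the 'number of equiprobable species' with the same
    alpha-diversity as the distribution p.

    Assumes |A|_alpha = (sum_i p_i^alpha)^(1/(1-alpha)), alpha != 1
    (the limit alpha -> 1 gives exp of the Shannon entropy).
    """
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # species with zero proportion contribute nothing
    return (p ** alpha).sum() ** (1.0 / (1.0 - alpha))

# n equiprobable species: the effective number is exactly n
print(effective_species([0.25] * 4, alpha=2))  # → 4.0
# A skewed community has fewer effective species than actual species
print(effective_species([0.7, 0.1, 0.1, 0.1], alpha=2))  # strictly less than 4
```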

Posted by: Simon Willerton on October 21, 2008 4:48 PM | Permalink | Reply to this

### Probability, Statistics, and Astrophysics in Genetics; Re: Entropy, Diversity and Cardinality (Part 1)

When I was doing my Math/CS/Bio/Nano doctoral dissertation (1975-1977) the field of Population Biology had just been invaded by astrophysicists, who had a different set of equations and solutions at their fingertips.

Sir Julian Sorell Huxley FRS (22 June 1887–14 February 1975) was an English evolutionary biologist, humanist and internationalist. He was a proponent of natural selection, and a leading figure in the mid-twentieth century evolutionary synthesis. He was Secretary of the Zoological Society of London (1935–1942), the first Director of UNESCO, and a founding member of the World Wildlife Fund.

Julian Huxley, British biologist, tackled the subject of evolution at full length, in what became the defining work of his life. His role was that of a synthesiser, and it helped that he had met many of the other participants. His book Evolution: the Modern Synthesis was written whilst he was Secretary to the Zoological Society, and made use of his remarkable collection of reprints covering the first part of the century. It was published in 1942. Reviews of the book in learned journals were little short of ecstatic; the American Naturalist called it “The outstanding evolutionary treatise of the decade, perhaps of the century. The approach is thoroughly scientific; the command of basic information amazing.”

* Huxley’s main co-respondents in the modern evolutionary synthesis are usually listed as Ernst Mayr, Theodosius Dobzhansky, George Gaylord Simpson, Bernhard Rensch, Ledyard Stebbins and the population geneticists J.B.S. Haldane, Ronald Fisher [hence the deep Probability/Statistics connection] and Sewall Wright [“adaptive landscape” brought manifold approach]. However, at the time of Huxley’s book several of these had yet to make their distinctive contribution. Certainly, for Huxley, E.B. Ford and his co-workers in ecological genetics were at least as important; and Cyril Darlington, the chromosome expert, was a notable source of facts and ideas.

An analysis of the ‘authorities cited’ index of Evolution: the Modern Synthesis shows indirectly those whom Huxley regarded as the most important contributors to the synthesis up to 1941 (the book was published in 1942, and references go up to 1941). The authorities cited 20 or more times are: Darlington, Darwin, Dobzhansky, Fisher [probability/stat again], Ford, Goldschmidt, Haldane, J.S. Huxley, Muller, Rensch, Turrill, Wright.

What is the connection (if any?) between that Haldane and the Haldane of:

http://arxiv.org/abs/math-ph?papernum=0211026.

The Haldane-Wu exclusion statistics is considered from the generalized extensive statistics point of view and certain related mathematical aspects are investigated. A series representation for the corresponding generating function is proven. Equivalence of two formulae for the central charge, derived for the Haldane-Wu statistics via the thermodynamic Bethe ansatz, is established. As a corollary, a series representation with a free parameter for the Rogers dilogarithm is found. It is shown that the generating function, the entropy, and the central charge for the Gentile statistics majorize those for the Haldane-Wu statistics (under appropriate choice of parameters). From this, some dilogarithm inequality is derived.

Posted by: Jonathan Vos Post on October 22, 2008 2:21 AM | Permalink | Reply to this
Read the post Maximum Entropy Distributions
Weblog: The n-Category Café
Excerpt: Why are familiar distributions so often maximum Shannon entropy distributions for their class?
Tracked: November 2, 2008 5:49 PM
Read the post Entropy, Diversity and Cardinality (Part 2)
Weblog: The n-Category Café
Excerpt: Continuation of a discussion of entropy and cardinality and its relevance to ecology
Tracked: November 7, 2008 12:59 PM
Read the post When entropy and ecology collide...
Weblog: Evolving Thoughts
Excerpt: ...the albino silverback blinks once or twice, says knowingly "Yes, yes", and sends those who do understand math to these two posts at The n-Category Café: "Entropy, Diversity and Cardinality" post 1, post 2. If I read it aright,...
Tracked: November 10, 2008 1:30 AM
Weblog: The n-Category Café
Excerpt: Fritz on convex spaces as algebras for the finite Giry monad
Tracked: April 1, 2009 5:42 PM

### Re: Entropy, Diversity and Cardinality (Part 1)

Probably worth mentioning: my compadre Ben Allen of Boston University has a new article in The American Naturalist, “A New Phylogenetic Diversity Measure Generalizing the Shannon Index and Its Application to Phyllostomid Bats”.

Protecting biodiversity involves preserving the maximum number and abundance of species while giving special attention to species with unique genetic or morphological characteristics. In balancing different priorities, conservation policymakers may consider quantitative measures that compare diversity across ecological communities. To serve this purpose, a measure should increase or decrease with changes in community composition in a way that reflects what is valued, including species richness, evenness, and distinctness. However, counterintuitively, studies have shown that established indices, including those that emphasize average interspecies phylogenetic distance, may increase with the elimination of species. We introduce a new diversity index, the phylogenetic entropy, which generalizes in a natural way the Shannon index to incorporate species relatedness. Phylogenetic entropy favors communities in which highly distinct species are more abundant, but it does not advocate decreasing any species proportion below a community structure-dependent threshold. We contrast the behavior of multiple indices on a community of phyllostomid bats in the Selva Lacandona. The optimal genus distribution for phylogenetic entropy populates all genera in a linear relationship to their total phylogenetic distance to other genera. Two other indices favor eliminating 12 out of the 23 genera.

The “phylogenetic entropy” is defined in terms of a rooted phylogenetic tree $T$, which gives the evolutionary relationships among the species present in a biological community. Writing $b$ for a branch of $T$, $l(b)$ for the length of that branch and $p(b)$ for the proportion of individuals represented by leaves descending from $b$, the phylogenetic entropy is

$H_{p} = -\sum_b l(b) p(b) \log p(b).$

$H_{p}$ reduces to the Shannon index $H$ in the case that the tree $T$ has uniform branch lengths.
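As an aside (not from the original comment), the definition is simple enough to sketch directly, flattening the tree to hypothetical (length, proportion) pairs, one per branch. For a star tree, in which every species hangs directly off the root along a unit-length branch, each branch's $p(b)$ is just that species' proportion, and $H_p$ coincides with the Shannon index $H$:

```python
import math

def phylogenetic_entropy(branches):
    """H_p = -sum over branches b of l(b) * p(b) * log p(b).

    branches: list of (length l(b), proportion p(b)) pairs, one per
    branch of the rooted phylogenetic tree.
    """
    return -sum(l * p * math.log(p) for l, p in branches if p > 0)

def shannon(p):
    """Ordinary Shannon index of a list of species proportions."""
    return -sum(q * math.log(q) for q in p if q > 0)

# Star tree with unit branch lengths: phylogenetic entropy = Shannon index
p = [0.5, 0.3, 0.2]
star = [(1.0, q) for q in p]
assert abs(phylogenetic_entropy(star) - shannon(p)) < 1e-12
```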

Posted by: Blake Stacey on June 29, 2009 8:10 PM | Permalink | Reply to this

### Re: Entropy, Diversity and Cardinality (Part 1)

Thanks, Blake! That looks like an interesting article.

Posted by: Tom Leinster on June 30, 2009 1:53 AM | Permalink | Reply to this
Read the post The 1000th Post on the n-Category Café
Weblog: The n-Category Café
Excerpt: Meet our new hosts: Alex Hoffnung, Tom Leinster, Mike Shulman and Simon Willerton!
Tracked: November 13, 2009 3:30 AM
Read the post Magnitude of Metric Spaces: A Roundup
Weblog: The n-Category Café
Excerpt: Resources on magnitude of metric spaces.
Tracked: January 10, 2011 4:02 PM
