## November 24, 2020

### The Uniform Measure

#### Posted by Tom Leinster

Category theory has an excellent track record of formalizing intuitive statements of the form “this is the canonical such and such”. It has been especially effective in topology and algebra.

But what does it have to say about canonical measures? On many spaces, there is a choice of probability measure that seems canonical, or at least obvious: the first one that most people think of. For instance:

• On a finite space, the obvious probability measure is the uniform one.

• On a compact metric space whose isometry group acts transitively, the obvious probability measure is Haar measure.

• On a subset of $\mathbb{R}^n$, the obvious probability measure is normalized Lebesgue measure (at least, assuming the subset has finite nonzero volume).

Emily Roff and I found a general recipe for assigning a canonical probability measure to a space, capturing all three examples above: arXiv:1908.11184. We call it the uniform measure. It’s categorically inspired rather than genuinely categorical, but I think it’s a nice story, and I’ll tell it now.

#### Tony goes swimming

Let’s warm up with a hypothetical scenario. Tony goes swimming once a week, on a variable day. What probability distribution on $\{Mon, Tue, Wed, Thu, Fri, Sat, Sun\}$ should we use to model his swimming habits?

If we have no information whatsoever, it’s got to be the uniform distribution $(1/7, \ldots, 1/7)$, purely by symmetry.

But now we might bring to bear our knowledge that swimming is a leisure activity, which makes it more likely to happen at weekends. Or perhaps Tony slips us a definite clue: he goes at weekends exactly as often as he goes on weekdays. Of course there are many distributions that satisfy this constraint, but again, symmetry compels us to choose

$\Bigl( \frac{1}{10}, \frac{1}{10}, \frac{1}{10}, \frac{1}{10}, \frac{1}{10}, \frac{1}{4}, \frac{1}{4} \Bigr)$

Symmetry only gets us so far, though. What if Tony also tells us that the probability he goes on Friday is equal to the sum of Wednesday’s and Thursday’s probabilities? Any distribution

$\Bigl( \frac{1}{4} - 2p, \frac{1}{4} - 2p, p, p, 2p, \frac{1}{4}, \frac{1}{4} \Bigr)$

with $0 \leq p \leq 1/8$ satisfies the known constraints and the obvious symmetry requirements, and it’s not clear which of them should be regarded as “canonical”.
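If you want to convince yourself that this really is the full symmetric family, here's a quick sanity check (my own illustration, not part of the original argument), verifying that every member satisfies Tony's constraints:

```python
# Check that (1/4 - 2p, 1/4 - 2p, p, p, 2p, 1/4, 1/4) satisfies all of
# Tony's constraints for 0 <= p <= 1/8, using exact rational arithmetic.
from fractions import Fraction

def family(p):
    q = Fraction(1, 4)
    return [q - 2*p, q - 2*p, p, p, 2*p, q, q]  # Mon, ..., Sun

for p in [Fraction(0), Fraction(1, 16), Fraction(1, 8)]:
    mon, tue, wed, thu, fri, sat, sun = family(p)
    assert sum(family(p)) == 1                       # probabilities sum to 1
    assert sat + sun == mon + tue + wed + thu + fri  # weekends as often as weekdays
    assert fri == wed + thu                          # Friday = Wednesday + Thursday
    assert all(x >= 0 for x in family(p))            # a genuine distribution
```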

#### What are we doing?

What’s happening here is that we’re looking for the most uniform distribution possible, but except in simple cases, we can’t say what that means without some way of quantifying uniformity. So, let’s think about that. How uniform is a given probability measure on a space?

You may have guessed that this has something to do with entropy, and you’d be right. But I want to explain all this from the ground up, motivating everything from first principles, without invoking the E word. Also, if you guessed that what we’re going to end up with is a maximum entropy distribution, you’d only be half right. There are actually two key ideas here, and maximizing entropy is only one of them.

#### How spread out is a distribution?

The first key idea is to look for the most spread-out distribution possible on a space. I won’t say just yet what kind of “space” Emily and I worked with, but they include metric spaces, so you can keep that family of examples in mind.

Let’s consider this subset of $\mathbb{R}^2$:

It’s drawn here with an even shading, which corresponds to the uniform distribution — I mean, Lebesgue measure, normalized to give the space a total measure of $1$. But of course there are other probability measures on it, like this one with a single area of high concentration (shaded darker) —

— or this one, with two areas of high concentration —

Which one is the most spread out?

Of course, it depends what “spread out” means. It’s pretty clearly not the second one, where most of the mass is concentrated in the centre. But arguably the third is more spread out than the first, uniform, distribution: relative to the uniform distribution, some of the mass has been pushed out to the sides.

Or take this simpler example. Consider all probability measures on a line segment of length $1$, and let’s temporarily define the “spread” of a distribution on the line as the expected distance between a pair of points chosen randomly according to that distribution. An easy little calculation shows that with the uniform distribution, the average distance between points is $1/3$. But we can do better: if we put half the mass at one endpoint and half at the other, then the average distance between points is $1/2$. So the uniform distribution isn’t always the most spread out! (I don’t know which distribution is.)
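Both numbers are easy to check, the first by a quick simulation and the second by hand (this little check is my own addition):

```python
# The "spread" of a distribution on [0, 1], temporarily defined as the
# expected distance between two independent random points.
import random

random.seed(0)
n = 200_000
# Uniform distribution: E|X - Y| = 1/3, estimated by Monte Carlo.
uniform_spread = sum(abs(random.random() - random.random()) for _ in range(n)) / n
assert abs(uniform_spread - 1/3) < 0.01

# Half the mass at each endpoint: a random pair is at distance 1 with
# probability 1/2 and at distance 0 with probability 1/2, so the mean is 1/2.
endpoint_spread = 0.5 * 1 + 0.5 * 0
assert endpoint_spread == 0.5
```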

This measure of spread isn’t actually the one we’ll use. What we’ll work with is not the distance between points, but the similarity between them.

Formally, take a compact Hausdorff space $X$ and a “similarity kernel” $K$ on it, which means a continuous function $K \colon X \times X \to \mathbb{R}^+$ such that $K(x, x) \gt 0$ for every point $x$. You can obtain such a space from a compact metric space by putting $K(x, y) = e^{-d(x, y)}$. That’s the most important family of examples.

Suppose we also have a probability measure $\mu$ on $X$. (Formally, “measure” means “Radon measure”.) We can quantify how ordinary or typical a point is with respect to the measure — in other words, how dark you’d colour it in pictures like the ones above. The typicality of $x \in X$ is

$(K \mu)(x) = \int_X K(x, y) \,d\mu(y) \in \mathbb{R}^+.$

It’s the expected similarity between $x$ and a random point. The higher it is, the more concentrated the measure is near $x$.

The mean typicality of a point in $X$ is

$\int_X K\mu \,d\mu.$

This is high if the measure is highly concentrated. For instance, if we’re dealing with a metric space then $K(x, y) = e^{-d(x, y)}$ always lies between $0$ and $1$, so the maximum possible value this can have is $1$, which is attained if and only if $\mu$ is concentrated at a single point — a Dirac delta. So, $\int K\mu \, d\mu$ quantifies the lack of spread. Hence

$1 \Big/ \int_X K\mu \,d\mu$

quantifies the spread of the measure $\mu$ across the space $X$.
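To make these formulas concrete, here's a tiny finite sketch (my own toy example, not from the paper): three points on a line with $K(x, y) = e^{-d(x, y)}$, where typicality is $(K\mu)(x) = \sum_y K(x, y)\,\mu(y)$ and spread is the reciprocal of mean typicality.

```python
# Three points on a line at 0, 1 and 5, with similarity K(x,y) = exp(-d(x,y)).
import math

pts = [0.0, 1.0, 5.0]
K = [[math.exp(-abs(x - y)) for y in pts] for x in pts]

def spread(mu):
    # typicality of each point, then 1 / (mean typicality)
    Kmu = [sum(K[i][j] * mu[j] for j in range(3)) for i in range(3)]
    return 1 / sum(mu[i] * Kmu[i] for i in range(3))

# A Dirac delta is maximally concentrated: its spread is exactly 1...
assert abs(spread([1.0, 0.0, 0.0]) - 1.0) < 1e-12
# ...while the uniform distribution is more spread out.
assert spread([1/3, 1/3, 1/3]) > 1.0
```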

But this is just one way of quantifying spread! More generally, instead of taking the arithmetic mean of the ordinariness (which is what $\int K\mu\,d\mu$ is), we can take any power mean. Then we end up with

$1 \Big/ \Bigl(\int_X (K\mu)^t \,d\mu\Bigr)^{1/t}$

as our measure of spread, for any real $t \neq 0$.

For reasons I won’t go into, it’s convenient to reparametrize with $t = q - 1$ and it’s sensible to restrict to $q \geq 0$. Simplifying, our formula becomes

$D_q^K(\mu) = \Bigl( \int_X (K\mu)^{q - 1} \,d\mu \Bigr)^{1/(1 - q)}$

($q \neq 1$). And although this formula doesn’t make sense when $q = 1$, taking the limit as $q \to 1$ gives the right definition there:

$D_1^K(\mu) = \exp\Bigl( - \int_X \log(K\mu)\,d\mu \Bigr).$

If this sounds familiar, it might be because Christina Cobbold and I used $D_q^K(\mu)$ as measures of biological diversity. Here $X$ is to be thought of as a finite set of species, $K(x, y)$ indicates the degree of similarity between species (genetic, for instance), and $\mu$ is the relative abundance distribution of the species in some ecological community. High values of $D_q^K(\mu)$ indicate a highly diverse community. The parameter $q$ controls the relative emphasis placed on typical or atypical species: e.g. $q = 0$ gives atypical species as much importance as typical ones, while $q = \infty$ depends only on the most typical species of all.

In any case, Emily and I call $D_q^K(\mu)$ the diversity of $\mu$ of order $q$. Its logarithm,

$H_q^K(\mu) = \log D_q^K(\mu),$

is the entropy of $\mu$ of order $q$. In the special case where $X$ is finite and

$K(x, y) = \begin{cases} 1 &\text{if } x = y \\ 0 &\text{if } x \neq y, \end{cases}$

the entropy $H_q^K(\mu)$ is the Rényi entropy of order $q$, and in the even more special case where also $q = 1$, it’s the Shannon entropy.

For today it doesn’t matter whether we use the diversities $D_q^K(\mu)$ or the entropies $H_q^K(\mu)$, since all we’re interested in is maximizing them, and logarithm is an increasing function. So “diversity” and “entropy” mean essentially the same thing, and in this geometric context, they’re our formalization of the idea of “spread-outness”.
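Here's a quick numerical check of the special case just mentioned (my own illustration): with $K$ the identity kernel on a finite set, $D_q^K(\mu) = \bigl(\sum_i p_i^q\bigr)^{1/(1-q)}$, the exponential of Rényi entropy, and the $q \to 1$ limit gives the exponential of Shannon entropy.

```python
# D_q^K for a finite set with the identity kernel K, where (K mu)(i) = p_i.
import math

p = [0.5, 0.3, 0.2]

def D_identity(q, p):
    if q == 1:  # the q -> 1 limit: exp of Shannon entropy
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    return sum(pi ** q for pi in p) ** (1 / (1 - q))

# Order 2: the exponential of Renyi entropy of order 2, i.e. 1 / sum p_i^2.
assert abs(D_identity(2, p) - 1 / sum(pi**2 for pi in p)) < 1e-12
# The formula for q near 1 approaches the q = 1 formula.
assert abs(D_identity(1.000001, p) - D_identity(1, p)) < 1e-4
```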

#### What’s the most spread-out distribution of them all?

Fix a space $X$ with a similarity kernel $K$, as above. You won’t lose much if you assume it’s a metric space $X$ with $K(x, y) = e^{-d(x, y)}$. In any case, I’ll assume from now on that $K$ is symmetric. (The theorem I’m about to state needs this.)

Two questions:

• Which probability measure $\mu$ on $X$ maximizes the diversity $D_q^K(\mu)$?

• What is the value of the maximum diversity, $\sup_\mu D_q^K(\mu)$?

We’ve already observed that if we want to maximize diversity (“spread-outness”), the uniform distribution might not be best. We saw that for the line segment and the potato shapes. Another simple example is a three-point space consisting of two points very close together and the third far away. You wouldn’t want to use the uniform distribution, as that would put $2/3$ of the weight at one end and $1/3$ at the other. Something closer to $(1/4, 1/4, 1/2)$ would be more spread out.

So the answers to these questions aren’t going to be simple. But also, there’s an elephant in the room: both answers surely depend on $q$! After all, changing $q$ changes $D_q^K(\mu)$, and different values of $q$ sometimes have conflicting ideas about when one probability measure is more spread out than another. It can happen, for instance, that

$D_0^K(\mu) \gt D_0^K(\nu) \qquad \text{but} \qquad D_1^K(\mu) \lt D_1^K(\nu)$

for probability measures $\mu$ and $\nu$ on $X$.

However, Emily and I prove that it doesn’t actually matter! The answers to both questions are miraculously independent of $q$. That is:

• There is some probability measure $\mu$ that maximizes $D_q^K(\mu)$ for all $q \in [0, \infty]$ simultaneously.

• The maximum diversity $\sup_\mu D_q^K(\mu)$ is the same for all $q \in [0, \infty]$.

If this sounds familiar, it might be because Mark Meckes and I proved it in the case of a finite space $X$. Extending it to compact spaces turned out to be much harder than anticipated. For instance, part of the proof is to show that $D_q^K(\mu)$ is continuous in $\mu$, which in the finite case is pretty much a triviality, but in the compact case involves a partition of unity argument and takes up several pages of Emily’s and my paper.

What matters here is the first bullet point: there’s a best of all possible worlds, a probability measure $\mu$ on our space that unambiguously maximizes diversity (or entropy, or spread). Sometimes there’s more than one such measure. But in many examples, including many of the most interesting ones, there’s only one, so here I’ll casually refer to it as the most spread out measure on $X$.
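You can see both phenomena at once in the three-point space above (two points close together, one far away). The sketch below is my own illustration, not from the paper: a grid search over the simplex finds essentially the same maximizing distribution for several values of $q$, and that distribution puts extra weight on the isolated point.

```python
# Two points at distance 0.2, a third at distance 2 from both; K = exp(-d).
import math

d = [[0.0, 0.2, 2.0],
     [0.2, 0.0, 2.0],
     [2.0, 2.0, 0.0]]
K = [[math.exp(-d[i][j]) for j in range(3)] for i in range(3)]

def diversity(q, p):
    Kp = [sum(K[i][j] * p[j] for j in range(3)) for i in range(3)]
    if q == 1:
        return math.exp(-sum(p[i] * math.log(Kp[i]) for i in range(3) if p[i] > 0))
    s = sum(p[i] * Kp[i] ** (q - 1) for i in range(3) if p[i] > 0)
    return s ** (1 / (1 - q))

def argmax(q, n=150):
    # brute-force grid search over the probability simplex
    best, best_p = -1.0, None
    for i in range(n + 1):
        for j in range(n + 1 - i):
            p = [i / n, j / n, 1 - (i + j) / n]
            v = diversity(q, p)
            if v > best:
                best, best_p = v, p
    return best_p

maximizers = [argmax(q) for q in [0.5, 1, 2, 5]]
# The same maximizer (up to grid resolution) for every q:
for p in maximizers[1:]:
    assert all(abs(a - b) < 0.02 for a, b in zip(p, maximizers[0]))
# The far-away point gets much more than its uniform share of 1/3:
assert maximizers[0][2] > 0.4
```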

#### Back to the line

The simplest nontrivial example is a line segment. What’s its most spread out measure?

Crucially, the answer depends on how long the line is. It’s a linear combination of 1-dimensional Lebesgue measure and a Dirac delta at each end, but the coefficients change with the length. I could write down the formula, which is simple enough, but that would distract from the main point:

As the length increases to $\infty$, the most spread out measure converges to normalized Lebesgue.

In other words, the Dirac measures at the end fade to nothing as we scale up.

The formal statement is this: if we write

$K^t(x, y) = e^{-t|x - y|}$

then for each real $t \gt 0$, the space $[0, L]$ with similarity kernel $K^t$ has a unique most spread out measure $\mu_t$, and in the weak${}^*$ topology on the space of probability measures, $\mu_t$ converges to the normalized Lebesgue measure on $[0, L]$ as $t \to \infty$.

Another term for “normalized Lebesgue measure” on the line is “uniform measure”. So in this example at least:

The uniform measure is the large-scale limit of the most spread out measure.

We’re going to take the lesson of this example and turn it into a general definition.

#### Defining the uniform measure

Here goes. Let $X$ be a compact metric space. Suppose that for $t \gg 0$, its rescaling $t X$ has a unique most spread out measure $\mu_t$, and that $\mu_t$ has a limit in the weak${}^*$ topology as $t \to \infty$. Then the uniform measure $\mu_X$ on $X$ is that limit:

$\mu_X = \lim_{t \to \infty} \mu_t.$

Conceptually, the difference between the “most spread out” measures $\mu_t$ and the uniform measure $\mu_X$ is that $\mu_t$ depends on the scale factor $t$ (as in the example of the line segment), but $\mu_X$ doesn’t. The uniform measure is independent of scale: $\mu_{u X} = \mu_X$ for all $u \gt 0$. That’s one of the properties that makes the uniform measure canonical.

In summary, the first key idea behind the definition of uniform measure is to take the most spread out (maximum entropy) distribution, and the second key idea is to then pass to the large-scale limit.

#### Recapturing the three examples

Back at the start of the post, I claimed that our notion of uniform measure captured three intuitive examples of the “canonical measure” on a space. Let’s check back in on them.

• Finite spaces.   For a finite metric space, the most spread out measure is not usually uniform, as we’ve seen. But as we scale up, it always converges to what’s usually called the uniform measure. In other words, what Emily and I call the uniform measure is, in this case, what everyone else calls the uniform measure.

One way to think about this is as follows. In general, to get the uniform measure on a space $X$, we take the most spread out measure $\mu_t$ on $t X$ for each $t \gt 0$, then pass to the limit as $t \to \infty$ to get $\mu_X$. But for a finite space, we can do these two processes in the opposite order: first take the limit as $t \to \infty$ of $t X$, giving us a copy of $X$ where all distances between distinct points are $\infty$, and then take the most spread out measure on that space, which trivially is the one that gives equal measure to each point. This is just a story I tell myself: I know of no conceptual reason why interchanging the order of the processes should give the same result, and in any case the story only makes sense for finite $X$, since otherwise we escape the world of compact spaces. But perhaps it’s a helpful story.
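You can watch this convergence happen numerically. In the sketch below (my own illustration), I compute the maximizing distribution on $t X$ for the three-point space by solving the “weight” equations $Z w = (1, \ldots, 1)$ with $Z_{i j} = e^{-t d(i, j)}$ and normalizing; that this gives the maximizer when the weights are positive and $Z$ is positive definite is true here, but take it as an assumption of the sketch.

```python
# Maximizing distribution on tX for the three-point space, via Zw = 1.
import math

d = [[0.0, 0.2, 2.0],
     [0.2, 0.0, 2.0],
     [2.0, 2.0, 0.0]]

def maximizer(t):
    # Solve the 3x3 system Z w = 1 by Gauss-Jordan elimination with pivoting.
    Z = [[math.exp(-t * d[i][j]) for j in range(3)] + [1.0] for i in range(3)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(Z[r][col]))
        Z[col], Z[piv] = Z[piv], Z[col]
        for r in range(3):
            if r != col:
                f = Z[r][col] / Z[col][col]
                Z[r] = [a - f * b for a, b in zip(Z[r], Z[col])]
    w = [Z[i][3] / Z[i][i] for i in range(3)]
    s = sum(w)
    return [wi / s for wi in w]

p_small, p_large = maximizer(1.0), maximizer(20.0)
# At t = 1 the far-away point is overweighted...
assert p_small[2] > 1/3 + 0.05
# ...but by t = 20 the maximizer is very close to uniform.
assert all(abs(pi - 1/3) < 0.01 for pi in p_large)
```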

• Homogeneous space.   Now take a compact metric space $X$ whose isometry group acts transitively on points. A version of the Haar measure theorem states that there’s a unique isometry-invariant probability measure $\mu$ on $X$. And it can be shown that the most spread out measure $\mu_t$ on $t X$ is just $\mu$. Taking the limit as $t \to \infty$, the uniform measure on $X$ is, therefore, also $\mu$.

There’s a caveat here: the proof assumes that the metric space $X$ is of negative type, a classical condition that I don’t want to go into now. Many spaces are of negative type, including all subspaces of $\mathbb{R}^n$. But it would be nice to know whether the result also holds for spaces that aren’t of negative type.

(And to be more careful than I really want to be in a blog post, it’s assumed here and in many other places that the space concerned is nonempty.)

• Subsets of $\mathbb{R}^n$.   In the case of a line segment, we saw that the uniform measure is the uniform measure in the usual sense (normalized Lebesgue). What about subsets of $\mathbb{R}^n$ in general?

Let’s consider just those compact subsets $X$ of $\mathbb{R}^n$ that have nonzero measure. Then we can restrict Lebesgue measure to $X$ and normalize it to give $X$ a total measure of $1$. Is this canonical probability measure on $X$ the same as the uniform measure that Emily and I define?

It is, and we prove it in our paper. A crucial role is played by a result of Mark Meckes: every compact subset of $\mathbb{R}^n$ has a unique most spread out measure. (The proof is Fourier-analytic.) But the point I want to emphasize is that unlike in the previous two examples, we have no idea how to describe $\mu_t$ for finite scale factors $t$!

However, despite not knowing $\mu_t$ for finite $t$, we can describe the limit of $\mu_t$ as $t \to \infty$ — in other words, the uniform measure on $X$. And as promised, it’s precisely normalized Lebesgue.

Posted at November 24, 2020 7:30 PM UTC

TrackBack URL for this Entry:   https://golem.ph.utexas.edu/cgi-bin/MT-3.0/dxy-tb.fcgi/3270

### Re: The Uniform Measure

Apologies to Simon for using the word “spread” so much, when he’s already defined it to mean something different (although related). I couldn’t think of a decent synonym.

Posted by: Tom Leinster on November 25, 2020 10:47 AM | Permalink | Reply to this

### Re: The Uniform Measure

This is tangential to the main point of the post, but your first, temporary definition of “spread” is known in statistics as mean absolute difference, and on the unit interval it is indeed maximized by putting half the weight on each endpoint, see e.g. here.

Posted by: Mark Meckes on November 25, 2020 12:57 PM | Permalink | Reply to this

### Re: The Uniform Measure

Aha, thanks! It’s not clear to me intuitively, since if we put half the weight (sand, say) at each endpoint then a random pair of grains has 50% chance of being at the maximal distance from each other (wonderful!) but also 50% chance of being at the minimal distance from each other (terrible!). So it’s good to see a proof.

Posted by: Tom Leinster on November 25, 2020 1:23 PM | Permalink | Reply to this

### Re: The Uniform Measure

Yeah, it’s tempting to say “surely that’s the extremal distribution, because what else could it be”, but the same non-reasoning could just as well lead to the incorrect guess that the diversity-maximizing measure on an interval must also be uniform on the endpoints. I tried, and failed, to find a more “conceptual” proof, and I think that’s why.

Posted by: Mark Meckes on November 25, 2020 2:00 PM | Permalink | Reply to this

### Re: The Uniform Measure

I’d like to amplify Tom’s remark

we have no idea how to describe $\mu_t$ for finite scale factors $t$!

for arbitrary subsets of $\mathbb{R}^n$. In fact, in some sense we know that $\mu_t$ must be more complicated than you would guess.

In particular, for $n = 1$, the most spread out measure $\mu_t$ on a line segment is equal to normalized Lebesgue measure plus point masses at the end points. The weight of those point masses depends on $t$, and vanishes in the limit $t \to \infty$.

It’s natural to guess, then, that for, say, a Euclidean ball in $\mathbb{R}^n$, the most spread out measure $\mu_t$ is normalized Lebesgue measure plus something supported on the boundary which vanishes as $t \to \infty$. But actually, by comparing results in Tom and Emily’s paper and in a paper of Simon’s, we can tell that when $n$ is odd and $\ge 3$, $\mu_t$ must be supported on a proper subset of the ball.

Which proper subset? We don’t know! But this means that the convergence to normalized Lebesgue measure as $t \to \infty$ must happen in a more subtle way than in the one-dimensional case.

And what about even $n$? Again, we don’t know! But the best guess is that things are more complicated there than in odd dimensions.

Posted by: Mark Meckes on November 26, 2020 10:37 AM | Permalink | Reply to this

### Re: The Uniform Measure

when $n$ is odd and $\geq 3$, $\mu_t$ must be supported on a proper subset of the ball.

As you know, my guess is that it’s supported on a finite union of concentric spheres (for all $n \geq 2$). But I admit there’s not much evidence for this.

Anyway, can you say more about how you combine Simon’s results and ours to reach this conclusion? I’m sorry if you’ve told me this before; my memory’s not getting any better.

Posted by: Tom Leinster on November 26, 2020 12:18 PM | Permalink | Reply to this

### Re: The Uniform Measure

By Proposition 6.3 in your paper with Emily, each $\mu_t$ is balanced (for the similarity kernel $K^t$). Theorem 2 in Simon’s paper implies that if $supp(\mu_t)$ is the entire ball and $\mu_t$ is balanced, then $\mu_t$ must be a weight measure for the ball at scale $t$. Since each compact subset of $\mathbb{R}^n$ has a unique weight distribution (here I guess we need to go to this paper or this one), $\mu_t$ would therefore be the weight distribution for the ball.

On the other hand, Simon’s computations show that for odd $n \ge 3$, the weight distribution of the ball is not a (signed) measure; it includes radial derivatives of the surface area measure on the boundary. So $\mu_t$ cannot be the weight distribution, and therefore $supp(\mu_t)$ cannot be the entire ball.

(Apologies to anyone else reading this who doesn’t know the terminology; it didn’t seem worth the time explaining it all here.)

Posted by: Mark Meckes on November 26, 2020 2:19 PM | Permalink | Reply to this

### Re: The Uniform Measure

Got it; thanks.

Posted by: Tom Leinster on November 26, 2020 7:42 PM | Permalink | Reply to this

### Re: The Uniform Measure

Is there any intuition for why the problem might be harder in even dimensions than odd?

(Quantum information theory runs into an even/odd dimension issue regarding the Clifford group, with implications for things like finite-dimensional Wigner functions that nobody has really thought through completely yet. I doubt it’s the same problem, but analogous troubles are interesting!)

Posted by: Blake Stacey on November 28, 2020 6:09 PM | Permalink | Reply to this

### Re: The Uniform Measure

I don’t know about intuition, but here’s the reason for the difference between even and odd dimensions: the operator $K$ in the setting of $\mathbb{R}^n$ is convolution with the function $e^{-\|x\|_2}$. The Fourier transform of that function is (up to some constants somewhere) $(1 + \|x\|_2^2)^{-(n+1)/2}$. This means that the inverse of $K$ is the pseudodifferential operator $(Id - \Delta)^{(n+1)/2}$. When $n$ is odd that’s actually a differential operator, and classical PDE methods can be brought to bear. But when $n$ is even that fractional power means we get a nonlocal operator, which is trickier to deal with.
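A quick one-dimensional sanity check of the Fourier transform fact (an editorial illustration, not part of the original comment): for $n = 1$, the transform of $e^{-|x|}$ is $2/(1 + \xi^2)$, which is $(1 + \|\xi\|^2)^{-(n+1)/2}$ up to a constant.

```python
# Numerically compute the Fourier transform of exp(-|x|) at a few frequencies.
import math

def ft(xi, R=40.0, n=200000):
    # The integrand is even, so integrate 2 * exp(-x) cos(xi x) over [0, R];
    # the tail beyond R = 40 is of size exp(-40) and can be ignored.
    h = R / n
    return 2 * sum(math.exp(-x) * math.cos(xi * x)
                   for x in ((i + 0.5) * h for i in range(n))) * h

for xi in [0.0, 0.5, 1.0, 2.0]:
    assert abs(ft(xi) - 2 / (1 + xi**2)) < 1e-3
```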

If anyone knows how to extract some intuition from this situation, I’d be very happy to hear it!

It may also be worth mentioning that this depends strongly on the fact that we’re equipping $\mathbb{R}^n$ with the Euclidean $\ell^2$ norm. With the $\ell^1$ norm we would get a differential operator for every $n$, but in that case we have different tools that are easier to work with anyway. And with other norms we can’t do much of anything explicitly at all.

Posted by: Mark Meckes on November 29, 2020 1:22 PM | Permalink | Reply to this

### Re: The Uniform Measure

Prompted by the conversation that Mark started, I want to pick up on one little thing in our paper that I was very happy to learn.

Controversially, the most spread out measure on a space needn’t have full support. That is, it can vanish entirely on some nonempty open set. It’s controversial because in the ecological interpretation, that means that maximizing diversity can entail killing off some species entirely!

Mark and I discussed this point in depth in Section 11 of our paper on maximizing diversity. It’s actually reasonable if you think about it! I won’t reproduce the argument here; I’ll just note that intuitively, a species can be absent from the maximizing distribution if it’s very similar to other species.

In the geometric terms of this post, that means that although the most spread out measure on our space $X$ needn’t have full support, no point in $X$ should be too far from the support.

That’s the intuition. But in fact, Emily and I proved a theorem making this precise. It’s Corollary 7.8, which in the case of metric spaces $X$ says: every point of $X$ is within distance $H_{max}(X)$ of some point in the support of the most spread out measure. Here $H_{max}(X)$ is the maximum entropy of $X$: the common value of $\sup_\mu H_q^K(\mu)$ for all $q \in [0, \infty]$. So although the support of the most spread out measure may have holes in it, they can’t be too big.

I’ll say more about the significance of this result in another comment, later.

Posted by: Tom Leinster on November 27, 2020 12:01 PM | Permalink | Reply to this

### Re: The Uniform Measure

Now let’s fix a compact set $X \subset \mathbb{R}^n$ and think about $\mu_t$ (the maximizing or “most spread out” measure for $t X$) as $t \to \infty$. How does the support of $\mu_t$ behave?

Qualitatively, the result I mentioned in the last comment guarantees that the support of each $\mu_t$ doesn’t have holes that are too big. But by exploiting the exact quantitative form (and a result of Mark’s on Minkowski dimension), it follows that

$supp(\mu_t) \to X \quad \text{as} \quad t \to \infty.$

Here I mean convergence in the Hausdorff metric. Also, I’m viewing $X$ as fixed and the metric as being scaled up by $t$, rather than taking a subset $t X$ of $\mathbb{R}^n$ that varies with $t$.

The moral of the story: although for each individual $t$, the measure $\mu_t$ probably doesn’t have full support, the supports get denser and denser in $X$ as $t \to \infty$.

A surprise: despite this, the uniform measure $\mu_X = \lim_{t \to \infty} \mu_t$ does not always have full support! As I said in the post itself, the uniform measure on $X \subset \mathbb{R}^n$ is just rescaled Lebesgue measure on $X$ (assuming that $X$ has nonzero measure). So if $X$ is something like a lollipop shape in the plane, then the lollipop stick has zero measure; the measure is supported on the disc part.

For large values of $t$, the support of the maximizing measure $\mu_t$ on the lollipop contains lots of points on the stick. No point on the stick is too far from an element of $supp(\mu_t)$. As $t$ increases, the support of $\mu_t$ gets denser and denser. But the limiting measure $\mu_X$ vanishes on the stick.

There’s no contradiction here, because support isn’t continuous with respect to the Hausdorff distance.

Posted by: Tom Leinster on November 27, 2020 1:16 PM | Permalink | Reply to this

### Re: The Uniform Measure

It’s controversial because in the ecological interpretation, that means that maximizing diversity can entail killing off some species entirely!

Probably this species should be mathematicians.

Posted by: John Baez on November 30, 2020 2:43 AM | Permalink | Reply to this

### Re: The Uniform Measure

You joke, but this is not some weirdo mathematicianly conclusion that contradicts real-world experience. It’s an entirely serious argument, with the real world very much in mind.

To be sure, most people find it counterintuitive at first, but it’s one of those counterintuitive phenomena that becomes intuitive after some thought. Intuitions can be trained!

I’ll quote the relevant part of Mark’s and my article:

We saw in Examples 9.3 and 9.4 that for certain similarity matrices $Z$ [which in the language of this post would be called $K$], none of the maximizing distributions has full support. Mathematically, this simply means that maximizing distributions sometimes lie on the boundary of [the simplex] $\Delta_n$. But ecologically, it may sound shocking: is it reasonable that diversity can be increased by eliminating some species?

We argue that it is. Consider, for instance, a forest consisting of one species of oak and ten species of pine, with each species equally abundant. Suppose that an eleventh species of pine is added, again with equal abundance (Figure 4 [reproduced below]). This makes the forest even more heavily dominated by pine, so it is intuitively reasonable that the diversity should decrease. But now running time backwards, the conclusion is that if we start with a forest containing the oak and all eleven pine species, eliminating the eleventh should increase diversity.

Here’s Figure 4:

We go on:

To clarify further, recall from Section 3 that diversity is defined in terms of the relative abundances only. Thus, eliminating species $i$ causes not only a decrease in [its relative abundance] $p_i$, but also an increase in the other relative abundances $p_j$. If the $i$th species is particularly ordinary within the community (like the eleventh species of pine), then eliminating it makes way for less ordinary species, resulting in a more diverse community.

And here’s what I guess is behind your teasing:

The instinct that maximizing diversity should not eliminate any species is based on the assumption that the distinction between species is of high value. (After all, if two species were very nearly identical — or in the extreme, actually identical — then losing one would be of little importance.) If one wishes to make that assumption, one must build it into the model. This is done by choosing a similarity matrix $Z$ with a low similarity coefficient $Z_{i j}$ for each $i \neq j$. Thus, $Z$ is close to the identity matrix $I$ (assuming that similarity is measured on a scale of $0$ to $1$). Example 10.7 guarantees that in this case, there is a unique maximizing distribution and it does not, in fact, eliminate any species.

We go on to derive necessary and sufficient conditions on a similarity matrix $Z$ for its diversity-maximizing distribution(s) to have full support, i.e. not eliminate any species. All of this is taken from Section 11 of our paper.

Posted by: Tom Leinster on November 30, 2020 1:28 PM | Permalink | Reply to this

### Re: The Uniform Measure

In the oaks and pines example all that’s going on is just that the positive of culling an eleventh of the pine population outweighs the negative of losing a species, right? Intuitively I’d still expect that it would be better for diversity (or at least no worse) to remove some pines of each species instead. So it still seems surprising that eliminating a species could be necessary for maximum diversity.

Posted by: lambda on November 30, 2020 9:20 PM | Permalink | Reply to this

### Re: The Uniform Measure

In the oaks and pines example all that’s going on is just that the positive of culling an eleventh of the pine population outweighs the negative of losing a species, right?

The way I’d say it is that when you start with a population dominated by pine, slightly increasing the non-pine population at the expense of the pines increases diversity.

Intuitively I’d still expect that it would be better for diversity (or at least no worse) to remove some pines of each species instead

That may or may not be the case, depending on the parameters involved. E.g. it’s not coincidence that in our figure, the 11th species is drawn in the middle of the cluster (in species space). The rough idea is that all of its features are displayed more vividly by some other species. If it was at the edge of the cluster, then the maximizing distribution would probably give it positive abundance.

Perhaps it’s helpful to go back to the bit in my post about line segments, just below the blue potato shapes. This isn’t exactly an example as it doesn’t use the diversity measures we’re talking about, but it illustrates the general point:

Consider all probability measures on a line segment of length $1$, and let’s temporarily define the “spread” of a distribution on the line as the expected distance between a pair of points chosen randomly according to that distribution. An easy little calculation shows that with the uniform distribution, the average distance between points is $1/3$. But we can do better: if we put half the mass at one endpoint and half at the other, then the average distance between points is $1/2$.

And as Mark told us in a comment, this probability measure putting half the mass at each endpoint is actually maximally spread out in this sense. Here, the entire interior of the line segment has measure zero — which in ecological language means “eliminated”. Only the two points/species at the ends survive.

Rather fancifully, you could think of the line segment example as portraying a continuum of bird species that only differ in what colour they are, on a scale from black to white. Which distribution of birds maximizes the “spread” in the sense above? It’s the one consisting of 50% black birds, 50% white birds, with every one of the uncountably many grey species eliminated.

This definition of “spread” isn’t one of the diversity measures that we’re using. But maybe it helps to give some intuition as to why, for some diversity measures, a diversity-maximizing distribution can indeed eliminate some species.
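For what it’s worth, both averages in the line segment example are easy to check numerically. Here’s a quick Monte Carlo sketch in Python (my own code, purely for illustration):

```python
import random

random.seed(0)
n = 200_000

# Uniform distribution on the unit segment: average distance between
# two independently chosen points.  Exact value: 1/3.
spread_uniform = sum(abs(random.random() - random.random())
                     for _ in range(n)) / n

# Half the mass at each endpoint: each point is 0 or 1, equally likely.
# Exact value: 1/2.
spread_endpoints = sum(abs(random.randint(0, 1) - random.randint(0, 1))
                       for _ in range(n)) / n

print(spread_uniform, spread_endpoints)
```

The endpoint distribution comes out with a visibly larger spread than the uniform one, as the calculation above says it should.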

Posted by: Tom Leinster on November 30, 2020 10:40 PM | Permalink | Reply to this

### Re: The Uniform Measure

Here’s a simple clarifying example (this time, a genuine example). Consider three species with similarities as shown:

I haven’t shown the similarity between each species and itself, which is taken to be $1$.

Now, suppose we take as our diversity measure the reciprocal of the expected similarity between a random pair of individuals. Formally, writing

$Z = \begin{pmatrix} 1 & 0.9 & 0.8 \\ 0.9 & 1 & 0.9 \\ 0.8 & 0.9 & 1 \end{pmatrix},$

the diversity of a probability distribution

$p = \begin{pmatrix} p_1 \\ p_2 \\ p_3 \end{pmatrix}$

is

$1/p^T Z p.$

This is the case $q = 2$ of the diversity measures described in the post, and also has a longer history: the eminent statistician C. R. Rao investigated this as a diversity measure back in the 1980s. (Well, he used $1 - p^T Z p$ rather than $1/p^T Z p$, but for maximization purposes it doesn’t matter.)

So, to maximize diversity we have to minimize the expected similarity between a random pair of individuals. And if you work it out, you find that there’s exactly one distribution that maximizes diversity: the one that gives 50% abundance to each of the end species, and 0% to the middle species. The middle species is eliminated!
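To see the elimination concretely, here’s a brute-force check in Python (my own sketch, not from the paper): minimize $p^T Z p$ over a fine grid on the probability simplex.

```python
# Similarity matrix Z from the example above.
Z = [[1.0, 0.9, 0.8],
     [0.9, 1.0, 0.9],
     [0.8, 0.9, 1.0]]

def expected_similarity(p):
    """p^T Z p: expected similarity between a random pair of individuals."""
    return sum(p[i] * Z[i][j] * p[j] for i in range(3) for j in range(3))

# Maximizing diversity 1 / (p^T Z p) means minimizing p^T Z p.
# Grid search over the simplex in steps of 0.01.
steps = 100
best = min(
    ((i / steps, j / steps, (steps - i - j) / steps)
     for i in range(steps + 1) for j in range(steps + 1 - i)),
    key=expected_similarity,
)
print(best)  # → (0.5, 0.0, 0.5): the middle species is eliminated
```

Since $Z$ is positive definite, $p^T Z p$ is strictly convex on the simplex, so the grid search finds the unique maximizer: diversity $1/0.9 \approx 1.11$, achieved only by putting 50% on each end species.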

From this you have to conclude that either (i) maximizing diversity can eliminate some species, or (ii) this measure doesn’t deserve to be called “diversity”.

Position (ii) is linguistic rather than mathematical, so it’s hard to answer. But it’s worth noting that almost any diversity measure that takes into account the variation of species (and not just their abundances) has the counterintuitive property we’re discussing.

At the root of all this is that many people take diversity simply to mean the number of species present. With that interpretation, it’s a triviality that maximizing diversity preserves all species. But ecologists and others have used the word “diversity” in a much wider and more flexible sense since at least the 1940s. That gives rise to subtleties like the one we’re discussing now. It may be that “heterogeneity” or “variation” would be more helpful words at times like this.

Posted by: Tom Leinster on November 30, 2020 11:39 PM | Permalink | Reply to this

### Re: The Uniform Measure

This example is helpful, thanks. I had suspected something like this might work (in my head they were horses, donkeys, and mules), but I wasn’t sure whether you could do it with reasonable-looking numbers like this, as opposed to doing something like declaring horses and donkeys to have absolutely nothing in common, or making a species more similar to another species than to itself.

By the way, I hope my previous post didn’t read like I was trying to criticize the notion! I really did just mean I found it surprising, not that I thought that anything was wrong with it.

Posted by: lambda on December 1, 2020 12:22 AM | Permalink | Reply to this

### Re: The Uniform Measure

No worries! Thanks for asking a stimulating question. I didn’t take it as critical at all. If I sounded defensive or something, it was accidental, and probably that well-known effect of communication on the internet.

Posted by: Tom Leinster on December 1, 2020 10:24 AM | Permalink | Reply to this

### Re: The Uniform Measure

I think there’s a middle ground between options (i) and (ii) that’s important to highlight.

Whether there exist probability distributions $p$ that maximize diversity while eliminating species is a property of the matrix $Z$. If you object to the idea that this should be possible, then rather than reject the entire framework, you can just choose to work with a similarity matrix $Z$ that avoids it. As argued in Tom and Christina’s paper, the flexibility to choose $Z$ in many different ways is one of the strengths of this theory.

I don’t recall the details off the top of my head, but I think I remember convincing myself that certain classes of similarity matrices that are well-motivated from the ecological point of view generically satisfy the conditions in section 11 of my paper with Tom to avoid eliminating species. On the other hand, as argued by Tom above, one cannot readily dismiss all similarity matrices $Z$ that lead to eliminating species as ecologically unreasonable.

Posted by: Mark Meckes on December 1, 2020 9:19 AM | Permalink | Reply to this

### Re: The Uniform Measure

Yes, I completely agree.

As for classes of similarity matrix $Z$ that have the no-elimination property, you may have in mind something more subtle than the following, but we pointed out in Section 10 of our paper that it’s possessed by any $Z$ that’s “ultrametric”. Mathematically, this basically means that $Z_{i k} \geq \min\{Z_{i j}, Z_{j k}\}$, which is the case if $Z_{i j} = e^{-d(i, j)}$ for some ultrametric $d$ on the space of species.

Biologically, this condition corresponds to similarity being defined by a tree of a suitable kind, e.g. taxonomic or phylogenetic. For instance, if we define the similarity between two present-day species as the proportion of evolutionary time before the point at which they diverged, then $Z$ is ultrametric, so the diversity-maximizing distribution eliminates no species.
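As a small numerical illustration (my own toy numbers, not from the paper): take an ultrametric $Z$ for three species in which the first two diverged recently and the third split off much earlier, and maximize the $q = 2$ diversity $1/p^T Z p$ by brute force. No species is eliminated.

```python
# Toy ultrametric similarity matrix: species 1 and 2 diverged recently,
# species 3 split off much earlier.  Values are invented for illustration.
Z = [[1.0, 0.6, 0.2],
     [0.6, 1.0, 0.2],
     [0.2, 0.2, 1.0]]

# Check the ultrametric inequality Z[i][k] >= min(Z[i][j], Z[j][k]).
assert all(Z[i][k] >= min(Z[i][j], Z[j][k])
           for i in range(3) for j in range(3) for k in range(3))

def expected_similarity(p):
    """p^T Z p: expected similarity between a random pair of individuals."""
    return sum(p[i] * Z[i][j] * p[j] for i in range(3) for j in range(3))

# Grid search over the simplex; maximizing 1 / (p^T Z p) = minimizing p^T Z p.
steps = 200
best = min(
    ((i / steps, j / steps, (steps - i - j) / steps)
     for i in range(steps + 1) for j in range(steps + 1 - i)),
    key=expected_similarity,
)
print(best)  # all three abundances are positive: no species is eliminated
```

The maximizing distribution here is close to $(2/7, 2/7, 3/7)$: the two similar species split a share between them, the distinctive one gets more, and every species survives.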

Posted by: Tom Leinster on December 1, 2020 10:34 AM | Permalink | Reply to this

### Re: The Uniform Measure

Yes, ultrametric similarity matrices (and their connection to trees) are exactly what I was thinking of. My comment about “generic” such matrices was because I wasn’t remembering just how strong that property is (I didn’t remember that they are strictly positive definite with strictly positive weightings).

Posted by: Mark Meckes on December 2, 2020 9:26 AM | Permalink | Reply to this

### Re: The Uniform Measure

“I haven’t shown the similarity between each species and itself, which is taken to be 1.” Why? There could be biodiversity within a species, and that could be represented as a similarity of the species with itself that is less than 1.

Posted by: Kjetil B Halvorsen on December 5, 2020 6:53 PM | Permalink | Reply to this

### Re: The Uniform Measure

Our general framework does indeed allow the similarity between a species and itself to be less than $1$.

But in this particular hypothetical example that I’m using to illustrate a point, I’m taking it to be $1$.

Posted by: Tom Leinster on December 6, 2020 1:35 AM | Permalink | Reply to this
