Measuring Diversity
Posted by Tom Leinster
Christina Cobbold and I wrote a paper on measuring biological diversity:
Tom Leinster and Christina A. Cobbold,
Measuring diversity: the importance of species similarity.
Ecology, in press (doi:10.1890/10-2402.1).
As the name of the journal suggests, our paper was written for ecologists — but mathematicians should find it pretty accessible too.
While I’m at it, I’ll mention that I’m coordinating a five-week research programme on The Mathematics of Biodiversity at the Centre de Recerca Matemàtica, Barcelona, next summer. It includes a one-week exploratory conference (2–6 July 2012), to which everyone interested is warmly welcome.
In a moment, I’ll start talking about organisms and species. But don’t be fooled: mathematically, none of this is intrinsically about biology. That’s why this post is called “Measuring diversity”, not “Measuring biological diversity”. You could apply it in many other ways, or not apply it at all, as you’ll see.
It’s an example of what Jordan Ellenberg has amusingly called applied pure math. I think that’s a joke in slightly poor taste, because I don’t want to surrender the term “applied math” to those who basically use it to mean “applied differential equations”. Nevertheless, I suspect we’re on the same side.
Long-time patrons of the Café may remember a pair of posts in 2008 on entropy, diversity and cardinality. But those were long posts, a long time ago, and there’s a lot about them that I’d change now. So I’ll start afresh.
Imagine a ‘community’ of organisms — the fish in a lake, the fungi in a forest, or the bacteria on your skin. We divide them into groups, conventionally called species, though they needn’t be species in the ordinary sense. (The division of organisms into species is somewhat arbitrary, which is a problem, though it’s less of a problem with the approach presented here than with many previous approaches.)
We then record two things about the community. First:
Relative abundances The relative frequencies, or abundances, of the $S$ species form a probability distribution $p = (p_1, \ldots, p_S)$ on $\{1, \ldots, S\}$. Here $p_i$ is the proportion of the total population belonging to species $i$, where ‘proportion’ is measured in any way you think sensible (number of individuals, total mass, etc).
Note that we only record relative abundances, not absolute abundances. As it’s usually used, the word diversity denotes an intensive quantity. If nine-tenths of a forest is destroyed, it might be a terrible thing, but on the (unrealistic) assumption that all the flora and fauna in the forest are distributed homogeneously, it doesn’t actually cause a decrease in biodiversity.
The second thing we record is:
Similarities The similarity between each pair of species is measured by a real number between $0$ and $1$, with $0$ denoting total dissimilarity and $1$ denoting identical species. Writing the similarity between the $i$th and $j$th species as $Z_{ij}$, this gives an $S \times S$ matrix $Z = (Z_{ij})$ with entries in $[0, 1]$. Our only assumption on $Z$ is that its diagonal entries are all $1$: every species is identical to itself.
There are many approaches to measuring inter-species similarity, of which probably the most familiar is genetic, as in ‘you share 98% of your DNA with a chimpanzee’. Different measures of similarity will produce different measures of diversity.
Sometimes one has, instead of a measure of inter-species similarity (measured on a scale of 0 to 1), a measure of inter-species distance (measured on a scale of 0 to $\infty$). Distances $d_{ij}$ can be converted into similarities by the transformation $Z_{ij} = e^{-d_{ij}}$, or more generally by $Z_{ij} = e^{-k d_{ij}}$ for some positive scale factor $k$. That’s not the only transformation you can use, but it has some good mathematical properties.
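As a rough illustration of that conversion, here is a minimal Python sketch. The function name, the scale-factor argument `k` and the three-species distance matrix are all made up for the example; none of them comes from the paper.

```python
import math

def similarity_from_distance(d, k=1.0):
    """Turn a matrix of distances in [0, inf) into similarities in (0, 1]
    using the transformation exp(-k * distance)."""
    return [[math.exp(-k * d_ij) for d_ij in row] for row in d]

# Hypothetical distances between three species (illustrative values only).
d = [[0.0, 0.5, 2.0],
     [0.5, 0.0, 2.0],
     [2.0, 2.0, 0.0]]

Z = similarity_from_distance(d)
for row in Z:
    print([round(z, 3) for z in row])
```

Distance 0 goes to similarity 1, so the diagonal condition on the similarity matrix holds automatically, and larger values of `k` make all distinct species look less alike.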
What we have to do now is take this data and turn it into a single number, measuring the diversity of the community. Actually, it’s not going to be quite as simple as that… but let’s take it one step at a time.
The similarities form an $S \times S$ matrix $Z$, and the relative abundances can be regarded as forming an $S$-dimensional column vector $p$. So, we get an $S$-dimensional column vector $Zp$, whose $i$th entry is
$$(Zp)_i = \sum_j Z_{ij} p_j.$$
This is the expected similarity between an individual of the $i$th species and an individual chosen at random. It therefore measures the ‘ordinariness’, or lack of distinctiveness, of that individual.
The average ordinariness of an individual in the community is, then,
$$\sum_i p_i (Zp)_i.$$
This is greatest if the community is concentrated into a few very similar species. Economists have used the word concentration for quantities like this. Now, we’re after a measure of diversity, which should be inversely related to concentration. So we could define the diversity of the community as the reciprocal of the concentration:
$$1 \Big/ \sum_i p_i (Zp)_i.$$
This turns out to be a good measure of diversity. But it’s not the only good one.
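To make the recipe concrete, here is a small Python sketch of the three steps just described: ordinariness of each species, average ordinariness (concentration), and its reciprocal. The function names and the two-species examples are illustrative choices, not anything from the paper.

```python
def ordinariness(Z, p):
    """Entry i is the expected similarity between species i and a
    randomly chosen individual: the i-th entry of the product Z p."""
    n = len(p)
    return [sum(Z[i][j] * p[j] for j in range(n)) for i in range(n)]

def quadratic_diversity(Z, p):
    """Reciprocal of the average ordinariness (the 'concentration')."""
    zp = ordinariness(Z, p)
    concentration = sum(p_i * zp_i for p_i, zp_i in zip(p, zp))
    return 1.0 / concentration

p = [0.5, 0.5]                        # two equally abundant species
Z_dissimilar = [[1.0, 0.0], [0.0, 1.0]]  # nothing in common
Z_identical = [[1.0, 1.0], [1.0, 1.0]]   # indistinguishable species

print(quadratic_diversity(Z_dissimilar, p))  # 2.0
print(quadratic_diversity(Z_identical, p))   # 1.0
```

Two totally dissimilar species in equal proportions count as 2, while two species that are identical in every respect count as 1, which is what one would hope for.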
Why not? I’ll give two explanations: one mathematical, one ecological.
Mathematically, the point is that when I wrote down the formula for the ‘average’ ordinariness, I neglected the fact that there are many good notions of average. In particular, there are the power means. For real $t \neq 0$, the power mean of numbers $x_1, \ldots, x_S$, weighted by a probability distribution $p$, is got by transforming each $x_i$ into $x_i^t$, then forming their ordinary mean weighted by the $p_i$s, then applying the inverse transformation. In other words, it’s
$$\Bigl( \sum_i p_i x_i^t \Bigr)^{1/t}.$$
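Here is the weighted power mean as a short Python sketch, following the three steps in that description. The function name and the sample numbers are only for illustration, and the exponent is assumed to be nonzero.

```python
def power_mean(xs, ps, t):
    """Weighted power mean of order t (t must be nonzero):
    raise each x to the power t, average with weights ps, then undo the power."""
    if t == 0:
        raise ValueError("this sketch only covers nonzero t")
    return sum(p * x ** t for x, p in zip(xs, ps)) ** (1.0 / t)

xs = [1.0, 4.0]
ps = [0.5, 0.5]

print(power_mean(xs, ps, 1.0))   # arithmetic mean: 2.5
print(power_mean(xs, ps, -1.0))  # harmonic mean, about 1.6
print(power_mean(xs, ps, 2.0))   # quadratic mean: sqrt(8.5)
```

Larger exponents pull the mean towards the largest of the numbers, and smaller exponents towards the smallest.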
We’ll apply this with $x_i = (Zp)_i$. For reasons I won’t explain, I’ll shift the indexing by putting $q = t + 1$, and I’ll restrict to $q \geq 0$. So, the average ordinariness ‘of order $q$’ is
$$\Bigl( \sum_i p_i (Zp)_i^{q - 1} \Bigr)^{1/(q - 1)}.$$
This is a measure of concentration. Its reciprocal is
$${}^qD^Z(p) = \Bigl( \sum_i p_i (Zp)_i^{q - 1} \Bigr)^{1/(1 - q)}.$$
And that, by definition, is the diversity of order $q$ of the community. The diversity measure we arrived at above was the case $q = 2$, and is called the quadratic diversity, since it’s the reciprocal of a quadratic form.
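The formula for ${}^qD^Z(p)$ doesn’t make sense for $q = 1$ or $q = \infty$, but you can easily make sense of it by taking limits. Doing this leads to the definitions
$${}^1D^Z(p) = \prod_i (Zp)_i^{-p_i}$$
and
$${}^\infty D^Z(p) = 1 \Big/ \max_i (Zp)_i.$$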
Technical note: in order for everything to be well-defined, you have to take the sums and max to be over only those values of $i$ for which $p_i > 0$ (that is, over only the species that are actually present).
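Putting the general formula, the two limiting cases and the technical note together gives something like the following Python sketch. The function name and test data are invented for the example, and the code is a plain transcription of the definitions rather than a tuned implementation.

```python
import math

def diversity(Z, p, q):
    """Diversity of order q (0 <= q <= infinity) for similarity matrix Z
    and relative abundances p, summing only over species that are present."""
    present = [i for i, p_i in enumerate(p) if p_i > 0]
    # Ordinariness of each present species; positive because Z[i][i] == 1.
    zp = {i: sum(Z[i][j] * p[j] for j in present) for i in present}
    if q == 1:
        # Limit as q -> 1: product of zp_i ** (-p_i), computed via logs.
        return math.exp(-sum(p[i] * math.log(zp[i]) for i in present))
    if q == math.inf:
        # Limit as q -> infinity: reciprocal of the largest ordinariness.
        return 1.0 / max(zp.values())
    return sum(p[i] * zp[i] ** (q - 1) for i in present) ** (1.0 / (1.0 - q))

identity3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
p = [0.7, 0.2, 0.1]

for q in [0, 0.5, 1, 2, math.inf]:
    print(q, diversity(identity3, p, q))
```

A quick sanity check on the limit: evaluating at a value of `q` very close to 1 should give nearly the same answer as the special case `q == 1`.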
So, we’ve got not just one measure of diversity, but a one-parameter family of them: ${}^qD^Z(p)$ for $0 \leq q \leq \infty$.
Ecologically, this spectrum of diversity measures corresponds to a spectrum of viewpoints on what diversity is. Consider two bird communities. The first looks like this:
It contains four species, one of which makes up most of the population, and three of which are quite rare. The second community looks like this:
It has only three species, but they’re evenly balanced.
Now, which community is more diverse? It’s a matter of opinion. Or, if you like, it’s a matter of how you interpret the word ‘diverse’. Usually in the mainstream press, and often in scholarly articles too, ‘biodiversity’ is used as a synonym for ‘number of species present’. On this count, the first community is more diverse. But if you’re mostly concerned with the functioning of the whole community, the role of rare species might not be particularly important: maybe your primary concern is that no species is too dominant, and on that score, the second community wins.
Varying the parameter $q$ corresponds to varying your viewpoint. Specifically, $q$ controls how little emphasis you place on rare species. So the graphs of ${}^qD^Z(p)$ against $q$, for the two communities, might look like this:
The purple curve represents the first community, and the blue curve represents the second. (The exact shapes of the graphs will depend on the similarity matrix $Z$.) For low values of $q$ (emphasizing rare species), the first community looks more diverse than the second. For high values of $q$ (emphasizing common species), it’s the opposite.
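The crossing of the two curves can be reproduced numerically. In the Python sketch below, the abundances for the two bird communities are invented (one dominant species plus three rare ones, versus three equally common ones), and for simplicity every pair of distinct species is given similarity 0. The function only handles finite orders other than 1.

```python
def diversity(Z, p, q):
    """Diversity of order q for finite q != 1, over present species only."""
    present = [i for i in range(len(p)) if p[i] > 0]
    total = 0.0
    for i in present:
        zp_i = sum(Z[i][j] * p[j] for j in present)  # ordinariness of species i
        total += p[i] * zp_i ** (q - 1)
    return total ** (1.0 / (1.0 - q))

def identity(n):
    """Similarity matrix in which distinct species have nothing in common."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

first = [0.7, 0.1, 0.1, 0.1]     # hypothetical: one dominant, three rare
second = [1 / 3, 1 / 3, 1 / 3]   # hypothetical: three balanced species

qs = [0, 0.5, 2, 4]
profile_first = [diversity(identity(4), first, q) for q in qs]
profile_second = [diversity(identity(3), second, q) for q in qs]

for q, a, b in zip(qs, profile_first, profile_second):
    print(f"q={q}: first={a:.3f}, second={b:.3f}")
```

With these made-up numbers the first community comes out ahead at the two lowest orders and behind at the two highest, so the profiles cross somewhere in between.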
It turns out that many diversity measures previously used in ecology are special cases of the ones given above. Also, these measures have excellent mathematical properties. Lots are listed in our paper. Here I’ll give just two.
Naive model There’s a ‘naive’ model of an ecological community in which distinct species are always assumed to have nothing in common. This is a terribly crude assumption, and makes a community consisting of two species of slug as diverse as a community consisting of a slug and a giraffe.
Nevertheless, this is the model used by most diversity measures to date. It corresponds to taking $Z = I$, the identity matrix. When you take $Z = I$ using our measures, the formula for diversity is:
$${}^qD^I(p) = \Bigl( \sum_i p_i^q \Bigr)^{1/(1 - q)}.$$
These are known in ecology as the Hill numbers, and in mathematics as the exponentials of the Rényi entropies. A lot is known about them. Even more is known about the case $q = 1$, which is the exponential of Shannon entropy.
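The relationship between Hill numbers and Rényi entropies is easy to check numerically. The Python sketch below uses the standard definition of the Rényi entropy of order $q$, namely $\frac{1}{1-q} \log \sum_i p_i^q$, with Shannon entropy as the case $q = 1$. The function names and the sample distribution are illustrative, and `math.prod` needs Python 3.8 or later.

```python
import math

def renyi_entropy(p, q):
    """Renyi entropy of order q (natural log); Shannon entropy when q == 1."""
    p = [x for x in p if x > 0]
    if q == 1:
        return -sum(x * math.log(x) for x in p)
    return math.log(sum(x ** q for x in p)) / (1.0 - q)

def hill_number(p, q):
    """Naive-model diversity of order q: distinct species share nothing."""
    p = [x for x in p if x > 0]
    if q == 1:
        return math.prod(x ** (-x) for x in p)
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))

p = [0.5, 0.3, 0.2]
for q in [0, 0.5, 1, 2, 3]:
    print(q, hill_number(p, q), math.exp(renyi_entropy(p, q)))
```

The two printed columns should agree up to rounding error for every order tried.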
Effective number Our measures are effective numbers, which means that a community of $S$ equally abundant, totally dissimilar species is assigned a diversity of $S$. In symbols,
$${}^qD^I(1/S, \ldots, 1/S) = S$$
for all $q$.
So if someone tells you ‘this community has diversity 26.2’, and they’re using an effective number measure, that means it’s slightly more diverse than a community of 26 equally abundant, totally dissimilar species. If they come to you a year later saying that its diversity has dropped to 13.1, that means, in a directly comprehensible way, that its diversity has halved. As Mark Hill (of the Hill numbers) put it, effective numbers ‘enable us to speak naturally’.
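For anyone who wants to see why the effective number property holds, here is the short calculation, written for $S$ equally abundant species with nothing in common (so the similarity matrix is the identity). It is just the definitions applied to the uniform distribution.

```latex
% Finite q other than 1: every abundance is 1/S, so the sum has S equal terms.
{}^qD^I(1/S, \ldots, 1/S)
  = \Bigl( \sum_{i=1}^{S} (1/S)^q \Bigr)^{1/(1-q)}
  = \bigl( S^{1-q} \bigr)^{1/(1-q)}
  = S.

% q = 1: a product of S equal factors.
{}^1D^I(1/S, \ldots, 1/S)
  = \prod_{i=1}^{S} (1/S)^{-1/S}
  = \bigl( S^{1/S} \bigr)^{S}
  = S.

% q = \infty: the reciprocal of the largest abundance.
{}^\infty D^I(1/S, \ldots, 1/S)
  = 1 \Big/ \max_i \, (1/S)
  = S.
```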
As far as I’m concerned, this work links together many of my interests involving measures of size. Apart from diversity being an important mathematical concept in itself, it’s related to entropy, power means, and magnitude of metric spaces.
As far as biologists are concerned, there seem to be two main points of interest.
One is the even-handed approach to the spectrum of possible viewpoints: treating all values of $q$ democratically, rather than choosing one and claiming that it’s the ‘best’. This leads to the graphical device of drawing graphs like the one above, in order to compare and contrast communities. I’m surprised that this has generated so much enthusiasm, because these graphs (‘diversity profiles’) have been advocated for a long time, by many different authors. But they also seem to be new to many people.
The other — and the reason behind the title of our paper — is that we’ve built inter-species similarity into the model. Ours isn’t the first diversity measure to do this, but it seems to be the most general. I’d like to explain the practical impact that this can have, but I’m running out of energy now, so I’ll leave that for another day.
Update (9 November 2011) John kindly let me write a version of this post for Azimuth. It’s actually quite different from what I’ve written here. The major thing that it has and this post doesn’t is an illustration of how taking species similarity into account can change your judgement on which of two communities is the more diverse.
Re: Measuring Diversity
Thanks, that’s a very clear post.
When writing
you link to the Wikipedia page on the intensive/extensive property distinction. We really should have such a page at nLab, especially as it is a favourite topic of Lawvere’s. I’ve been trying to grasp it through nForum discussions, e.g., here, and longer ago at the Café.
Does anyone here have a good category theoretic handle on it?