Category-Theoretic Characterizations of Entropy II
Posted by John Baez
We’re having a lively conversation about different notions related to entropy, and how to understand these using category theory. I don’t want to stop this conversation, but I have something long to say, which seems like a good excuse for a new blog post.
A while back, Tom Leinster took Fadeev’s characterization of Shannon entropy and gave it a slick formulation in terms of operads. I’ve been trying to understand this slick formulation a bit better, and I’ve made a little progress.
Tom’s idea revolves around a special law obeyed by the Shannon entropy. I’ll remind you of that law, or tell you about it for the first time in case you didn’t catch it yet. I’ll explain it in a very lowbrow way, as opposed to Tom’s beautiful highbrow way. Then, I’ll derive it from a simpler law obeyed by something called the ‘partition function’.
I like this, because I’ve got a bit of physics in my blood, and physicists know that the partition function is a very important concept. But I hope you’ll see: you don’t need to know physics to understand or enjoy this stuff!
A math puzzle
We begin with a sadly familiar problem:
Suppose you live in a town with a limited number of tolerable restaurants. Every Friday you go out for dinner. You randomly choose a restaurant according to a certain probability distribution $p$. If you go to the $i$th restaurant, you then choose a dish from the menu according to some probability distribution $q_i$. How surprising will your choice be, on average?
Note: I’m only talking about how surprised you are by the choice itself—not about any additional surprise due to the salad being exceptionally tasty, or the cook having dropped her wedding ring into your soup, or something like that. So if there’s only one restaurant in town and they only serve one dish, you won’t be at all surprised by your choice. But if there were thousands of restaurants serving hundreds of dishes each, there’d be room for a lot of surprise.
This still sounds like an impossibly vague puzzle. So, I should admit that while ‘surprise’ sounds psychological and very complicated, I’m really just using it as a cute synonym for Shannon entropy. Shannon entropy can be thought of as a measure of ‘expected surprise’.
Now, ‘expected surprise’ sounds like an oxymoron: how can something expected be a surprise? But in this context ‘expected’ means ‘average’. The idea is that your ‘surprise’ at an outcome with probability $p$ is defined to be

$$-\ln p \qquad (1)$$

Then, if there are a bunch of possible outcomes distributed according to some probability distribution $p$, the ‘expected’ surprise or Shannon entropy is:

$$S(p) = -\sum_{i=1}^n p_i \ln p_i \qquad (2)$$

where $p_i$ is the probability of the $i$th outcome. This is a weighted average where each surprise $-\ln p_i$ is weighted by its probability $p_i$ of happening.
So, we really do have a well-defined math puzzle, and it’s not even very hard.
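If you like seeing this with actual numbers, here is a tiny Python sketch of ‘expected surprise’, i.e. equation (2). The function names and toy distributions are mine, purely for illustration:

```python
import math

def surprise(p):
    """Surprise of an outcome with probability p, namely -ln(p)."""
    return -math.log(p)

def shannon_entropy(probs):
    """Expected surprise: sum of p_i * (-ln p_i), skipping zero-probability outcomes."""
    return sum(p * surprise(p) for p in probs if p > 0)

# One restaurant serving one dish: no surprise at all.
print(shannon_entropy([1.0]))        # 0.0
# Four equally likely dishes: expected surprise is ln(4).
print(shannon_entropy([0.25] * 4))   # 1.386... = ln 4
```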
Glomming together probabilities
But the interesting thing about this problem is that it involves an operation which I’ll call ‘glomming together’ probability distributions. First you choose a restaurant according to some probability distribution $p$ on the set of restaurants. Then you choose a meal according to some probability distribution $q_i$. If there are $n$ restaurants in town, you wind up eating meals in a way described by some probability distribution we’ll call $p \circ (q_1, \dots, q_n)$.
A bit more formally:
Suppose $p$ is a probability distribution on the set $\{1, \dots, n\}$ and $q_i$ are probability distributions on finite sets $X_i$, where $1 \le i \le n$. Suppose the probability distribution $p$ assigns a probability $p_i$ to each element $i \in \{1, \dots, n\}$, and suppose the distribution $q_i$ assigns a probability $q_{ij}$ to each element $j \in X_i$.
Then we can glom together the probability distributions $q_i$ using $p$ and get a new probability distribution $p \circ (q_1, \dots, q_n)$ on the disjoint union of all the sets $X_i$. The resulting probability for each element $j \in X_i$ in the disjoint union is just $p_i q_{ij}$.
These probabilities sum to 1 as they should:

$$\sum_{i=1}^n \sum_{j \in X_i} p_i q_{ij} = \sum_{i=1}^n p_i \sum_{j \in X_i} q_{ij} = \sum_{i=1}^n p_i = 1$$

Note: what $j$ ranges over depends on $i$! But it’s all kosher.
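Continuing the little Python sketch from above, glomming is a one-liner (the names and toy numbers are mine):

```python
def glom_probs(p, qs):
    """Glom the distributions qs = [q_1, ..., q_n] together using p.

    Returns the distribution on the disjoint union of the underlying sets,
    listing the probabilities p_i * q_ij, grouped restaurant by restaurant.
    """
    assert len(p) == len(qs)
    return [p_i * q_ij for p_i, q in zip(p, qs) for q_ij in q]

p  = [0.5, 0.5]               # two restaurants
qs = [[0.2, 0.8], [1.0]]      # menus of sizes 2 and 1
glommed = glom_probs(p, qs)
print(glommed)                # [0.1, 0.4, 0.5]
print(sum(glommed))           # 1.0, as it should be
```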
Glomming together numbers
There’s something even simpler than glomming together probability measures: we can glom together numbers!
Suppose we have a probability measure $p$ on the set $\{1, \dots, n\}$ and for each element $i$ we have a number $x_i$. Then we can define a new number $p \circ (x_1, \dots, x_n)$ by

$$p \circ (x_1, \dots, x_n) = \sum_{i=1}^n p_i x_i$$

This is just the ‘weighted average’ of the numbers $x_i$. We have already seen a weighted average in the definition of Shannon entropy, equation (2).
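In the same toy Python code, glomming numbers is just a weighted average:

```python
def glom_numbers(p, xs):
    """Weighted average of the numbers xs using the distribution p."""
    return sum(p_i * x_i for p_i, x_i in zip(p, xs))

print(glom_numbers([0.5, 0.5], [10.0, 20.0]))   # 15.0
```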
Glomming together entropies
Now we’re ready to answer the math puzzle:
$$S(p \circ (q_1, \dots, q_n)) = S(p) + p \circ (S(q_1), \dots, S(q_n))$$

On the left we’re glomming together a bunch of probability distributions using $p$ and then taking the entropy. On the right we’re taking the entropies of those probability distributions and then glomming the resulting numbers together using $p$. But there’s also an extra term: the entropy of $p$!
In other words: your expected surprise won’t be just the weighted average of your expected surprises at the various different restaurants, namely $p \circ (S(q_1), \dots, S(q_n)) = \sum_{i=1}^n p_i S(q_i)$. It’ll be that plus the expected surprise of your choice of restaurant, namely $S(p)$.
You might not have expected such a simple formula. But it’s easy to prove:

$$\begin{aligned} S(p \circ (q_1, \dots, q_n)) &= -\sum_{i=1}^n \sum_{j \in X_i} p_i q_{ij} \ln (p_i q_{ij}) \\ &= -\sum_{i=1}^n \sum_{j \in X_i} p_i q_{ij} \big( \ln p_i + \ln q_{ij} \big) \\ &= -\sum_{i=1}^n p_i \ln p_i \; - \; \sum_{i=1}^n p_i \sum_{j \in X_i} q_{ij} \ln q_{ij} \\ &= S(p) + p \circ (S(q_1), \dots, S(q_n)) \end{aligned}$$
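Here is a quick numerical check of this law, reusing the toy Python functions from above:

```python
p  = [0.3, 0.7]
qs = [[0.5, 0.5], [0.1, 0.2, 0.7]]

lhs = shannon_entropy(glom_probs(p, qs))
rhs = shannon_entropy(p) + glom_numbers(p, [shannon_entropy(q) for q in qs])
print(lhs, rhs)                    # the two sides agree
assert abs(lhs - rhs) < 1e-12
```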
Fadeev’s theorem
The formula for glomming together entropies is more important than you might think: it comes very close to characterizing Shannon entropy! In 1957, D. K. Fadeev published a result that we can restate as follows:
Theorem (Fadeev): Suppose $F$ is a function sending probability measures on finite sets to real numbers. Suppose that:
- $F$ is invariant under bijections.
- $F$ is continuous.
- $F(p \circ (q_1, \dots, q_n)) = F(p) + p \circ (F(q_1), \dots, F(q_n))$.
Then $F$ is a real multiple of Shannon entropy.
In item 1 we’re using the fact that a bijection $f \colon X \to Y$ between finite sets together with a probability distribution $p$ on $X$ gives a probability distribution $f_* p$ on $Y$; we want these to have the same entropy. In item 2 we’re using the standard topology on $\mathbb{R}^n$ to put a topology on the set of probability distributions on any $n$-element set. For a proof of this theorem, see the beginning of Rényi’s 1961 paper.
What’s going on?
While this equation is cute:

$$S(p \circ (q_1, \dots, q_n)) = S(p) + p \circ (S(q_1), \dots, S(q_n))$$

it’s a bit tricky to find its proper place in the world of abstract algebra. A probability distribution $p$ can act either on a list of probability distributions or a list of numbers. Shannon entropy gives a map $S$ from probability distributions to numbers. So, if you’re algebraically inclined, you would want $S$ to be a ‘homomorphism’, obeying a law like this:

$$S(p \circ (q_1, \dots, q_n)) = p \circ (S(q_1), \dots, S(q_n))$$
We see laws of this sort all over math. But the true law has an extra term. What’s going on?
The ‘glomming’ business can be formalized using operads, and Tom’s answer is roughly: Shannon entropy is not a ‘homomorphism’ of operad algebras, but only a ‘lax homomorphism’. For an explanation of what this means, go here.
Right now I want to propose an alternative answer. I hope we can combine it with Tom’s answer and reach a deeper understanding of what’s going on.
For starters, consider another law that has an ‘extra term’:

$$\frac{d}{d x}\Big|_{x = a} \big( f(x)\, g(x) \big) = f(a)\, g'(a) + f'(a)\, g(a)$$
In math jargon, the product rule says that taking the derivative of a function at a point is not a homomorphism from smooth functions to real numbers: it’s a ‘derivation’. We can get a derivation on an algebra by differentiating a family of algebra homomorphisms. Similarly, we can get the funny rule obeyed by entropy by differentiating a family of operad algebra homomorphisms.
Let’s see how.
Glomming together partition functions
There’s something more fundamental than the Shannon entropy of a probability distribution: namely, its partition function. At least that’s how physicists think. Let me explain.
Suppose $p$ is a probability measure on a finite set $X$. Let $p_i$ be the probability of the element $i \in X$. We can always write

$$p_i = e^{-E_i}$$

where

$$E_i = -\ln p_i$$

is a function called the energy or Hamiltonian. The idea is that the probability of a system being in some state decreases exponentially with the energy of that state; we allow the energy to be $+\infty$ to account for zero probabilities.
But physicists always go further and introduce a parameter $\beta$ which stands for the reciprocal of temperature, to capture the fact that states of high energy are even less likely to be occupied when it’s cold. Now we make $p_i$ into a function of $\beta$:

$$p_i(\beta) = e^{-\beta E_i}$$

or if you prefer, simply

$$p_i(\beta) = p_i^\beta$$

When $\beta = 1$, these numbers are just the probabilities $p_i$. But when $\beta \ne 1$, there’s no reason for them to sum to 1. To get a probability measure, we’d have to divide by a fudge factor called the partition function:

$$Z(p)(\beta) = \sum_{i \in X} e^{-\beta E_i} = \sum_{i \in X} p_i^\beta$$

But I won’t do that just now: I’ll let $p(\beta)$ be the measure on the set $X$ that assigns the measure $p_i^\beta$ to the point $i$.
Now everything is a function of $\beta$! But everything reduces to something familiar at $\beta = 1$, so we can stretch our old notation without breaking it. We now have functions $p_i$ sending numbers to numbers, and a function $p$ sending numbers to measures on $X$. These reduce to our original $p_i$ and $p$ at $\beta = 1$. The partition function $Z(p)$ is also a function of $\beta$, and this equals 1 at $\beta = 1$.
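Continuing the toy Python code from before, the partition function is just one line (the function name is mine):

```python
def partition_function(p, beta):
    """Z(p)(beta) = sum_i p_i^beta = sum_i exp(-beta * E_i), with E_i = -ln p_i."""
    return sum(p_i ** beta for p_i in p if p_i > 0)

p = [0.1, 0.2, 0.7]
print(partition_function(p, 1.0))   # ~1.0: at beta = 1 we just get the total probability
print(partition_function(p, 2.0))   # about 0.54: high-energy states are suppressed in the cold
```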
But here’s the important thing: the partition function obeys a simpler law than entropy when we glom probability measures together! Suppose $p$ is a probability distribution on the set $\{1, \dots, n\}$ and $q_i$ are probability distributions on finite sets $X_i$, where $1 \le i \le n$. Then

$$Z(p \circ (q_1, \dots, q_n)) = p \circ (Z(q_1), \dots, Z(q_n)) \qquad (3)$$

So, in fancy jargon, $Z$ is an operad algebra homomorphism!
But I need to explain what this equation means. First of all, everything is a function of $\beta$. Second, while $p$ and the $q_i$ are only probability measures at $\beta = 1$, they’re perfectly fine measures at other values of $\beta$, so all our ‘glomming’ formulas still make sense.
Let’s check to make sure we know what’s going on. An expression like $p \circ (q_1, \dots, q_n)$ could be ambiguous. We have a recipe for taking a probability measure on a finite set and extending it to a function from numbers to measures on that set. So, we can extend $p$ and the $q_i$ this way and then ‘glom’ them to get a measure-valued function of $\beta$. On the other hand, we can glom them first and then extend the result. Luckily, the results agree.
Let’s see why. Suppose $p$ assigns the point $i$ the probability $p_i$. When we extend this to a function of $\beta$ we get $p_i^\beta$. Similarly, suppose $q_i$ assigns the point $j \in X_i$ the probability $q_{ij}$. When we extend this to a function of $\beta$ we get $q_{ij}^\beta$. When we glom these, the measure of the point $j \in X_i$ will be this function of $\beta$:

$$p_i^\beta \, q_{ij}^\beta$$

But this equals

$$(p_i q_{ij})^\beta$$

which is the result of glomming and then extending.
The right-hand side of equation (3) is also unambiguous… but I’m wasting time: if I were a physicist I’d be done proving this equation by now, instead of stewing around explaining what it means. It’s incredibly easy to prove. From what I’ve said,

$$Z(p \circ (q_1, \dots, q_n))(\beta) = \sum_{i=1}^n \sum_{j \in X_i} (p_i q_{ij})^\beta = \sum_{i=1}^n p_i^\beta \sum_{j \in X_i} q_{ij}^\beta = \sum_{i=1}^n p_i^\beta \, Z(q_i)(\beta) = \big( p \circ (Z(q_1), \dots, Z(q_n)) \big)(\beta)$$
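Here is a numerical check of equation (3) with the toy Python functions above; note that on the right-hand side the glomming uses the weights $p_i^\beta$, not $p_i$:

```python
beta = 2.5
p  = [0.3, 0.7]
qs = [[0.5, 0.5], [0.1, 0.2, 0.7]]

lhs = partition_function(glom_probs(p, qs), beta)
rhs = sum((p_i ** beta) * partition_function(q, beta) for p_i, q in zip(p, qs))
print(lhs, rhs)                    # the two sides agree
assert abs(lhs - rhs) < 1e-12
```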
From partition function to entropy
How is the Shannon entropy of a probability distribution related to its partition function? Simple:

$$S(p) = -\frac{d}{d \beta} Z(p)(\beta) \Big|_{\beta = 1} \qquad (4)$$

Here I must emphasize that I’m only talking about the Shannon entropy of the original probability distribution, ‘at $\beta = 1$’. Physicists extend the entropy to a function of $\beta$, along with everything else. That would be very good to do, but it goes beyond our goal now, and it would make the formula relating $S$ and $Z$ a bit more complicated, so let’s not.
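If you want to see equation (4) in numbers before seeing the proof, a crude finite-difference approximation of the derivative does the job (the step size and names are my own choices, continuing the toy Python code):

```python
def entropy_from_Z(p, h=1e-6):
    """Approximate -d/dbeta Z(p)(beta) at beta = 1 by a central difference."""
    return -(partition_function(p, 1 + h) - partition_function(p, 1 - h)) / (2 * h)

p = [0.1, 0.2, 0.7]
print(entropy_from_Z(p))      # approximately 0.8018...
print(shannon_entropy(p))     # 0.8018..., matching equation (4) up to finite-difference error
```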
Our goal now is simply to get the rule for glomming entropies by differentiating the rule for glomming partition functions. So, let’s do that using equation (4). Later I’ll show you why (4) is true.
We start with the rule for glomming partition functions:

$$Z(p \circ (q_1, \dots, q_n)) = p \circ (Z(q_1), \dots, Z(q_n))$$
We differentiate with respect to $\beta$ and use the product rule, taking advantage of the fact that $p \circ (Z(q_1), \dots, Z(q_n))$ is linear in $p$ but also linear in each of the functions $Z(q_i)$:

$$\frac{d}{d \beta} Z(p \circ (q_1, \dots, q_n)) = \frac{d p}{d \beta} \circ (Z(q_1), \dots, Z(q_n)) \; + \; p \circ \Big( \frac{d}{d \beta} Z(q_1), \dots, \frac{d}{d \beta} Z(q_n) \Big)$$
Now set $\beta = 1$. We can use equation (4) and the fact that all our partition functions equal 1 at $\beta = 1$:

$$-S(p \circ (q_1, \dots, q_n)) = \frac{d p}{d \beta}\Big|_{\beta = 1} \circ (1, \dots, 1) \; - \; p \circ (S(q_1), \dots, S(q_n))$$
Hey, it looks like we’re almost done! As you can see, the product rule did most of the work, so we’re really saying that $S$ is like a ‘derivation’. We just need to work out that funny-looking first term on the right-hand side. It amounts to taking a weighted sum of 1’s, which is just a sum:

$$\frac{d p}{d \beta}\Big|_{\beta = 1} \circ (1, \dots, 1) = \sum_{i=1}^n \frac{d p_i}{d \beta}\Big|_{\beta = 1}$$

and we have

$$\frac{d p_i}{d \beta}\Big|_{\beta = 1} = \frac{d}{d \beta} p_i^\beta \Big|_{\beta = 1} = p_i \ln p_i$$

so

$$\frac{d p}{d \beta}\Big|_{\beta = 1} \circ (1, \dots, 1) = \sum_{i=1}^n p_i \ln p_i = -S(p)$$

so

$$S(p \circ (q_1, \dots, q_n)) = S(p) + p \circ (S(q_1), \dots, S(q_n))$$
So, we’ve got it!
The funny-looking rule for glomming entropies is just the derivative of the rule for glomming partition functions!
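And here is the whole argument in numbers: differentiate both sides of the partition function law at $\beta = 1$ by finite differences, using the toy Python code above, and the entropy law pops out:

```python
h = 1e-6
p  = [0.3, 0.7]
qs = [[0.5, 0.5], [0.1, 0.2, 0.7]]

def dZ_at_1(dist):
    """Central-difference derivative of Z(dist)(beta) at beta = 1."""
    return (partition_function(dist, 1 + h) - partition_function(dist, 1 - h)) / (2 * h)

# Derivative of the left-hand side of the glomming law gives -S(p o (q_1, ..., q_n)),
# so minus it is the entropy of the glommed distribution:
print(-dZ_at_1(glom_probs(p, qs)))
# Differentiating the right-hand side by the product rule gives -S(p) - sum_i p_i S(q_i),
# so the entropy law predicts the same number:
print(shannon_entropy(p) + glom_numbers(p, [shannon_entropy(q) for q in qs]))
# The two printed values agree up to the finite-difference error.
```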
But before we go further, let’s check rule (4). For starters,

$$\frac{d}{d \beta} Z(p)(\beta)\Big|_{\beta = 1} = \frac{d}{d \beta} \sum_{i=1}^n p_i^\beta \Big|_{\beta = 1} = \sum_{i=1}^n \frac{d}{d \beta} p_i^\beta \Big|_{\beta = 1}$$

But we’ve just seen that

$$\frac{d}{d \beta} p_i^\beta \Big|_{\beta = 1} = p_i \ln p_i$$

so

$$\frac{d}{d \beta} Z(p)(\beta)\Big|_{\beta = 1} = \sum_{i=1}^n p_i \ln p_i = -S(p)$$
The upshot
In our notes on the nLab you’ll see a deeper reason for being interested in the partition function: it’s actually a form of ‘decategorification’!
Let $\mathcal{X}$ be the groupoid of finite probability measure spaces (where, for technical reasons, we exclude measures that assign the measure zero to a nonempty set). Then $Z$ can be thought of as a functor from $\mathcal{X}$ to its set of isomorphism classes, viewed as a category with only identity morphisms. In other words, $Z$ assigns to each object $p$ its partition function $Z(p)$… but we can recover $p$ up to isomorphism from $Z(p)$.
Now, $\mathcal{X}$ is an algebra of a certain operad $P$ whose $n$-ary operations are the probability measures on the set $\{1, \dots, n\}$. This is just a fancy way of talking about ‘glomming probability measures’. As a kind of consequence, the set of partition functions also becomes a $P$-algebra. Furthermore, $Z$ becomes a homomorphism of $P$-algebras. This last statement mostly just says that

$$Z(p \circ (q_1, \dots, q_n)) = p \circ (Z(q_1), \dots, Z(q_n))$$

But then we can take $Z$, differentiate it with respect to $\beta$, and set $\beta = 1$. By abstract nonsense, the resulting functor should be some sort of ‘derivation’ of our $P$-algebra. Concretely, we’ve seen this mostly says that

$$S(p \circ (q_1, \dots, q_n)) = S(p) + p \circ (S(q_1), \dots, S(q_n))$$
But there should also be an abstract-nonsense theory of derivations of operad algebras. In fact I discussed this a bit back in week299, but only a little. So, there’s a lot I don’t understand, such as:
How does my ‘derivation’ way of thinking about the law relate to Tom Leinster’s interpretation of it in terms of lax operad homomorphisms, or more precisely ‘lax points’?
And if we get this straightened out, it will also be tempting to extend the story to Rényi entropy, using the fact that Rényi entropy is a kind of q-derivative of minus the logarithm of the partition function. The q-derivative is a kind of ‘twisted derivation’, but I bet a bunch of the same calculations work in some modified form.
Re: Category-Theoretic Characterizations of Entropy II
All of this seems like a very complicated way of saying that P(X,Y)=P(Y|X)P(X).