
February 13, 2014

Relative Entropy

Posted by John Baez

You may recall how Tom Leinster, Tobias Fritz and I cooked up a neat category-theoretic characterization of entropy in a long conversation here on this blog. Now Tobias and I have a sequel giving a category-theoretic characterization of relative entropy. But since some people might be put off by the phrase ‘category-theoretic characterization’, it’s called:

  • A Bayesian characterization of relative entropy

I’ve written about this paper before, on my other blog:

  • Relative Entropy (Part 1): how various structures important in probability theory arise naturally when you do linear algebra using only the nonnegative real numbers.
  • Relative Entropy (Part 2): a category related to statistical inference, $\mathrm{FinStat}$, and how relative entropy defines a functor on this category.
  • Relative Entropy (Part 3): statement of our main theorem, which characterizes relative entropy up to a constant multiple as the only functor $F : \mathrm{FinStat} \to [0,\infty)$ with a few nice properties.

But now the paper is actually done! Let me give a compressed version of the whole story here… with sophisticated digressions buried in some parenthetical remarks that you’re free to skip if you want.

Our paper gives a new characterization of the concept of relative entropy, also known as ‘relative information’, ‘information gain’ or—by people who like to use jargon to make their work seem obscure—‘Kullback-Leibler divergence’.

Here’s the basic idea. Whenever you have two probability distributions $p$ and $q$ on the same finite set $X$, you can define the entropy of $q$ relative to $p$:

$$ S(q,p) = \sum_{x\in X} q_x \ln\left( \frac{q_x}{p_x} \right) $$

Here we set

$$ q_x \ln\left( \frac{q_x}{p_x} \right) $$

equal to $\infty$ when $p_x = 0$, unless $q_x$ is also zero, in which case we set it equal to 0. Relative entropy thus takes values in $[0,\infty]$.
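
Here is a minimal Python sketch of this definition, spelling out the conventions for zero probabilities (the function name and the choice of natural logarithms are just illustrative):

```python
import math

def rel_entropy(q, p):
    """Relative entropy S(q,p) of q with respect to p, in nats.

    q and p are lists of probabilities indexed by the same finite set.
    Follows the conventions above: a term is 0 when q_x = 0, and the
    whole sum is infinite when p_x = 0 but q_x > 0.
    """
    total = 0.0
    for qx, px in zip(q, p):
        if qx == 0:
            continue                 # 0 * ln(0 / p_x) is taken to be 0
        if px == 0:
            return math.inf          # q_x > 0 but p_x = 0
        total += qx * math.log(qx / px)
    return total

print(rel_entropy([0.5, 0.5], [0.9, 0.1]))   # a finite value
print(rel_entropy([0.5, 0.5], [1.0, 0.0]))   # inf
```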

Intuitively speaking, $S(q,p)$ measures how surprised you’d be if you thought a situation was described by a probability distribution $p$… but then someone came along and said no, it’s really $q$.

Or if ‘surprise’ sounds too subjective, it’s the expected amount of information gained when you discover the probability distribution is really $q$, when you’d thought it was $p$.

Tobias and I wanted to use category theory to say what’s so great about relative entropy. We did it using a category $\mathrm{FinStat}$ where:

  • an object $(X,q)$ consists of a finite set $X$ and a probability distribution $x \mapsto q_x$ on that set;
  • a morphism $(f,s) : (X,q) \to (Y,r)$ consists of a measure-preserving function $f$ from $X$ to $Y$, together with a probability distribution $x \mapsto s_{x y}$ on $X$ for each element $y \in Y$, with the property that $s_{x y} = 0$ unless $f(x) = y$.

If the raw math seems hard to swallow, perhaps some honey-coated words will help it go down. I think of an object of $\mathrm{FinStat}$ as a system with some finite set of states together with a probability distribution on its states. This lets me think of a morphism

$$ (f,s) : (X,q) \to (Y,r) $$

in a nice way. First, there’s a measurement process $f : X \to Y$, a function from the set $X$ of states of some system being measured to the set $Y$ of states of some measurement apparatus. The condition that $f$ be measure-preserving says the probability that the apparatus winds up in any state $y \in Y$ is the sum of the probabilities of all states of $X$ leading to that outcome:

$$ r_y = \sum_{x \in f^{-1}(y)} q_x $$

Second, there’s a hypothesis $s$. This is a guess about the probability that the system being measured is in the state $x \in X$ given any measurement outcome $y \in Y$. The guess is the number $s_{x y}$.

Now, suppose we have any morphism

$$ (f,s) : (X,q) \to (Y,r) $$

in $\mathrm{FinStat}$. From this we get two probability distributions on $X$. First, we have the probability distribution $p$ given by

$$ p_x = \sum_{y \in Y} s_{x y} r_y \qquad \qquad \heartsuit\heartsuit\heartsuit $$

This is our best guess about the probability that the system is in any given state, given our hypothesis and the probability distribution of measurement results. Second, we have the ‘true’ probability distribution $q$.

The relative entropy $S(q,p)$ then says how much information we gain when we learn the ‘true’ distribution $q$, given that our best guess was $p$. In fact, this way of assigning relative entropies to morphisms defines a functor

$$ RE : \mathrm{FinStat} \to [0,\infty] $$

where we use $[0,\infty]$ to denote the category with one object, the numbers $0 \le x \le \infty$ as morphisms, and addition as composition. More precisely, if

$$ (f,s) : (X,q) \to (Y,r) $$

is any morphism in $\mathrm{FinStat}$, we define

$$ RE(f,s) = S(q,p) $$

where $p$ is defined as in equation $\heartsuit\heartsuit\heartsuit$. This tells us how surprised we are when we learn the true probability distribution $q$, if our measurement results were distributed according to $r$ and our hypothesis was $s$.
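
To make this concrete, here is a small Python sketch that builds $p$ from equation $\heartsuit\heartsuit\heartsuit$ and evaluates $RE(f,s) = S(q,p)$ for a toy morphism; all the specific sets, numbers and variable names are invented for illustration.

```python
import math

# A toy morphism (f,s) : (X,q) -> (Y,r) in FinStat, with made-up numbers.
X, Y = [0, 1, 2], ['a', 'b']
f = {0: 'a', 1: 'a', 2: 'b'}                           # measurement process
q = {0: 0.2, 1: 0.3, 2: 0.5}                           # 'true' distribution on X
r = {y: sum(q[x] for x in X if f[x] == y) for y in Y}  # f measure-preserving: r is the pushforward of q
s = {'a': {0: 0.5, 1: 0.5, 2: 0.0},                    # hypothesis: a distribution on X for each y,
     'b': {0: 0.0, 1: 0.0, 2: 1.0}}                    # supported on the fibre f^{-1}(y)

# Best guess p_x = sum_y s_{xy} r_y   (the heart equation above)
p = {x: sum(s[y][x] * r[y] for y in Y) for x in X}

# RE(f,s) = S(q,p); this toy example has no zero probabilities in q or p.
RE = sum(q[x] * math.log(q[x] / p[x]) for x in X if q[x] > 0)
print(r, p, RE)
```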

The fact that $RE$ is a functor is nontrivial and rather interesting! It says that given any composable pair of measurement processes:

$$ (X,q) \stackrel{(f,s)}{\longrightarrow} (Y,r) \stackrel{(g,t)}{\longrightarrow} (Z,u) $$

the relative entropy of their composite is the sum of the relative entropies of the two parts:

$$ RE((g,t) \circ (f,s)) = RE(g,t) + RE(f,s) . $$

We prove that $RE$ is a functor. However, we go further: we characterize relative entropy by saying that up to a constant multiple, $RE$ is the unique functor from $\mathrm{FinStat}$ to $[0,\infty]$ obeying three reasonable conditions.
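
Here is a quick numerical check of this additivity on a made-up pair of composable morphisms. Everything below (the sets, distributions and helper names) is invented for illustration; the convention is that s[x][y] is the hypothesized probability of state x given outcome y.

```python
import math

def rel_entropy(q, p):
    # S(q,p) in nats; the toy numbers below avoid zero-probability subtleties
    return sum(qx * math.log(qx / px) for qx, px in zip(q, p) if qx > 0)

def RE(q, hyp, target):
    # p_x = sum_y hyp[x][y] * target[y]  (the heart equation), then S(q, p)
    p = [sum(hyp_x[y] * target[y] for y in range(len(target))) for hyp_x in hyp]
    return rel_entropy(q, p)

# (f,s) : (X,q) -> (Y,r) and (g,t) : (Y,r) -> (Z,u), with invented numbers.
q = [0.1, 0.2, 0.3, 0.4]                                 # on X = {0,1,2,3}
f = [0, 0, 1, 1]                                         # f : X -> Y = {0,1}
s = [[0.5, 0.0], [0.5, 0.0], [0.0, 0.4], [0.0, 0.6]]     # s[x][y], supported on the fibres of f
r = [sum(q[x] for x in range(4) if f[x] == y) for y in range(2)]   # pushforward of q along f
t = [[0.25], [0.75]]                                     # t[y][z] for g : Y -> Z = {0}
u = [1.0]                                                # pushforward of r along g

# Composite hypothesis: (s*t)[x][z] = sum_y s[x][y] * t[y][z]
st = [[sum(s[x][y] * t[y][z] for y in range(2)) for z in range(1)] for x in range(4)]

lhs = RE(q, st, u)                  # relative entropy of the composite morphism
rhs = RE(q, s, r) + RE(r, t, u)     # sum of the two relative entropies
print(lhs, rhs, math.isclose(lhs, rhs))
```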

Lower semicontinuity

The first condition is that $RE$ is lower semicontinuous. The set $P(X)$ of probability distributions on a finite set $X$ naturally has the topology of an $(n-1)$-simplex when $X$ has $n$ elements. The set $[0,\infty]$ has an obvious topology where it’s homeomorphic to a closed interval. However, with these topologies, the relative entropy does not define a continuous function

$$ \begin{array}{rcl} S : P(X) \times P(X) &\to& [0,\infty] \\ (q,p) &\mapsto & S(q,p) . \end{array} $$

The problem is that

$$ S(q,p) = \sum_{x\in X} q_x \ln\left( \frac{q_x}{p_x} \right) $$

and $q_x \ln(q_x/p_x)$ is $\infty$ when $p_x = 0$ and $q_x > 0$ — but it’s $0$ when $p_x = q_x = 0$.

So, it turns out that $S$ is only lower semicontinuous, meaning that if $p^i, q^i$ are sequences of probability distributions on $X$ with $p^i \to p$ and $q^i \to q$ then

$$ S(q,p) \le \liminf_{i \to \infty} S(q^i, p^i) $$

We give the set of morphisms in $\mathrm{FinStat}$ its most obvious topology, and show that with this topology, $RE$ maps morphisms to morphisms in a lower semicontinuous way.
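
To see the failure of continuity concretely, here is a tiny Python example (with invented numbers) where the relative entropy is infinite along a convergent sequence of distributions but drops to zero at the limit: a jump down, which is exactly what lower semicontinuity allows.

```python
import math

def rel_entropy(q, p):
    # S(q,p) with the conventions above: a term is 0 when q_x = 0,
    # and the sum is infinite when p_x = 0 but q_x > 0.
    total = 0.0
    for qx, px in zip(q, p):
        if qx == 0:
            continue
        if px == 0:
            return math.inf
        total += qx * math.log(qx / px)
    return total

# Fix p = (1,0) and let q^i = (1 - 1/i, 1/i) -> q = (1,0).
p = [1.0, 0.0]
for i in [10, 100, 1000]:
    print(i, rel_entropy([1 - 1/i, 1/i], p))   # inf along the whole sequence

print(rel_entropy([1.0, 0.0], p))              # 0.0 at the limit
```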

(Lower semicontinuity may seem like an annoying property. But there’s a way to redeem it. There’s a sneaky topology on $[0,\infty]$ such that a function taking values in $[0,\infty]$ is lower semicontinuous (in the lim inf sense above) if and only if it’s continuous with respect to this sneaky topology!

Using this idea, we can make $FinStat$ and $[0,\infty]$ into topological categories — that is, categories internal to Top — in such a way that lower semicontinuity simply says

$$ RE : FinStat \to [0,\infty] $$

is a continuous functor.

A bit confusingly, this sneaky topology on $[0,\infty]$ is called the upper topology. I’ve fallen in love with the upper topology on $[0,\infty]$. Why?

Well, $[0,\infty]$ is a very nice rig, or ‘ring without negatives’. Addition is defined in the obvious way, and multiplication is defined in the almost-obvious way, except that

$$ 0 \cdot \infty = \infty \cdot 0 = 0 $$

Even this is actually obvious if you remember that it’s required by the definition of a rig. But if you try to put the ordinary closed interval topology on $[0,\infty]$, you’ll see multiplication is not continuous, because $a \cdot \infty$ is infinite when $a \gt 0$ but then it suddenly jumps down to zero when $a$ hits zero. However, multiplication is continuous if we give $[0,\infty]$ the upper topology! Then $[0,\infty]$ becomes a topological rig.)

Convex linearity

The second condition is that $RE$ is convex linear. We describe how to take convex linear combinations of morphisms in $\mathrm{FinStat}$, and then the functor $RE$ maps any convex linear combination of morphisms in $\mathrm{FinStat}$ to the corresponding convex linear combination of numbers in $[0,\infty]$.

Intuitively, this means that if we take a coin with probability $P$ of landing heads up, and flip it to decide whether to perform one measurement process or another, the expected information gained is $P$ times the expected information gain of the first process plus $1-P$ times the expected information gain of the second process.
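
Here is a small numerical sketch of that coin-flip picture, with invented toy morphisms whose target is a one-point set, so that each hypothesis is just a guessed distribution and $RE$ reduces to plain relative entropy; the convex combination is formed on the disjoint union of the state spaces.

```python
import math

def rel_entropy(q, p):
    # S(q,p) in nats; the toy numbers below avoid zero-probability corner cases
    return sum(qx * math.log(qx / px) for qx, px in zip(q, p) if qx > 0)

# Two toy morphisms, each mapping onto a one-point set, so each hypothesis is
# just a guessed distribution and RE is S(true, guessed).
q1, s1 = [0.2, 0.8], [0.5, 0.5]
q2, s2 = [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]
lam = 0.3    # probability of the coin landing heads up

# Convex combination: with probability lam run the first experiment, otherwise
# the second.  States live on the disjoint union, with distributions rescaled.
q = [lam * x for x in q1] + [(1 - lam) * x for x in q2]
p = [lam * x for x in s1] + [(1 - lam) * x for x in s2]

lhs = rel_entropy(q, p)                                            # RE of the combined morphism
rhs = lam * rel_entropy(q1, s1) + (1 - lam) * rel_entropy(q2, s2)  # convex combination of REs
print(lhs, rhs, math.isclose(lhs, rhs))
```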

(Operadically, the point is that both $FinStat$ and $[0,\infty]$ are algebras of an operad P whose operations are convex linear combinations. The $n$-ary operations in P are just probability distributions on an $n$-element set. In other words, they’re points in the $(n-1)$-simplex.

So, saying that $RE$ is convex linear means that

$$ RE : FinStat \to [0,\infty] $$

is a map of P-algebras. But we avoid discussing this in our paper because $FinStat$, being a category, is just a ‘weak’ P-algebra, and we decided this would be too much for our poor little readers.

For those who like fine nuances: P is a topological operad, and $FinStat$ and $[0,\infty]$ are algebras of this in the topological category TopCat. As I mentioned, $FinStat$ is a ‘weak’ P-algebra, meaning the laws for convex linear combinations hold only up to coherent natural isomorphism. $[0,\infty]$ is strict… but to get convex linear combinations like $\lambda \cdot 0 + (1 - \lambda) \infty$ to behave continuously, we have to give $[0,\infty]$ the upper topology!)

Vanishing on a subcategory

The third condition is that $RE$ vanishes on morphisms $(f,s) : (X,q) \to (Y,r)$ where the hypothesis $s$ is optimal. By this, we mean that equation $\heartsuit\heartsuit\heartsuit$ gives a probability distribution $p$ equal to the ‘true’ one, $q$.

That makes a lot of sense conceptually: we don’t gain any information upon learning the truth about a situation if we already knew the truth!

(But the subcategory of $FinStat$ where we keep all the objects but only these ‘optimal’ morphisms also has a nice category-theoretic significance. Tom Leinster called it FP in an earlier post here on this blog.

That’s because it’s the ‘free P-algebra on an internal P-algebra’, where P is the operad I mentioned. I won’t explain what this means here, because Tom did it! Suffice it to say that it’s a shockingly abstract piece of operad theory that nonetheless manages to capture the concept of entropy very neatly. But that’s plain old entropy, not relative entropy.)

The result

Here, then, is our main result:

Theorem. Any lower semicontinuous, convex-linear functor

$$ F : \mathrm{FinStat} \to [0,\infty] $$

that vanishes on every morphism with an optimal hypothesis must equal some constant times the relative entropy. In other words, there exists some constant $c \in [0,\infty]$ such that

$$ F(f,s) = c\, RE(f,s) $$

for any morphism $(f,s) : (X,p) \to (Y,q)$ in $\mathrm{FinStat}$.

The proof

The proof is surprisingly hard. Or maybe we’re just surprisingly bad at proving things. But the interesting thing is this: the proof is swift and effective in the ‘generic’ case — the case where the support of the probability measures involved is the whole set they’re living on, and the constant $c$ is finite.

It takes some more work to handle the case where the probability measures have smaller support.

But the really hard work starts when we handle the case that, in the end, has $c = \infty$. Then the proof becomes more like analysis than what you normally expect in category theory. We slowly corner the result, blocking off all avenues of escape. Then we close in, grab its neck, and strangle it, crushing its larynx ever tighter, as it loses the will to fight back and finally expires… still twitching.

You’ve got to read the proof to understand what I mean.

Posted at February 13, 2014 3:48 PM UTC


Re: Relative Entropy

You might include the name “cross-entropy” among your list – I saw it used a lot when I was thinking about these things.

Posted by: Allen Knutson on February 13, 2014 6:15 PM | Permalink | Reply to this

Re: Relative Entropy

I’m experiencing déjà vu.

Posted by: John Baez on February 13, 2014 6:26 PM | Permalink | Reply to this

Re: Relative Entropy

I had (and occasionally still have) trouble remembering all the different types of “two-fold” entropy: joint entropy, cross entropy, relative entropy, mutual information, etc., and of course it’s not helped by some of them having multiple names. Here’s something information-theoretic that Richard Reeve told me, clarifying the difference between cross entropy and relative entropy:

Cross entropy is the number of bits you need to encode a message in the wrong language.

Relative entropy is the number of extra bits you need to encode a message in the wrong language.
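
A quick numerical illustration of this distinction, with made-up distributions and base-2 logarithms: the cross entropy (total bits with the wrong code) splits as the entropy of the true distribution plus the relative entropy (the extra bits).

```python
import math

def entropy(q):                        # expected bits using the right code for q
    return -sum(x * math.log2(x) for x in q if x > 0)

def cross_entropy(q, p):               # expected bits using a code built for p
    return -sum(qx * math.log2(px) for qx, px in zip(q, p) if qx > 0)

def rel_entropy(q, p):                 # the extra bits
    return sum(qx * math.log2(qx / px) for qx, px in zip(q, p) if qx > 0)

q = [0.5, 0.25, 0.25]                  # the language the messages are really in
p = [0.7, 0.2, 0.1]                    # the 'wrong language' the code was built for

print(cross_entropy(q, p))             # total bits per symbol
print(entropy(q) + rel_entropy(q, p))  # same number: right-code bits plus extra bits
```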

Posted by: Tom Leinster on February 13, 2014 7:00 PM | Permalink | Reply to this

Re: Relative Entropy

Speaking of not-quite-synonyms, what you call a stochastic map in the paper is, in the special case $X = Y$, what I’d call a Markov kernel. That might be worth mentioning parenthetically, since the same readers who’d like “Bayesian” better than “category-theoretic” may also find “Markov” to be a comfortingly familiar name.

Posted by: Mark Meckes on February 17, 2014 6:51 AM | Permalink | Reply to this

Re: Relative Entropy

I saw the Kullback-Leibler divergence mentioned in a colloquium last year and I thought at the time it cried out for a category-theoretic interpretation. Well done!

Since you are considering functors out of $FinStat$ that invert every morphism in $FP$ (since $0\in [0,\infty]$ is the only invertible arrow!), you are really talking about functors with domain category $FinStat[FP^{-1}]$. However, since everything is meant to take place in internal categories in $Top$, this localisation is not something I’ve seen done.

Posted by: David Roberts on February 14, 2014 12:29 AM | Permalink | Reply to this

Re: Relative Entropy

David wrote:

you are really talking about functors with domain category $FinStat[FP^{-1}]$.

That’s a very interesting observation! It may lead to an even neater formulation of our result; let me explain.

In $FinStat$, there is a canonical $FP$-arrow from every object $(Y,q)$ to the one-element probability space $(\mathbf{1},1)$, which is displayed in the diagram at the bottom of p.15. This means that the localization $FinStat[FP^{-1}]$ is a category in which all objects are isomorphic. In other words, it’s a category equivalent to a monoid! I bet that if we take the localization $FinStat[FP^{-1}]$ within the $2$-category of (weak) algebras in $Cat(Top)$ of the convex combinations operad $P$ (see Appendix B), then that monoid will turn out to be the interval $[0,\infty]$ — the codomain of our relative entropy functor! I must be hallucinating… Let me give it a try nevertheless:

Conjecture: Relative entropy $RE : FinStat \to [0,\infty]$ is the universal map of topological weak $P$-algebras which inverts $FP$.

Until now, we haven’t even checked whether $FinStat$ actually is a weak $P$-algebra. So, there’s lots to do…

By the way, our paper has appeared on the arXiv.

Posted by: Tobias Fritz on February 14, 2014 1:41 AM | Permalink | Reply to this

Re: Relative Entropy

As a first check, one could see if $RE$ is a localisation just of categories, a la Gabriel-Zisman. Even if this isn’t true, it doesn’t preclude the stronger statement of your conjecture; but if true would help in showing the conjecture.

Posted by: David Roberts on February 14, 2014 3:55 AM | Permalink | Reply to this

Re: Relative Entropy

Tobias, remember when you said this:

Secondly, potentially one can still see the entropy functor as a sort of decategorification: it’s a kind of decategorification which retains the morphisms, but throws away the objects by mapping the whole category down to a monoid! Then due to Eckmann-Hilton, one should expect the functor to identify the tensor product of morphisms with composition of morphisms. Looking at Theorem MM, that’s precisely what entropy does!

So perhaps you were seeing a glimpse of your conjecture through the fog, as it were…

Posted by: David Roberts on February 14, 2014 7:06 AM | Permalink | Reply to this

Re: Relative Entropy

some people might be put off by the phrase ‘category-theoretic characterization’

And you didn’t worry about putting off other people with ‘Bayesian characterization’?

More seriously, since I’m too lazy or busy or something to check in the paper myself at the moment, is there anything in there about more general Rényi divergences?

Posted by: Mark Meckes on February 14, 2014 6:21 AM | Permalink | Reply to this

Re: Relative Entropy

Mark wrote:

And you didn’t worry about putting off other people with ‘Bayesian characterization’?

I’m betting that including that phrase will increase the readership of our paper over the bare title ‘A characterization of relative entropy’, while ‘category-theoretic characterization’ would decrease it. Lure ‘em in with Bayesianism, then sock ‘em with categories — that’s the plan.

More seriously, since I’m too lazy or busy or something to check in the paper myself at the moment, is there anything in there about more general Rényi divergences?

No! — and that’s an interesting point. Our earlier paper with Tom used the obvious convex structure on $[0,\infty)$ to characterize Shannon entropy as the unique convex-linear continuous functor

$$ F : FinProb \to [0,\infty) $$

up to a scalar multiple.

But Tom noticed that there’s a 1-parameter family of convex structures on $[0,\infty)$ (compatible with its structure as an additive monoid), and using these instead, our theorem gives the Tsallis entropies!

(As you know but others may not, Tsallis entropies and Rényi entropies are two packagings of the same basic thing.)

But in this paper, relative Tsallis and Rényi entropy don’t seem to show up, even if you allow variant convex structures on $[0,\infty]$. The Shannon formula

$$ -\sum_i p_i \ln p_i $$

shows up purely on the basis of functoriality — we don’t need convex-linearity to get it! This puzzled me.

We use convex-linearity later in our argument, but now I’m wondering if it’s truly necessary. This is especially true given the observation of David and Tobias here that relative Shannon entropy

$$ RE : FinStat \to [0,\infty] $$

inverts every arrow in the subcategory $FP$. It’s possible that we could completely drop reference to convex-linearity and boost Tobias’ conjecture to:

Conjecture 2: Relative entropy $RE : FinStat \to [0,\infty]$ is the universal map of topological categories that inverts all morphisms in the subcategory $FP$.

I’ve been vaguely hoping that (some nonempty subset of) Tom, Tobias and I would get back together and write yet another paper on this stuff. Now I sense another paper coming on.

Posted by: John Baez on February 14, 2014 10:33 AM | Permalink | Reply to this

Re: Relative Entropy

I’ve been vaguely hoping that (some nonempty subset of) Tom, Tobias and I would get back together and write yet another paper on this stuff.

That would be pleasant, but I’d be even more impressed to see the empty subset write a paper on this stuff.

Posted by: Tom Leinster on February 14, 2014 11:24 AM | Permalink | Reply to this

Re: Relative Entropy

If the empty set wrote a paper, it would probably have trouble putting it on the arXiv. So we could take credit for its work. I don’t think it would mind.

Posted by: John Baez on February 14, 2014 11:28 AM | Permalink | Reply to this

Re: Relative Entropy

John wrote:

The Shannon formula $-\sum_i p_i\ln p_i$ shows up purely on the basis of functoriality.

I’m not exactly sure what you’re referring to. We do use convex linearity for deriving the relative entropy formula

$$ \sum_i p_i\ln\left(\frac{p_i}{q_i}\right), $$

even in the easy generic case in which no infinities arise. This happens on p.18 after saying that “the upper horizontal morphism is actually a convex linear combination”.

If we try to use Tom’s deformed convex structures on $[0,\infty]$ instead, then we would still obtain the functional equation (4.2) and therefore logarithms. But the derivation on p.18 doesn’t seem to make sense anymore: the $\alpha$ in the second-to-last equation would fail to cancel out, and we would not obtain any formula for the value of $F$ applied to a morphism.

Maybe this sort of thing is what you meant, John. I am intrigued by your Conjecture 2!

Posted by: Tobias Fritz on February 14, 2014 1:00 PM | Permalink | Reply to this

Re: Relative Entropy

Tobias wrote:

I’m not exactly sure what you’re referring to.

I was referring to the derivation of equation (4.2). We talked about this before: I said that even with a deformed convex structure on $[0,\infty]$, our argument gives the functional equation for the logarithm, thus — apparently — ruling out relative Rényi or Tsallis entropy as possibilities.

I made a mistake above: we don’t get the formula

$$ - \sum_i p_i \ln p_i $$

from this little calculation. But we do get logarithms. This already seems to rule out relative Rényi or Tsallis entropy, which don’t seem to have logarithms in them.

Now you seem to be saying that a modified convex structure gives no functor at all obeying our conditions! That would be interesting. Somehow Shannon entropy seems much better than the others in the relative context. Fans of relative Rényi and Tsallis entropy have some work to do, to rescue them.

Posted by: John Baez on February 14, 2014 1:56 PM | Permalink | Reply to this

Re: Relative Entropy

John wrote:

I was referring to the derivation of equation (4.2) [..] thus — apparently — ruling out relative Rényi or Tsallis entropy as possibilities.

Sorry, but now I’m failing to see this. Relative Rényi entropy, let’s denote it $S_q$, also contains a logarithm! And in fact, if you take its defining equation and plug in $p=(1,0)$ and $q=(\alpha,1-\alpha)$ as in the definition of our function $g$, you also get

$$ S(p,q) = \frac{1}{q-1} \log\left(1^q\cdot \alpha^{1-q} + 0^q\cdot (1-\alpha)^{1-q}\right) = -\log\alpha. $$

So it seems to me that that part of our proof would still be fine…
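
A quick numerical check of this computation, using the standard formula for the Rényi divergence (the order parameter is called 'order' below to avoid clashing with the distribution named q; the numbers are just for illustration):

```python
import math

def renyi_divergence(order, P, Q):
    # D_order(P || Q) = 1/(order-1) * log( sum_i P_i^order * Q_i^(1-order) ),
    # the usual formula for relative Renyi entropy (order != 1), skipping P_i = 0 terms.
    total = sum(Pi**order * Qi**(1 - order) for Pi, Qi in zip(P, Q) if Pi > 0)
    return math.log(total) / (order - 1)

alpha = 0.3
for order in [0.5, 2.0, 5.0]:
    print(order, renyi_divergence(order, [1.0, 0.0], [alpha, 1 - alpha]), -math.log(alpha))
```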

I agree that this doesn’t work for Tsallis entropies, though, since these use $q$-deformed logarithms. Is it possible to $q$-deform the functional equation $g(xy)=g(x)+g(y)$ such as to obtain a characterization of the $q$-logarithm?

Posted by: Tobias Fritz on February 14, 2014 4:46 PM | Permalink | Reply to this

Re: Relative Entropy

Yes, it is possible to generalize the functional equation, which results in

$$ \log_q (xy) = \log_q(x) + \log_q(y) + (1-q) \log_q(x) \log_q(y) $$

See e.g. http://arxiv.org/pdf/cond-mat/0410270.pdf  There’s a good deal of work on “q-calculus” or “q-arithmetic” (unfortunately these terms have multiple meanings...). This starts with the obvious generalization of addition to q-addition: $x +_q y = x + y + (1-q)xy$
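
For what it’s worth, here is a short numerical check of that identity, using the usual Tsallis $q$-logarithm $\ln_q(x) = (x^{1-q}-1)/(1-q)$ and made-up values:

```python
import math

def q_log(x, q):
    # Tsallis q-logarithm: (x^(1-q) - 1) / (1-q), which recovers ln(x) as q -> 1
    return (x**(1 - q) - 1) / (1 - q)

q = 1.7
x, y = 2.0, 5.0
lhs = q_log(x * y, q)
rhs = q_log(x, q) + q_log(y, q) + (1 - q) * q_log(x, q) * q_log(y, q)
print(lhs, rhs, math.isclose(lhs, rhs))
```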

Posted by: Marc Harper on February 18, 2014 1:58 AM | Permalink | Reply to this

Re: Relative Entropy

Thanks, Marc, that’s very useful! So maybe we’ll have to $q$-deform our functoriality assumption to ‘$q$-functoriality’ which is based on $q$-addition,

$$ F_q(g\circ f) = F_q(g) + F_q(f) + (1-q) F_q(g) F_q(f). $$

Just another wacky idea… Somebody should figure out whether relative Tsallis entropy on our category FinStat is $q$-functorial in this sense!

Posted by: Tobias Fritz on February 18, 2014 2:50 AM | Permalink | Reply to this

Re: Relative Entropy

Wow, you’re right! I was completely confused.

It’s strange how when I tried to tell you the same things in email, you didn’t seem very interested or point out all my confusions, but now in public you are!

Maybe you were just focused on finishing the paper, instead of thinking about how it does or does not generalize to relative Rényi or Tsallis entropy. Maybe that was good… but either our theorem generalizes to these other kinds of entropy, or there’s something ‘bad’ about these entropies, or there’s something subtly ‘bad’ about our theorem.

(Not that it’s false, just that the framework is somehow biased towards Shannon entropy.)

Posted by: John Baez on February 15, 2014 6:13 PM | Permalink | Reply to this

Re: Relative Entropy

So how could one go about proving or disproving John’s Conjecture 2? Here is one idea. I’m using notation which is different from the paper now in order to keep things concise and conceptually clear.

One can try to show that if two morphisms $f : A\to B$ and $f' : A'\to B'$ in FinStat have the same value under the relative entropy functor, then there is a diagram of the form

$$ \begin{matrix} A & \stackrel{f}{\to} & B \\ \uparrow && \uparrow \\ C & \to & D \\ \downarrow && \downarrow \\ A' & \stackrel{f'}{\to} & B' \end{matrix} $$

in which all vertical arrows are in FP. This property would guarantee that $f$ and $f'$ become isomorphic in the localization $FinStat[FP^{-1}]$. And I think that this is the right kind of diagram to look at since one can choose $C$ to be much bigger than $A$ and $A'$ and thereby “cover” these two objects by the big object $C$, so that one has a lot of leeway in choosing the probability distribution and the morphism $C\to D$. Similarly, one should try to take $D$ to be much bigger than $B$ and $B'$.

So far, this ignores the adjective “topological” in John’s Conjecture 2. So in order to get this approach to work, one will probably have to allow the above diagram to be approximate in some sense; I don’t yet know what that could mean.

Independently of this approach, it might also be good to know whether $FinStat[FP^{-1}]$ arises from a calculus of fractions.

Posted by: Tobias Fritz on February 15, 2014 6:08 PM | Permalink | Reply to this

Re: Relative Entropy

It’s a great result — congratulations!

I think the aspect I understand the least is the category FinStat. I feel slightly guilty about that, because I’m certain you’ve blogged about it and written about it in the paper. (Perhaps I’ve even read about it, but not fully digested the idea.) Rather than ask you to repeat yourself, can you (John or Tobias) point me to an explanation that would complement the one in the blog post above?

I find it hard to say what it is about FinStat that I’m not digesting. The explanation in terms of measurements and hypotheses is perfectly reasonable, but in some sense I can’t put my finger on, there’s something I’m having trouble swallowing.

I too am intrigued by the apparent difficulty of proving the result. However, that’s slightly tempered by the fact that in our joint paper characterizing entropy, the main result we depended on (Faddeev’s) is also quite difficult to prove. If I remember correctly, the proof was gradually improved over the years.

Posted by: Tom Leinster on February 14, 2014 11:32 AM | Permalink | Reply to this

Re: Relative Entropy

Tom wrote:

I think the aspect I understand the least is the category $FinStat$. […] Rather than ask you to repeat yourself, can you (John or Tobias) point me to an explanation that would complement the one in the blog post above?

I bet you’d like the explanation that uses more diagrams; this also fits $FinStat$ into more of an overall context along with $FinProb$ and FP. I blogged about it on Azimuth:

  • Relative Entropy (Part 2): a category related to statistical inference, $\mathrm{FinStat}$, and how relative entropy defines a functor on this category.

A more detailed version of the same tale starts on page 4 of our paper.

Posted by: John Baez on February 14, 2014 2:04 PM | Permalink | Reply to this

Re: Relative Entropy

Thanks. I certainly read Relative Entropy (Part 2) before; I remember those wiggly triangles well.

But I see you’ve missed something out of the definition of FinStat in your post above. You said that a map $(X, q) \to (Y, r)$ is a measure-preserving function $f : X \to Y$ together with a probability distribution $s_{\bullet y} = (s_{x y})_{x \in X}$ on $X$ for each $y \in Y$. You didn’t say that $f$ and $s$ were required to be compatible in any way, which was one of the things that raised my eyebrows.

But in your paper, you add the extra requirement that $f \circ s = 1_Y$, which (according to your Lemma 4) is equivalent to asking that the distribution $s_{\bullet y}$ is supported on $f^{-1}(y)$, for each $y \in Y$.

So, a map $(X, q) \to (Y, r)$ in FinStat could be described as a measure-preserving function $X \to Y$ together with a probability distribution on each fibre.

Your Lemma 4 surprises me!

Posted by: Tom Leinster on February 14, 2014 4:10 PM | Permalink | Reply to this

Re: Relative Entropy

Tom wrote:

Your Lemma 4 surprises me!

Why? Because it doesn’t generally hold if you start with an arbitrary operad and construct the analogue of our category FinStat?

In terms of algebraic theories, I think that Lemma 4 is closely related to the following property of the theory of convex combinations: consider a family of $n$-ary operations,

$$ (x_1,\ldots,x_n)\mapsto s_i(x_1,\ldots,x_n),\quad i=1,\ldots,m $$

which are composed with an $m$-ary operation $r$ to form the $n$-ary operation

$$ (x_1,\ldots,x_n)\mapsto r(s_1(x_1,\ldots,x_n),\ldots,s_m(x_1,\ldots,x_n)). $$

Now if this composite operation has the property that it is independent of some $x_j$, then it follows that for each $k=1,\ldots,m$, $r$ is independent of its $k$th argument, or $s_k$ is independent of $x_j$. I think of this as some kind of positivity property. It fails, for example, for the theory of abelian groups, in which the two operations $(x_1,x_2)\mapsto x_1 + x_2$ and $(x_1,x_2)\mapsto -x_1-x_2$ can be trivially combined to obtain the zero operation, which is independent of all its arguments.

Has anyone seen this kind of thing before? Can it be generalized to a property of (not necessarily finitary) monads?

By the way, I have to admit that I also find FinStat a bit mysterious without being able to point my finger at anything specific…

Posted by: Tobias Fritz on February 15, 2014 5:25 PM | Permalink | Reply to this

Re: Relative Entropy

Tom wrote:

You didn’t say that $f$ and $s$ were required to be compatible in any way, which was one of the things that raised my eyebrows.

Whoops. This blog article was based on another blog article which was based on an old version of the introduction that accidentally omitted that clause. I actually meant to fix that other blog article, but got distracted and forgot.

I’ll fix ‘em both now, by adding the bit in boldface here:

a morphism $(f,s) : (X,q) \to (Y,r)$ consists of a measure-preserving function $f$ from $X$ to $Y$, together with a probability distribution $x \mapsto s_{x y}$ on $X$ for each element $y \in Y$, with the property that $s_{x y} = 0$ unless $f(x) = y$.

This is equivalent to saying that the stochastic map $s$ is a ‘section’ for the function $f$.

Posted by: John Baez on February 15, 2014 6:27 PM | Permalink | Reply to this

Re: Relative Entropy

Tom wrote:

Your Lemma 4 surprises me!

From a certain viewpoint it shouldn’t be surprising at all. Suppose $f : X \to Y$ and $s : Y \to X$ are functions. Then this lemma would say $s$ is a right inverse of $f : X \to Y$ iff for each $y \in Y$ we have $s(y) \in f^{-1}(y)$.

When we generalize and let $s : Y \to X$ be a stochastic map, then $s$ is a right inverse of $f$ iff for each $y \in Y$ we have $s(y) \in f^{-1}(y)$ with probability 1.

Here I’m thinking of a stochastic map $s : Y \to X$ as a map that randomly sends $y \in Y$ to different points $x \in X$ with different probabilities $s_{x y}$. This is a good way to think about it.

(Unfortunately on this blog the \leadsto symbol doesn’t work — this should give a squiggly arrow. In our paper we use this symbol for stochastic maps, since they feel more squiggly than actual deterministic functions: you’re not quite sure what they’ll do.)

Posted by: John Baez on February 15, 2014 6:45 PM | Permalink | Reply to this

Re: Relative Entropy

I hadn’t heard of \leadsto; is it this: \rightsquigarrow?

(I don’t need to tell you how I did that.)

Posted by: Tom Leinster on February 15, 2014 9:21 PM | Permalink | Reply to this

Re: Relative Entropy

Hmm, you’re using ‘\rightsquigarrow’ to get

\rightsquigarrow

I believe ‘\leadsto’ comes with the AMS symbols, and in our paper it comes out looking less insanely wiggly, like at the bottom of this table:

though in some other tables \rightsquigarrow and \leadsto look the same.

Posted by: John Baez on February 15, 2014 10:10 PM | Permalink | Reply to this

Re: Relative Entropy

Hi everyone,

I was recently working on something completely unrelated to entropy when out popped the entropy function. Since entropy has come up here again, I was wondering if you could tell me your thoughts on a few things to help me decide whether this is a coincidence or not.

1. Given a measure $p$, let $p^t$ denote the new measure on the same sample set where $(p^t)_i$ is defined to be $(p_i)^t$. This is multiplicative in $t$: $(p^t)^u = p^{t u}$. Now take the derivative of the total mass of $p^t$ at $t=1$ but with respect to $\log t$, to make time additive:

$$ \lim_{t\to 1} \frac{\sum_i p_i^t - \sum_i p_i}{\log t} = \sum_i p_i \log p_i, $$

which is your entropy, up to a sign. Do $p^t$ and this computation come up anywhere in the entropy world? It seems similar to the Tsallis entropy but also seems to be not exactly the same.

2. Suppose we let the sample set be infinite. Then it is possible for $\sum_i p_i$ to converge, but $\sum_i p_i \log p_i$ to diverge. Is this something that comes up in the entropy world? In my world, this is an unfortunate pathology (albeit a pretty mild one). So if this is familiar to you, I’d love to know how you think about it or, even better, get around it.

3. Now consider a sequence $p[n]$ of probability measures on $\{1,2,3,\dots\}$, where $p[n]_i$ is $1/n$ for $i=1,\dots,n$ and is $0$ for $i\ge n+1$. Then the total mass of $p[n]^t$ converges to $0$ if $t \gt 1$ but is $1$ for all $n$ if $t=1$. You might say $p[n]$ converges to some fictitious probability measure $q$ such that $q^t$ is zero for all $t\neq 0$. Is this discontinuity a familiar phenomenon? Does entropy have anything to say about it?

Thoughts on any of this from the point of view of entropy, rather than just basic real analysis, would be much appreciated!

Posted by: James Borger on February 16, 2014 2:57 AM | Permalink | Reply to this

Re: Relative Entropy

Hi, James!

You wrote:

Do $p^t$ and this computation come up anywhere in the entropy world?

I believe so! First note that your formula

$$ \lim_{t\to 1} \frac{\sum_i p_i^t - \sum_i p_i}{\log t} = \sum_i p_i \log p_i, $$

says essentially the same thing as

$$ \lim_{t\to 1} \frac{\sum_i p_i^t - \sum_i p_i}{t - 1} = \sum_i p_i \log p_i, $$

or in other words

$$ \left. \frac{d}{d t} \sum_i p_i^t \;\right|_{t = 1} = \sum_i p_i \log p_i . $$

Second, the quantity

$$ Z(t) = \sum_i p_i^t $$

is known in statistical mechanics as the ‘partition function at temperature $t$’ in situations where our original probability distribution $p$ is called the ‘Boltzmann distribution at temperature $t = 1$’ and the partition function had been normalized to $1$ at this temperature.

In this context, the above formula expressing entropy as a derivative is a special case of a general formula, famous in physics, relating entropy to the derivative of the partition function!
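
Here is a small numerical check of the derivative formula above, with a made-up distribution:

```python
import math

p = [0.1, 0.2, 0.3, 0.4]              # a made-up probability distribution

def Z(t):                             # partition function, normalized so Z(1) = 1
    return sum(pi**t for pi in p)

h = 1e-6
numerical = (Z(1 + h) - Z(1 - h)) / (2 * h)      # dZ/dt at t = 1
exact = sum(pi * math.log(pi) for pi in p)       # sum_i p_i log p_i  (minus the entropy)
print(numerical, exact)
```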

I believe that in a similar way, Tsallis entropy is a $q$-derivative of the partition function. I wrote a little paper once, expressing Rényi entropy as a $q$-derivative of a related quantity called the ‘free energy’. I’ve been meaning to do more with these ideas someday.

If this intrusion of physics seems intimidating or digressive, I should emphasize that statistical mechanics is just the application of probability theory to physics. Its core ideas apply to any situation where you don’t have complete knowledge of what’s going on. It just happens that the physicists got some of the ideas first.

In a different community of researchers, namely statisticians, the guys $p_i^t$ are called an ‘escort distribution’—after you normalize them to be a probability distribution.

Posted by: John Baez on February 16, 2014 11:47 AM | Permalink | Reply to this

Re: Relative Entropy

Thanks, John! I’ll check these out and report back if I have any follow up questions.

Posted by: James Borger on February 17, 2014 9:54 PM | Permalink | Reply to this

Re: Relative Entropy

Very interesting work, John, Tobias, and Tom! I especially like the point of view of thinking of relative entropy as a decategorification of a more general notion, which is to be identified.

Which makes me wonder. It has always seemed to me that the most fundamental object in information theory is Shannon’s self-information or “surprise” function which assigns to an event of probability $p$ the surprise $\log \frac{1}{p}$. Indeed, it was the starting point for Shannon’s axiomatic development of his ideas in his famous paper:

C.E. Shannon, A Mathematical Theory of Communication, Bell Syst. Techn. J., Vol. 27, pp 379–423, (Part I), 1948.

The nice thing about self-information is that it may be viewed as a random variable, which makes it a much richer object! Shannon entropy is the expectation (or $L^1$ norm) of this random variable. I may be wrong, but the one time that I took a look at Renyi entropy, I recall thinking that they were precisely the $L^p$ norms of self-information. (I may possibly even have gotten this from reading one of your blog posts, I don’t recall anymore.) If this is true, then working with self-information might be the right level of generality to unite results in these various entropies.

So perhaps self-information is the right starting point for a dedecategorification of information theory. In particular, the functoriality property that you observed reminds me very much of Shannon’s axiomatic treatment of self-information, where he concludes that, up to a multiplicative constant, $\log \frac{1}{p}$ is the unique function satisfying some rather innocuous properties.

Posted by: Manoj Gopalkrishnan on February 16, 2014 5:39 AM | Permalink | Reply to this

Re: Relative Entropy

Hello Manoj,

I like what you said about self information. I was just working on a diagram about it. Does what you’re thinking apply to this?

https://docs.google.com/file/d/0B9LMgeIAqlIEUGhBYXF2SVRnZ2s/edit?usp=docslist_api

Posted by: lee bloomquist on February 18, 2014 1:42 AM | Permalink | Reply to this

Re: Relative Entropy

Thanks for the feedback, Manoj!

I agree that the self-information function is a natural concept. For example, one can rewrite the partition function using self-information as

$$ Z(\beta) = \sum_i e^{-\beta\cdot\log(1/p_i)} $$

In terms of statistical physics, this means that one can think of $\log(1/p_i)$ as the energy of the system when it is in state $i$! But beyond this, I have to admit that I don’t yet see the role that you’re envisioning for self-information, but of course I could be wrong.

the one time that I took a look at Renyi entropy, I recall thinking that they were precisely the $L^p$ norms of self-information.

I don’t think that’s correct. Since $p$ already stands for the probability distribution, I will talk about the $L^\alpha$-norm of the self-information function, using the probabilities $p_i$ as weights as you seem to have in mind. This norm takes on the form

$$ \left( \sum_i p_i \left(\log\tfrac{1}{p_i}\right)^\alpha\right)^{1/\alpha}, $$

whereas Rényi entropy of order $\alpha$ is given by the different formula

$$ \frac{1}{1-\alpha}\log\left(\sum_i p_i^\alpha\right). $$

I think I’ve never seen your quantity before, but I’m sure that someone must have thought about it and figured out what it means!

So perhaps self-information is the right starting point for a dedecategorification of information theory.

My vision is that information theory will someday have a category-theoretical foundation, and this should constitute a kind of ‘categorification’ of information theory. I suspect that in order to achieve this, we will have to abandon the idea of measuring the amount of information in a random variable by assigning a real number to each distribution — or even a real-valued function like self-information. Rather, categorical information theory should be built on a category of processes, where a process takes a random variable as input, applies some kind of information processing, and spits out another random variable! Then, I imagine that Shannon’s source coding theorem can be rephrased as something like this: if we take the limit of composing a process many times in parallel with itself, then the isomorphism classes of objects in this new ‘asymptotic’ category are in bijection with numbers in $[0,\infty)$ via entropy.

Posted by: Tobias Fritz on February 18, 2014 2:45 AM | Permalink | Reply to this

Re: Relative Entropy

Hi Tobias,

I found the connection between Renyi entropies and $L^\alpha$-norms. You are right in pointing out that what I wrote above was not correct, but there is a connection, which is explained in the Wikipedia article for Renyi entropy. The connection is

$$ H_\alpha(X) = \frac{\alpha}{1-\alpha} \log \left(\|X\|_\alpha\right) $$

where $H_\alpha$ is the Renyi entropy of order $\alpha$, and $X$ is to be interpreted as the vector $(p_1,p_2,\dots,p_n)$ of probabilities.
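
A quick check of that relation with a made-up probability vector, reading $\|X\|_\alpha$ as the plain $\alpha$-norm of the vector of probabilities:

```python
import math

p = [0.5, 0.3, 0.2]                   # a made-up probability vector

def renyi_entropy(alpha, p):
    return math.log(sum(pi**alpha for pi in p)) / (1 - alpha)

def alpha_norm(alpha, p):
    return sum(pi**alpha for pi in p) ** (1 / alpha)

for alpha in [0.5, 2.0, 3.0]:
    lhs = renyi_entropy(alpha, p)
    rhs = (alpha / (1 - alpha)) * math.log(alpha_norm(alpha, p))
    print(alpha, lhs, rhs)            # the two columns agree
```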

I like your thoughts on a categorical information theory!

Posted by: Manoj Gopalkrishnan on February 20, 2014 9:19 AM | Permalink | Reply to this

Re: Relative Entropy

What are the consequences of this to irreproducible quasi stationary non equilibrium thermodynamic systems like climate?

In this context “irreproducibility” is defined as ability of some microstates belonging to the same macrostate to evolve to different macrostates in a short time, which is indeed the case with chaotic dynamics, so Jaynes entropy can’t easily be defined.

Earth is not black, although entropy production could be increased by lowering its albedo, as most of it occurs when incoming shortwave radiation gets absorbed and thermalized.

Albedo is an internally regulated quantity, not an external parameter.

see

Journal of Climate, Volume 26, Issue 2 (January 2013)
The Observed Hemispheric Symmetry in Reflected Shortwave Irradiance
Aiko Voigt, Bjorn Stevens, Jürgen Bader and Thorsten Mauritsen

Posted by: Peter Berenyi on February 16, 2014 10:49 PM | Permalink | Reply to this

Re: Relative Entropy

Hi Peter! Sorry for the slow reply, I think both John and I have had a lot of other stuff to do…

What are the consequences of this to irreproducible quasi stationary non equilibrium thermodynamic systems like climate?

At the moment, I don’t see anything useful that we can say about such systems. It’s a long shot: we’re trying to lay the foundation for building tools which could eventually help us to understand complex systems better. So, there are many steps in between which will have to be taken, and we have barely started! The next step could be to see what our results say about the maximum entropy principle and its application to thermodynamic systems.

The other work of John and his collaborators on network theory seems more directly applicable to what you have in mind: for example, Quantum Techniques for Stochastic Mechanics provides a toolbox of techniques which are directly applicable to complex evolving systems made out of interacting parts!

Posted by: Tobias Fritz on February 20, 2014 4:47 PM | Permalink | Reply to this

Re: Relative Entropy

My point is how do you define entropy processes with no clear distinction between micro- and macrostates? That is, in case there is a wide transition zone between them, spanning many orders of magnitude in a scale invariant manner, moreover macrostates are not preserved by dynamics under any definition thereof.

Posted by: Peter Berenyi on February 21, 2014 10:58 AM | Permalink | Reply to this

Re: Relative Entropy

Peter wrote:

how do you define entropy processes with no clear distinction between micro- and macrostates?

Since I don’t know what an “entropy process” is, I have some difficulties with understanding the question. Do you mean to ask “how do you define entropy production for processes with no clear distinction between micro- and macrostates”?

Before addressing that question, I think we need a precise definition of the terms microstate and macrostate. In non-equilibrium thermodynamics, it’s easy to get confused about these things, and in fact I am somewhat confused about this!

Let’s restrict the discussion to systems having a finite number of states. At this level, a thermodynamic system is just this: a finite set, which we interpret as the set of possible states. For example, take a gas in a box having two halves:

[Figure: particles in a box]

As a simplified model of that system, let’s pretend that we live in a universe in which particles don’t have momentum, and the position of a particle is defined only as to whether it’s on the left or on the right side of the box. Then, a state is nothing but a function which assigns to each particle either an $L$ or an $R$, depending on which side the particle is on. If there are $n$ particles, then there are $2^n$ such states. These states are what we can call microstates, since they contain the complete microscopic information about the system.

On the other hand, a macrostate is a probability distribution over microstates. For example, if we count $5$ particles on the left and $3$ on the right, then there are $\binom{5+3}{5}=56$ microstates which could give rise to that observation, so we would typically assign a probability of $1/56$ to each of these. Calling this a “macrostate” is reasonable because the information which we have is typically of a macroscopic nature, and so we represent our knowledge of the system by a probability distribution over microstates.

Alternatively, if we also include dynamics in our model and consider probing the system on sufficiently large timescales, then we can describe the results of our probing also in terms of such a probability distribution (a macrostate), which corresponds to the time-average microstate.

With these definitions, our categories $\mathbf{FinProb}$ and $\mathbf{FinStat}$ describe systems that are in a certain macrostate, and certain kinds of maps between such systems.

So, at least in this technical sense, there is always a distinction between microstates and macrostates, since they are different kinds of objects! A macrostate is a probability distribution over microstates. It doesn’t matter whether a macrostate is an equilibrium state or not.

The entropy of a macrostate is simply the Shannon entropy of the macrostate as a probability distribution. Entropy production occurs when you take two systems, whose time-averages are described by certain macrostates, and you let them interact. Then the macrostate describing the time-average of the composite system typically has an entropy bigger than the sum of the individual entropies.

What I am confused about: why is this total entropy typically bigger than the sum of the two individual entropies?

Does this all make sense to you?

Posted by: Tobias Fritz on February 23, 2014 4:01 PM | Permalink | Reply to this

Re: Relative Entropy

I am thinking that the entropy evaluation of Baez and Fritz is a demonstration that holds in different mathematical formalisms, so that it is like a demonstration that happens in different fields of mathematics, physics, biology or chemistry at the same time, and with different objects. I am thinking that the reverse is also true, so that an evaluation that gives entropy in some field can give a demonstration that can be proven in category theory, so that the Fermi demonstration in thermodynamics is similar to the category theory (it gives the same result with different, restricted demonstrations). I am thinking: is it possible to prove the theorem with convex linearity with only an infinitesimal parameter, so that the demonstration is true for infinitesimal neighbourhoods?

Posted by: domenico on February 20, 2014 10:27 AM | Permalink | Reply to this
