Entropy Modulo a Prime (Continued)
Posted by Tom Leinster
In the comments last time, a conversation got going about $p$-adic entropy. But here I’ll return to the original subject: entropy modulo $p$. I’ll answer the question:
Given a “probability distribution” mod $p$, that is, a tuple $\pi = (\pi_1, \ldots, \pi_n)$ of elements of $\mathbb{Z}/p\mathbb{Z}$ summing to $1$, what is the right definition of its entropy $H(\pi)$?
How will we know when we’ve got the right definition? As I explained last time, the acid test is whether it satisfies the chain rule
$$
H\bigl(\pi \circ (\gamma^1, \ldots, \gamma^n)\bigr) = H(\pi) + \sum_{i=1}^n \pi_i H(\gamma^i).
$$
This is supposed to hold for all $\pi \in \Pi_n$ and $\gamma^1 \in \Pi_{k_1}, \ldots, \gamma^n \in \Pi_{k_n}$, where $\Pi_n$ is the hyperplane
$$
\Pi_n = \bigl\{ (\pi_1, \ldots, \pi_n) \in (\mathbb{Z}/p\mathbb{Z})^n : \pi_1 + \cdots + \pi_n = 1 \bigr\}
$$
whose elements we’re calling “probability distributions” mod $p$. And if God is smiling on us, $H$ will be essentially the only quantity that satisfies the chain rule. Then we’ll know we’ve got the right definition.
Black belts in functional equations will be able to use the chain rule and nothing else to work out what $H$ must be. But the rest of us might like an extra clue, and we have one in the definition of real Shannon entropy:
$$
H(\pi) = -\sum_{i=1}^n \pi_i \log \pi_i.
$$
Now, we saw last time that there is no logarithm mod $p$; that is, there is no nontrivial group homomorphism
$$
\bigl( (\mathbb{Z}/p\mathbb{Z})^\times, \cdot \bigr) \to \bigl( \mathbb{Z}/p\mathbb{Z}, + \bigr).
$$
But there is a next-best thing: a homomorphism
$$
\bigl( (\mathbb{Z}/p^2\mathbb{Z})^\times, \cdot \bigr) \to \bigl( \mathbb{Z}/p\mathbb{Z}, + \bigr).
$$
This is called the Fermat quotient $q_p$, and it’s given by
$$
q_p(n) = \frac{n^{p-1} - 1}{p} \pmod{p}.
$$
Let’s go through why this works.
The elements of $(\mathbb{Z}/p^2\mathbb{Z})^\times$ are the congruence classes mod $p^2$ of the integers not divisible by $p$. Fermat’s little theorem says that whenever $n$ is not divisible by $p$,
$$
\frac{n^{p-1} - 1}{p}
$$
is an integer. This, or rather its congruence class mod $p$, is the Fermat quotient. The congruence class of $n$ mod $p^2$ determines the congruence class of $n^{p-1}$ mod $p^2$, and it therefore determines the congruence class of $(n^{p-1} - 1)/p$ mod $p$. So, $q_p$ defines a function $(\mathbb{Z}/p^2\mathbb{Z})^\times \to \mathbb{Z}/p\mathbb{Z}$. It’s a pleasant exercise to show that it’s a homomorphism. In other words, $q_p$ has the log-like property
$$
q_p(m n) = q_p(m) + q_p(n)
$$
for all integers $m, n$ not divisible by $p$.
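Here is a minimal computational check of this log-like property. (This is my own illustration, not part of the post; the name fermat_quotient is just for this sketch.)

```python
def fermat_quotient(n, p):
    # q_p(n) = (n^(p-1) - 1)/p, reduced mod p, for n not divisible by p
    return ((pow(n, p - 1, p * p) - 1) // p) % p

# Check the homomorphism property q_p(mn) = q_p(m) + q_p(n) (mod p) for p = 7.
p = 7
for m in range(1, 50):
    for n in range(1, 50):
        if m % p != 0 and n % p != 0:
            assert fermat_quotient(m * n, p) == (fermat_quotient(m, p) + fermat_quotient(n, p)) % p
```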
In fact, it’s essentially unique as such. Any other homomorphism $(\mathbb{Z}/p^2\mathbb{Z})^\times \to \mathbb{Z}/p\mathbb{Z}$ is a scalar multiple of $q_p$. (This follows from the classical theorem that the group $(\mathbb{Z}/p^2\mathbb{Z})^\times$ is cyclic.) It’s just like the fact that up to a scalar multiple, the real logarithm is the unique measurable function $f \colon (0, \infty) \to \mathbb{R}$ such that $f(x y) = f(x) + f(y)$, but here there’s nothing like measurability complicating things.
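To see why cyclicity does the work, note that a homomorphism out of a cyclic group is determined by where it sends a generator. A small numerical illustration of this (mine; it uses the easily checked fact that $2$ generates $(\mathbb{Z}/25\mathbb{Z})^\times$):

```python
def fermat_quotient(n, p):
    return ((pow(n, p - 1, p * p) - 1) // p) % p

# For p = 5, the group (Z/25)* is cyclic of order 20, generated by 2.
# Any homomorphism to (Z/5, +) is determined by the image of the generator,
# hence is a scalar multiple of q_5.  Spot check for q_5 itself:
p = 5
for k in range(p * (p - 1)):
    assert fermat_quotient(pow(2, k, p * p), p) == (k * fermat_quotient(2, p)) % p
```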
So: $q_p$ functions as a kind of logarithm. Given a mod $p$ probability distribution $\pi = (\pi_1, \ldots, \pi_n)$, we might therefore guess that the right definition of its entropy is
$$
-\sum_{i=1}^n \pi_i \, q_p(a_i),
$$
where $a_i$ is an integer representing $\pi_i$. However, this doesn’t work. It depends on the choice of representatives $a_i$.
To get the right answer, we’ll look at real entropy in a slightly different way. Define $\partial \colon [0, \infty) \to \mathbb{R}$ by
$$
\partial(x) = -x \log x
$$
(with the convention $\partial(0) = 0$). Then $\partial$ has the derivative-like property
$$
\partial(x y) = x \, \partial(y) + \partial(x) \, y.
$$
A linear map with this property is called a derivation, so it’s reasonable to call $\partial$ a nonlinear derivation.
The observation that $\partial$ is a nonlinear derivation turns out to be quite useful. For instance, real entropy is given by
$$
H(\pi) = \sum_{i=1}^n \partial(\pi_i)
$$
($\pi \in \Delta_n$, the simplex of real probability distributions on $n$ elements), and verifying the chain rule for $H$ is done most neatly using the derivation property of $\partial$.

An equivalent formula for real entropy is
$$
H(\pi) = \sum_{i=1}^n \partial(\pi_i) - \partial\Bigl( \sum_{i=1}^n \pi_i \Bigr).
$$
This is a triviality: $\sum_i \pi_i = 1$, so $\partial\bigl(\sum_i \pi_i\bigr) = 0$, so this is the same as the previous formula. But it’s also quite suggestive: $H(\pi)$ measures the extent to which the nonlinear derivation $\partial$ fails to preserve the sum $\sum_i \pi_i$.
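A quick numerical illustration of both points (my own sketch, not from the post):

```python
import math

def d(x):
    # The nonlinear derivation: d(x) = -x log x, with d(0) = 0.
    return 0.0 if x == 0 else -x * math.log(x)

# Derivative-like property: d(xy) = x d(y) + d(x) y
x, y = 0.3, 0.55
assert abs(d(x * y) - (x * d(y) + d(x) * y)) < 1e-12

# Entropy as the failure of d to preserve the sum of the probabilities
pi = [0.5, 0.25, 0.125, 0.125]
H = sum(d(q) for q in pi) - d(sum(pi))
assert abs(H - sum(-q * math.log(q) for q in pi)) < 1e-12
```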
Now let’s try to imitate this in $\mathbb{Z}/p\mathbb{Z}$. Since $q_p$ plays a similar role to $\log$, it’s natural to define
$$
\partial(n) = -n \, q_p(n) = \frac{n - n^p}{p}
$$
for integers $n$ not divisible by $p$. But the last expression makes sense even if $n$ is divisible by $p$. So, we can define a function
$$
\partial_p \colon \mathbb{Z} \to \mathbb{Z}/p\mathbb{Z}
$$
by $\partial_p(n) = \dfrac{n - n^p}{p} \pmod{p}$. (This is called a $p$-derivation.) It’s easy to check that $\partial_p$ has the derivative-like property
$$
\partial_p(m n) = m \, \partial_p(n) + \partial_p(m) \, n.
$$
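Again just as a sanity check (my own code, with d_p as an ad hoc name), the Leibniz-like rule can be verified numerically:

```python
def d_p(n, p):
    # The p-derivation of the post, reduced mod p: (n - n^p)/p.
    return ((n - n ** p) // p) % p

# Check d_p(mn) = m d_p(n) + d_p(m) n  (mod p) for p = 5.
p = 5
for m in range(-20, 21):
    for n in range(-20, 21):
        assert d_p(m * n, p) == (m * d_p(n, p) + d_p(m, p) * n) % p
```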
And now we arrive at the long-awaited definition. The entropy mod $p$ of $\pi = (\pi_1, \ldots, \pi_n) \in \Pi_n$ is
$$
H_p(\pi) = \sum_{i=1}^n \partial_p(a_i) - \partial_p\Bigl( \sum_{i=1}^n a_i \Bigr) \in \mathbb{Z}/p\mathbb{Z},
$$
where $a_i \in \mathbb{Z}$ represents $\pi_i$. This is independent of the choice of representatives $a_1, \ldots, a_n$. And when you work it out explicitly, it gives
$$
H_p(\pi) = \frac{1}{p} \Bigl( \Bigl( \sum_{i=1}^n a_i \Bigr)^{\!p} - \sum_{i=1}^n a_i^p \Bigr) \pmod{p}.
$$
Just as in the real case, $H_p$ satisfies the chain rule, which is most easily shown using the derivation property of $\partial_p$.
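Here is a short Python sketch of this definition (mine, not from the post), checking numerically that the answer doesn’t depend on the chosen representatives and that the chain rule holds. It uses the composition recalled from last time, in which the composite distribution $\pi \circ (\gamma^1, \ldots, \gamma^n)$ has entries $\pi_i \gamma^i_j$.

```python
import random

def d_p(n, p):
    return ((n - n ** p) // p) % p

def H_p(pi, p):
    # Entropy mod p of a tuple pi over Z/p summing to 1, via one choice of representatives.
    a = [x % p for x in pi]
    assert sum(a) % p == 1
    return (sum(d_p(x, p) for x in a) - d_p(sum(a), p)) % p

p = 7
pi = (3, 2, 3)                                   # 3 + 2 + 3 = 8 ≡ 1 (mod 7)

# Independence of the choice of representatives
for _ in range(200):
    a = [x + p * random.randint(-5, 5) for x in pi]
    h = (sum(d_p(x, p) for x in a) - d_p(sum(a), p)) % p
    assert h == H_p(pi, p)

# Chain rule: H(pi o (g^1, ..., g^n)) = H(pi) + sum_i pi_i H(g^i)  (mod p)
gammas = [(5, 3), (2, 6), (4, 2, 2)]             # each sums to 1 mod 7
composite = tuple((pi[i] * x) % p for i in range(len(pi)) for x in gammas[i])
lhs = H_p(composite, p)
rhs = (H_p(pi, p) + sum(pi[i] * H_p(g, p) for i, g in enumerate(gammas))) % p
assert lhs == rhs
```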
Before I say any more, let’s have some examples.
In the real case, the uniform distribution $u_n = (1/n, \ldots, 1/n)$ has entropy $\log n$. Mod $p$, this distribution only makes sense if $p$ does not divide $n$ (otherwise $1/n$ is undefined); but assuming that, we do indeed have $H_p(u_n) = q_p(n)$, as we’d expect.

When we take our prime to be $2$, a probability distribution is just a sequence of bits like $(0, 1, 1, 0, 1)$ with an odd number of $1$s. Its entropy turns out to be $0$ if the number of $1$s is congruent to $1$ mod $4$, and $1$ if the number of $1$s is congruent to $3$ mod $4$.
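Both examples are easy to confirm numerically. Here’s a quick check (my own code; the helper definitions are repeated so that the snippet runs on its own):

```python
from itertools import product

def d_p(n, p):
    return ((n - n ** p) // p) % p

def H_p(pi, p):
    a = [x % p for x in pi]
    return (sum(d_p(x, p) for x in a) - d_p(sum(a), p)) % p

def fermat_quotient(n, p):
    return ((pow(n, p - 1, p * p) - 1) // p) % p

# Uniform distribution: H_p(1/n, ..., 1/n) = q_p(n), provided p does not divide n.
p, n = 7, 10
inv_n = pow(n, p - 2, p)                 # 1/n in Z/p
assert H_p([inv_n] * n, p) == fermat_quotient(n, p)

# p = 2: the entropy of a bit string with an odd number of 1s depends only
# on that number mod 4.
for bits in product((0, 1), repeat=7):
    ones = sum(bits)
    if ones % 2 == 1:
        assert H_p(bits, 2) == (0 if ones % 4 == 1 else 1)
```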
What about distributions on two elements? In other words, let $\pi \in \mathbb{Z}/p\mathbb{Z}$ and put $\sigma = (\pi, 1 - \pi) \in \Pi_2$. What is $H_p(\sigma)$?

It takes a bit of algebra to figure this out, but it’s not too hard, and the outcome is that for $\pi \in \mathbb{Z}/p\mathbb{Z}$,
$$
H_p(\pi, 1 - \pi) = \sum_{r=1}^{p-1} \frac{\pi^r}{r}.
$$
This function was, in fact, the starting point of Kontsevich’s note, and it’s what he called the $1\tfrac{1}{2}$-logarithm.
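Here is a numerical confirmation (again my own sketch) that the two-element entropy agrees with this truncated series:

```python
def d_p(n, p):
    return ((n - n ** p) // p) % p

def H_p(pi, p):
    a = [x % p for x in pi]
    return (sum(d_p(x, p) for x in a) - d_p(sum(a), p)) % p

def half_log(x, p):
    # Kontsevich's 1½-logarithm: sum_{r=1}^{p-1} x^r / r in Z/p.
    return sum(pow(x, r, p) * pow(r, p - 2, p) for r in range(1, p)) % p

for p in (3, 5, 7, 11, 13):
    for x in range(p):
        assert H_p((x, 1 - x), p) == half_log(x, p)
```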
We’ve now succeeded in finding a definition of entropy mod $p$ that satisfies the chain rule. That’s not quite enough, though. In principle, there could be loads of things satisfying the chain rule, in which case, what special status would ours have?
But in fact, up to the inevitable constant factor, our definition of entropy mod $p$ is the one and only definition satisfying the chain rule:

Theorem   Let $\bigl( H \colon \Pi_n \to \mathbb{Z}/p\mathbb{Z} \bigr)_{n \geq 1}$ be a sequence of functions. Then $H$ satisfies the chain rule if and only if $H = c H_p$ for some $c \in \mathbb{Z}/p\mathbb{Z}$.
This is precisely analogous to the characterization theorem for real entropy, except that in the real case some analytic condition on $H$ has to be imposed (continuity in Faddeev’s theorem, and measurability in the stronger theorem of Lee). So, this is excellent justification for calling $H_p$ the entropy mod $p$.
I’ll say nothing about the proof except the following. In Faddeev’s theorem over $\mathbb{R}$, the tricky part of the proof involves the fact that the sequence $(\log n)_{n \geq 1}$ is not uniquely characterized up to a constant factor by the equation $f(m n) = f(m) + f(n)$; to make that work, you have to introduce some analytic condition. Over $\mathbb{Z}/p\mathbb{Z}$, the tricky part involves the fact that the domain of the “logarithm” (Fermat quotient) is not $(\mathbb{Z}/p\mathbb{Z})^\times$, but $(\mathbb{Z}/p^2\mathbb{Z})^\times$. So, analytic difficulties are replaced by number-theoretic difficulties.
Kontsevich didn’t actually write down a definition of entropy mod $p$ in his two-and-a-half page note. He did exactly enough to show that there must be a unique sensible such definition… and left it there! Of course he could have worked it out if he’d wanted to, and maybe he even did, but he didn’t write it up here.
Anyway, let’s return to the quotation from Kontsevich that I began my first post with:
Conclusion: If we have a random variable $\xi$ which takes finitely many values with all probabilities in $\mathbb{Q}$ then we can define not only the transcendental number $H(\xi)$ but also its “residues modulo $p$” for almost all primes $p$!
In the notation of these posts, he’s saying the following. Let
$$
\pi = (\pi_1, \ldots, \pi_n)
$$
be a real probability distribution in which each $\pi_i$ is rational. There are only finitely many primes $p$ that divide one or more of the denominators of $\pi_1, \ldots, \pi_n$. For primes $p$ not belonging to this exceptional set, we can interpret $\pi$ as a probability distribution in $\Pi_n$. We can therefore take its mod $p$ entropy, $H_p(\pi) \in \mathbb{Z}/p\mathbb{Z}$.
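Concretely, interpreting a rational probability mod $p$ just means multiplying its numerator by the inverse of its denominator in $\mathbb{Z}/p\mathbb{Z}$. A small sketch (mine, not from the post):

```python
from fractions import Fraction

def d_p(n, p):
    return ((n - n ** p) // p) % p

def H_p(pi, p):
    a = [x % p for x in pi]
    return (sum(d_p(x, p) for x in a) - d_p(sum(a), p)) % p

def reduce_mod(q, p):
    # Interpret a rational q whose denominator is prime to p as an element of Z/p.
    return q.numerator * pow(q.denominator, p - 2, p) % p

pi = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]    # a rational distribution
p = 7                                                    # 7 divides none of the denominators
print(H_p([reduce_mod(q, p) for q in pi], p))            # its entropy mod 7
```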
Kontsevich is playfully suggesting that we view $H_p(\pi)$ as the residue class mod $p$ of the real number $H(\pi)$.
There is more to this than meets the eye! Different real probability distributions can have the same real entropy, so there’s a question of consistency. Kontsevich’s suggestion only makes sense if
$$
H(\pi) = H(\sigma) \implies H_p(\pi) = H_p(\sigma)
$$
whenever $\pi$ and $\sigma$ are probability distributions of this kind (rational, with denominators not divisible by $p$).
And this is true! I have a proof, though I’m not convinced it’s optimal. Does anyone see an easy argument for this?
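As a sanity check (not a proof), here is one numerical instance of the consistency, using an example of my own: $(\tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4})$ and $(\tfrac{1}{2}, \tfrac{1}{8}, \tfrac{1}{8}, \tfrac{1}{8}, \tfrac{1}{8})$ are different distributions with the same real entropy $2 \log 2$, and their entropies mod $p$ also agree for every odd prime tested.

```python
def d_p(n, p):
    return ((n - n ** p) // p) % p

def H_p(pi, p):
    a = [x % p for x in pi]
    return (sum(d_p(x, p) for x in a) - d_p(sum(a), p)) % p

# (1/4, 1/4, 1/4, 1/4) and (1/2, 1/8, 1/8, 1/8, 1/8) have the same real
# entropy, 2 log 2.  Check that their entropies mod p also agree.
for p in (3, 5, 7, 11, 13, 17):
    inv = lambda k: pow(k, p - 2, p)                     # 1/k in Z/p
    assert H_p([inv(4)] * 4, p) == H_p([inv(2)] + [inv(8)] * 4, p)
```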
Let’s write $\mathcal{E}_p$ for the set of real numbers of the form $H(\pi)$, where $\pi$ is a real probability distribution whose probabilities can all be expressed as fractions with denominator not divisible by $p$. We’ve just seen that there’s a well-defined map
$$
\mathcal{E}_p \to \mathbb{Z}/p\mathbb{Z}
$$
defined by
$$
H(\pi) \mapsto H_p(\pi).
$$
For $x = H(\pi) \in \mathcal{E}_p$, we view $H_p(\pi)$ as the congruence class mod $p$ of $x$. This notion of “congruence class” even behaves something like the ordinary notion, in the sense that this map preserves addition.
(We can even go a bit further. Accompanying the characterization theorem for entropy mod $p$, there is a characterization theorem for information loss mod $p$, strictly analogous to the theorem that John Baez, Tobias Fritz and I proved over $\mathbb{R}$. I won’t review that stuff here, but the point is that an information loss is a difference of entropies, and this enables us to define the congruence class mod $p$ of the difference of two elements of $\mathcal{E}_p$. The same additivity holds.)
There’s just one more thing. In a way, the definition of entropy mod $p$ is unsatisfactory. In order to define it, we had to step outside the world of $\mathbb{Z}/p\mathbb{Z}$ by making arbitrary choices of representing integers, and then we had to show that the definition was independent of those choices. Can’t we do it directly?
In fact, we can. It’s a well-known miracle about finite fields that any function $\mathbb{Z}/p\mathbb{Z} \to \mathbb{Z}/p\mathbb{Z}$ is a polynomial. It’s a slightly less well-known miracle that any function $(\mathbb{Z}/p\mathbb{Z})^n \to \mathbb{Z}/p\mathbb{Z}$, for any $n$, is also a polynomial.

Of course, multiple polynomials can induce the same function. For instance, the polynomials $x$ and $x^p$ induce the same function $\mathbb{Z}/p\mathbb{Z} \to \mathbb{Z}/p\mathbb{Z}$. But it’s still possible to make a uniqueness statement. Given a function $f \colon (\mathbb{Z}/p\mathbb{Z})^n \to \mathbb{Z}/p\mathbb{Z}$, there’s a unique polynomial that induces $f$ and is of degree less than $p$ in each variable separately.
So, there must be a polynomial representing entropy, of degree less than $p$ in each variable. And as it turns out, it’s this one:
$$
H_p(\pi_1, \ldots, \pi_n) = -\sum_{\substack{0 \leq r_1, \ldots, r_n < p \\ r_1 + \cdots + r_n = p}} \frac{\pi_1^{r_1} \cdots \pi_n^{r_n}}{r_1! \cdots r_n!}.
$$
You can check that when $n = 2$, this is in fact the same polynomial as we met before — Kontsevich’s $1\tfrac{1}{2}$-logarithm.
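Here’s a check (my own code) that evaluating this polynomial agrees with the representative-based definition, and that for $n = 2$ it matches the $1\tfrac{1}{2}$-logarithm, for a few small primes:

```python
from itertools import product
from math import factorial

def d_p(n, p):
    return ((n - n ** p) // p) % p

def H_p(pi, p):
    a = [x % p for x in pi]
    return (sum(d_p(x, p) for x in a) - d_p(sum(a), p)) % p

def H_poly(pi, p):
    # Evaluate the direct polynomial formula for entropy mod p.
    total = 0
    for rs in product(range(p), repeat=len(pi)):
        if sum(rs) != p:
            continue
        term = 1
        for x, r in zip(pi, rs):
            term = term * pow(x, r, p) * pow(factorial(r), p - 2, p) % p
        total = (total + term) % p
    return (-total) % p

for p in (3, 5, 7):
    # Agreement with the earlier definition, on all length-3 distributions
    for pi in product(range(p), repeat=3):
        if sum(pi) % p == 1:
            assert H_poly(pi, p) == H_p(pi, p)
    # n = 2: the polynomial is Kontsevich's 1½-logarithm
    for x in range(p):
        assert H_poly((x, (1 - x) % p), p) == sum(
            pow(x, r, p) * pow(r, p - 2, p) for r in range(1, p)) % p
```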
It’s striking that this direct formula for entropy modulo a prime looks quite unlike the formula for real entropy,
$$
H(\pi) = -\sum_{i=1}^n \pi_i \log \pi_i.
$$
It’s also striking that in the case $n = 2$, the formula for real entropy is
$$
-\pi \log \pi - (1 - \pi) \log (1 - \pi),
$$
whereas mod $p$, we get
$$
\sum_{r=1}^{p-1} \frac{\pi^r}{r},
$$
which is a truncation of the Taylor series of $-\log(1 - \pi)$. And yet, the characterization theorems for entropy over $\mathbb{R}$ and over $\mathbb{Z}/p\mathbb{Z}$ are strictly analogous.
As I see it, there are two or three big open questions:
Entropy over $\mathbb{R}$ can be understood, interpreted and applied in many ways. How can we understand, interpret or apply entropy mod $p$?

Entropy over $\mathbb{R}$ and entropy mod $p$ are defined in roughly analogous ways, and uniquely characterized by strictly analogous theorems. Is there a common generalization? That is, can we unify the two definitions and characterization theorems, perhaps proving a theorem about entropy over suitable fields?