Let me add one nice explanation of relative entropy that I didn’t have time
to include in the class. I learned it from a MathOverflow answer by
Ashok to a question by John Baez.
Suppose we’re in the presence of some unknown probability distribution on a finite set $X$. We’re able to observe the results of drawing from this distribution some number $n$ of times; call these results $x_1, \ldots, x_n \in X$.
First question: what would we guess is the unknown distribution?
When the question is posed in that generality, the only reasonable answer
is that it’s exactly the distribution we observed. This is called the
empirical distribution of the data. Formally, it’s the linear combination
$$
\hat{p} = \frac{1}{n} \bigl( \delta_{x_1} + \cdots + \delta_{x_n} \bigr),
$$
where $\delta_x$ means the point mass (point/Dirac measure) at $x$. So if $x_1, \ldots, x_n$ are all distinct, the empirical distribution returns each of $x_1, \ldots, x_n$ with probability $1/n$. In general, it returns $x \in X$ with probability
$$
\hat{p}(x) = \frac{\#\{ i : x_i = x \}}{n}.
$$
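To make this concrete, here is a minimal sketch in Python (with made-up sample data; it’s not part of the original argument) of computing the empirical distribution from a list of observed draws.

```python
from collections import Counter

def empirical_distribution(samples):
    """Return the empirical distribution: p_hat(x) = #{i : x_i = x} / n."""
    n = len(samples)
    counts = Counter(samples)
    return {x: count / n for x, count in counts.items()}

# Hypothetical data: n = 8 draws from an unknown distribution on X = {'a', 'b', 'c'}.
samples = ['a', 'b', 'a', 'c', 'a', 'b', 'a', 'c']
p_hat = empirical_distribution(samples)
print(p_hat)  # {'a': 0.5, 'b': 0.25, 'c': 0.25}
```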
Second, more subtle, question: let’s now make the assumption that the unknown distribution belongs to a certain known family $(p_\theta)_{\theta \in \Theta}$ of distributions on $X$. Based on the data observed, which $p_\theta$ would we guess is the unknown distribution?
This is a classic question of statistical inference, and one way of
answering is to use the maximum likelihood method. This means choosing $\theta$ to maximize the probability of observing the data that was observed. Under the distribution $p_\theta$, that probability is
$$
p_\theta(x_1) \, p_\theta(x_2) \cdots p_\theta(x_n).
$$
So, we choose the value of $\theta$ that maximizes this probability. (Let’s assume for simplicity that there’s exactly one such $\theta$.) When we’re thinking of the data $x_1, \ldots, x_n$ as fixed and $\theta$ as variable, this probability is referred to as the likelihood of $\theta$.
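As an illustration (not from the original argument), here is a sketch of the maximum likelihood method for a hypothetical one-parameter family on $X = \{a, b, c\}$, namely $p_\theta = \bigl( \theta, \tfrac{1-\theta}{2}, \tfrac{1-\theta}{2} \bigr)$, maximized by brute-force search over a grid of $\theta$ values.

```python
import math

# Hypothetical one-parameter family on X = {'a', 'b', 'c'}:
# p_theta('a') = theta, p_theta('b') = p_theta('c') = (1 - theta) / 2.
def p(theta):
    return {'a': theta, 'b': (1 - theta) / 2, 'c': (1 - theta) / 2}

def log_likelihood(theta, samples):
    """log(p_theta(x_1) * ... * p_theta(x_n)); -inf if any factor is zero."""
    total = 0.0
    for x in samples:
        prob = p(theta)[x]
        if prob == 0:
            return -math.inf
        total += math.log(prob)
    return total

# The same hypothetical data as in the sketch above.
samples = ['a', 'b', 'a', 'c', 'a', 'b', 'a', 'c']
grid = [i / 1000 for i in range(1001)]
theta_ml = max(grid, key=lambda t: log_likelihood(t, samples))
print(theta_ml)  # 0.5: p_theta('a') matches the empirical frequency of 'a'
```

Maximizing the log of the likelihood rather than the likelihood itself changes nothing, since log is increasing; it just avoids numerical underflow when $n$ is large.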
But we can also think geometrically. Relative entropy isn’t a metric, but it behaves something like one, so we might wish to choose $\theta$ to minimize the “distance” between $p_\theta$ and the observed/empirical distribution $\hat{p}$. As $\theta$ moves within $\Theta$, think of $p_\theta$ as defining a curve in the space of all distributions on $X$. We’re looking for the point on the curve closest to $\hat{p}$.
More exactly, we want to minimize the entropy of $\hat{p}$ relative to $p_\theta$. So, choose $\theta$ to minimize $H(\hat{p} \,\|\, p_\theta)$.
Call this the “minimal entropy method”.
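Continuing the same hypothetical example, here is a sketch of the minimal entropy method: compute $H(\hat{p} \,\|\, p_\theta)$ over a grid of $\theta$ values and take the minimizer. (The family, data and grid are the invented ones from the previous sketch, not anything from the original argument.)

```python
import math

# Same hypothetical family and empirical distribution as in the earlier sketches.
def p(theta):
    return {'a': theta, 'b': (1 - theta) / 2, 'c': (1 - theta) / 2}

p_hat = {'a': 0.5, 'b': 0.25, 'c': 0.25}

def relative_entropy(p_hat, q):
    """H(p_hat || q) = sum over x of p_hat(x) * log(p_hat(x) / q(x))."""
    total = 0.0
    for x, px in p_hat.items():
        if px == 0:
            continue          # the convention 0 log 0 = 0
        if q[x] == 0:
            return math.inf   # infinite relative entropy if q vanishes where p_hat doesn't
        total += px * math.log(px / q[x])
    return total

grid = [i / 1000 for i in range(1001)]
theta_me = min(grid, key=lambda t: relative_entropy(p_hat, p(t)))
print(theta_me)  # 0.5 -- the same theta as the maximum likelihood search
```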
The punchline is this:
The maximum likelihood method and the minimal entropy method are
equivalent.
Indeed, a routine calculation shows that
$$
\frac{1}{n} \log \bigl( p_\theta(x_1) \cdots p_\theta(x_n) \bigr) = - H(\hat{p}) - H(\hat{p} \,\|\, p_\theta).
$$
Since the left-hand side is an increasing function of the likelihood $p_\theta(x_1) \cdots p_\theta(x_n)$, and the right-hand side is a decreasing function of the relative entropy $H(\hat{p} \,\|\, p_\theta)$, choosing $\theta$ to maximize the likelihood is equivalent to choosing $\theta$ to minimize the relative entropy.
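For the record, here is that routine calculation, under the usual conventions $H(\hat{p}) = -\sum_{x} \hat{p}(x) \log \hat{p}(x)$ and $H(\hat{p} \,\|\, p_\theta) = \sum_{x} \hat{p}(x) \log \bigl( \hat{p}(x) / p_\theta(x) \bigr)$, summing over the support of $\hat{p}$:
$$
\frac{1}{n} \log \bigl( p_\theta(x_1) \cdots p_\theta(x_n) \bigr)
= \frac{1}{n} \sum_{i=1}^n \log p_\theta(x_i)
= \sum_{x \in X} \hat{p}(x) \log p_\theta(x),
$$
while
$$
- H(\hat{p}) - H(\hat{p} \,\|\, p_\theta)
= \sum_{x \in X} \hat{p}(x) \log \hat{p}(x) - \sum_{x \in X} \hat{p}(x) \log \frac{\hat{p}(x)}{p_\theta(x)}
= \sum_{x \in X} \hat{p}(x) \log p_\theta(x).
$$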