Let us, for simplicity, assume that the event space $X$ is finite, with $K$ elements; to fix notation, say $X = \{1, \ldots, K\}$.
Suppose we have a probabilistic model on this space, that is, a parametrised probability distribution, which I denote by \begin{equation} p_θ(x) = p(x|θ) \end{equation}
Suppose now that we observe data. That means points $x_1,\ldots,x_N \in X$.
Likelihood
A natural idea is to try to maximise the likelihood that this data appears. Perhaps a better way to formulate this is that we want to minimise the surprise that this data occurs, given our probability model. We'll see that there is a very good definition of this “surprise”.
The likelihood of the data is just: \begin{equation} p(x_1|θ) p(x_2|θ) \cdots p(x_N|θ) \end{equation} It is more convenient to consider instead the log-likelihood: \begin{equation} \log(p(x_1|θ)) + \cdots + \log(p(x_N|θ)) \end{equation}
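As a quick numerical illustration (just a sketch; the categorical model and its parameters below are made up for the example), the likelihood and log-likelihood can be computed directly:

```python
import numpy as np

# Hypothetical model: a categorical distribution on X = {1, 2, 3}
# with made-up parameters theta = (0.2, 0.5, 0.3).
theta = np.array([0.2, 0.5, 0.3])

# Hypothetical observed data points x_1, ..., x_N.
data = np.array([2, 2, 1, 3, 2, 1, 2, 3])

likelihood = np.prod(theta[data - 1])              # p(x_1|theta) * ... * p(x_N|theta)
log_likelihood = np.sum(np.log(theta[data - 1]))   # sum of the logs

print(likelihood, log_likelihood)
print(np.isclose(np.log(likelihood), log_likelihood))  # True
```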
But now for a key observation: we can group together the terms corresponding to the same observed value. So, for instance, for the value three, let us write \[ [x = 3] := \# \{i \mid x_i = 3\}, \] the number of observations equal to three.
Now the log-likelihood becomes \begin{equation} [x = 1]\log(p(x=1|θ)) + \cdots + [x = K]\log(p(x=K|θ)) \end{equation}
Here comes the insight: the really important information here is carried by the empirical distribution, defined as \begin{equation} q(x = i) := \frac{[x = i]}{N} \end{equation}
That is simply the observed frequency of each value.
And now the log-likelihood, rescaled by $1/N$, takes a very interesting form: \begin{equation} q(x=1) \log(p(x=1|θ)) + \cdots + q(x=K) \log(p(x=K| θ)) \end{equation}
In other words, it so happens that the rescaled log-likelihood is just the empirical mean of the log of the model:
\begin{equation} E_q [\log(p_θ)] \end{equation}
This is quite beautiful!
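Here is a small check of this identity, continuing the made-up categorical example from above: the log-likelihood divided by $N$ coincides with $E_q[\log(p_θ)]$ computed from the empirical distribution.

```python
import numpy as np

theta = np.array([0.2, 0.5, 0.3])          # hypothetical model parameters on X = {1, 2, 3}
data = np.array([2, 2, 1, 3, 2, 1, 2, 3])  # hypothetical observations
N = len(data)

# Rescaled log-likelihood: (1/N) * sum_i log p(x_i | theta).
rescaled_loglik = np.sum(np.log(theta[data - 1])) / N

# Empirical distribution: q(x = i) = [x = i] / N.
q = np.array([np.sum(data == v) for v in (1, 2, 3)]) / N

# Expectation of log p_theta under q.
print(np.isclose(rescaled_loglik, np.sum(q * np.log(theta))))  # True
```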
Cross and Relative Entropy
There is more to this: suppose we could choose $p(\cdot |θ)$ in any way we wanted? In other words, suppose our model is universal (the distribution $p$ itself is the parameter). Then it turns out that the maximum possible value is attained when $p = q$! In equations: \begin{equation} E_q[\log(p)] ≤ E_q[\log(q)] \end{equation} (The quantity $-E_q[\log(p)]$ is known as the cross entropy of $q$ and $p$.)
By rewriting a bit, we obtain
\begin{equation} E_q [\log(q/p)] ≥ 0 \end{equation}
The quantity above is called the relative entropy of $q$ with respect to $p$ (also known as the Kullback–Leibler divergence), and is denoted by $D(q || p)$.
Perhaps an alternative name would be the surprise of “observing” the empirical distribution $q$ (remember, this is just the data), given a model $p$. Maximising the likelihood is then just minimising the surprise.
Note that the inequality \[ D(q||p) ≥0 \] is fundamental in statistics. Perhaps it should be called the fundamental inequality in statistics.
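A quick numerical sanity check of this inequality (a sketch with randomly generated distributions, nothing specific to any model):

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_entropy(q, p):
    """D(q || p) = E_q[log(q / p)] for distributions on a finite set (all entries > 0)."""
    return np.sum(q * np.log(q / p))

# Two random distributions on a 5-element event space.
q = rng.dirichlet(np.ones(5))
p = rng.dirichlet(np.ones(5))

print(relative_entropy(q, p) >= 0)              # True: D(q||p) >= 0
print(np.isclose(relative_entropy(q, q), 0.0))  # True: equality when p = q
```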
Why is the relative entropy nonnegative?
So, why is it that $D(q||p) ≥0$? There are several possible proofs. Here is one. Notice that $\log$ is a concave function, or equivalently that $-\log$ is a convex function. Essentially the definition of convexity of a function $f\colon V \to \RR$ ($V$ is a vector space) is that for any random variable $Y$ taking values in $V$, \[ f(E[Y]) ≤ E[f(Y)] \] (this is Jensen's inequality).
Now, take \(V = \RR\), and let \(Z\) be a random variable distributed according to \(q\). Define the random variable $Y := p(Z)/q(Z)$. Then $E[Y] = \int (p(x)/q(x)) q(x) \dd x = \int p(x) \dd x = 1$ (in our finite setting the integrals are just sums over $X$). Now, for any function \(f \colon \RR \to \RR\), $E[f(Y)] = \int f(p(x)/q(x)) q(x) \dd x$. In particular, with \(f = -\log\), this gives \(E[f(Y)] = D(q||p)\). Finally, \(f(E[Y]) = f(1) = 0\), and since \(f\) is convex, the inequality above gives \(E[f(Y)] ≥ f(E[Y]) = 0\), that is, precisely $D(q||p) ≥ 0$.
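The same argument can be mirrored numerically (again with made-up finite distributions): taking $Z \sim q$ and $Y = p(Z)/q(Z)$, we check that $E[Y] = 1$ and that $E[-\log(Y)] = D(q||p) ≥ -\log(E[Y]) = 0$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up distributions q and p on a 4-element space.
q = rng.dirichlet(np.ones(4))
p = rng.dirichlet(np.ones(4))

# Y = p(Z)/q(Z) with Z ~ q; expectations over Z are sums weighted by q.
y = p / q
E_Y = np.sum(q * y)                # = sum_x p(x) = 1
D_qp = np.sum(q * (-np.log(y)))    # = E[-log(Y)] = D(q || p)

print(np.isclose(E_Y, 1.0))        # True
print(D_qp >= -np.log(E_Y))        # True: Jensen, E[f(Y)] >= f(E[Y]) with f = -log
```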
Prior and Regularisation
What about maximum a posteriori estimation? That is, what if we have a prior $π(θ)$ on the parameter $θ$?
There is no big difference: the quantity to maximise is then \begin{equation} E_q[\log(p(\cdot | θ))] + \frac{1}{N}\log(π(θ)) \end{equation} where $π(θ)$ is the prior distribution on $θ$. Since the entropy of $q$ does not depend on $θ$, we can also reformulate this as minimising \[ D(q||p_θ) - \frac{1}{N}\log(π(θ)) \] Here, one minimises the surprise $D(q||p_θ)$, with $-\frac{1}{N}\log(π(θ))$ acting as a regularisation term.
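As an illustration (a minimal sketch; the Bernoulli model and the Beta(2, 2) prior are my own choices for the example, not anything prescribed above), the maximum a posteriori estimate can be found by minimising $D(q||p_θ) - \frac{1}{N}\log(π(θ))$ over a grid of parameter values:

```python
import numpy as np

# Hypothetical setup: Bernoulli model p_theta on X = {0, 1}, Beta(2, 2) prior on theta.
data = np.array([1, 0, 1, 1, 0, 1, 1, 1])
N = len(data)
q = np.array([np.mean(data == 0), np.mean(data == 1)])  # empirical distribution

def kl(q, p):
    return np.sum(q * np.log(q / p))

thetas = np.linspace(0.01, 0.99, 981)               # grid over the parameter
log_prior = np.log(thetas) + np.log(1 - thetas)     # Beta(2, 2), up to an additive constant

# Objective: D(q || p_theta) - (1/N) log(pi(theta)); the prior's constant does not matter.
objective = np.array([kl(q, np.array([1 - t, t])) for t in thetas]) - log_prior / N

theta_map = thetas[np.argmin(objective)]
print(theta_map)  # about 0.7 = (sum(data) + 1) / (N + 2), the usual Beta-Bernoulli MAP
```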
What if you want to go full Bayesian? The posterior distribution is again a function of the empirical distribution: \[ \log(p(θ \mid x_1, \ldots, x_N)) = -N\big(D(q||p_θ) + H(q)\big) + \log(π(θ)) - A \] where $H(q) := -E_q[\log(q)]$ is the entropy of $q$, and $A$ is a normalising constant, simply defined as \[ A := \log \Big( \int \exp\Big(-N\big(D(q||p_θ) + H(q)\big) + \log(π(θ))\Big) \dd θ \Big) \]
Here again, the dependence on the data is only through the empirical distribution $q$ (and the sample size $N$).
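To make this concrete, here is a sketch (continuing the hypothetical Bernoulli/Beta example from above) that computes the posterior on a grid using only $q$, $N$ and the prior:

```python
import numpy as np

# Continuing the hypothetical Bernoulli model with a Beta(2, 2) prior.
q = np.array([0.25, 0.75])   # empirical distribution of the data
N = 8                        # sample size

def kl(q, p):
    return np.sum(q * np.log(q / p))

thetas = np.linspace(0.001, 0.999, 999)
dtheta = thetas[1] - thetas[0]

H_q = -np.sum(q * np.log(q))                       # entropy of q
log_prior = np.log(thetas) + np.log(1 - thetas)    # Beta(2, 2), up to a constant

# Unnormalised log-posterior: -N (D(q||p_theta) + H(q)) + log pi(theta).
log_post = np.array([-N * (kl(q, np.array([1 - t, t])) + H_q) for t in thetas]) + log_prior

# Normalising constant A, computed with a simple Riemann sum over the grid.
A = np.log(np.sum(np.exp(log_post)) * dtheta)
posterior = np.exp(log_post - A)

print(np.isclose(np.sum(posterior) * dtheta, 1.0))  # True: the posterior integrates to 1
```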
References
Some references on information theory and Bayesian estimation are
- On cross/relative entropy: Elements of Information Theory by Cover & Thomas.
- On Bayesian estimation in general: Machine Learning: A Probabilistic Perspective by Murphy.