Let us, for simplicity, assume that the event space $X$ is finite; to fix notation, say that its elements are the values $0, 1, \ldots, K$.

Suppose we have a probabilistic model on this space, that is, a parametrised probability distribution, which I denote by \begin{equation} p_θ(x) \end{equation} (I will also write $p(x|θ)$ for the same thing).
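For instance (just as an illustration, not the only possible choice), on $X = \{0, 1\}$ one could take the Bernoulli family \[ p_θ(1) = θ, \qquad p_θ(0) = 1 - θ, \qquad θ \in [0, 1]. \]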

Suppose now that we observe data: that is, $N$ points $x_1,\ldots,x_N \in X$.

Likelihood

A natural idea is to try to maximise the likelihood that this data appears. Perhaps a better way to formulate this is that we want to minimise the surprise that this data occurs, given our probability model. We'll see that there is a very good definition of the "surprise".

The likelihood of the data is just: \begin{equation} p(x_1|θ)\, p(x_2|θ) \cdots p(x_N|θ) \end{equation} It is more convenient to consider instead the log-likelihood: \begin{equation} \log(p(x_1|θ)) + \cdots + \log(p(x_N|θ)) \end{equation}
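As a tiny numerical sketch (the model probabilities and data below are made up for illustration; `p_theta[i]` stands for $p(x=i|θ)$):

```python
import numpy as np

# Hypothetical categorical model on X = {0, 1, 2}: p_theta[i] = p(x = i | theta)
p_theta = np.array([0.2, 0.5, 0.3])

# Hypothetical observed data points x_1, ..., x_N
data = np.array([1, 2, 1, 0, 1, 2])

likelihood = np.prod(p_theta[data])              # p(x_1|theta) * ... * p(x_N|theta)
log_likelihood = np.sum(np.log(p_theta[data]))   # log p(x_1|theta) + ... + log p(x_N|theta)
print(likelihood, log_likelihood)
```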

But here comes a first observation: we can group together the terms corresponding to the same observed value. So, for the value three, let us write the count \[ [x = 3] := \# \{i | x_i = 3\} \]

Now the log-likelihood becomes \begin{equation} [x = 0]\log(p(x=0|θ)) + \cdots + [x = K]\log(p(x=K|θ)) \end{equation}

Here comes the insight: the really important information here is provided by the empirical distribution, defined for each value $i$ as \begin{equation} q(x = i) := \frac{[x = i]}{N} \end{equation}

That is simply the observed frequency of each value.
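For instance, with the made-up data $(1, 2, 1, 0, 1, 2)$ from the sketch above, $N = 6$, the counts are $[x=0] = 1$, $[x=1] = 3$, $[x=2] = 2$, and so \[ q(x=0) = \tfrac{1}{6}, \qquad q(x=1) = \tfrac{1}{2}, \qquad q(x=2) = \tfrac{1}{3}. \]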

And now the log-likelihood, rescaled by $1/N$, takes a very interesting form: \begin{equation} q(x=0) \log(p(x=0|θ)) + \cdots + q(x=K) \log(p(x=K| θ)) \end{equation}

In other words, it so happens that the rescaled log-likelihood is just the empirical mean of the log of the model:

\begin{equation} E_q [\log(p_θ)] \end{equation}
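Continuing the numerical sketch above (same made-up model and data), one can check this identity directly:

```python
import numpy as np

p_theta = np.array([0.2, 0.5, 0.3])      # hypothetical model, as above
data = np.array([1, 2, 1, 0, 1, 2])      # hypothetical data, as above
N = len(data)

counts = np.bincount(data, minlength=len(p_theta))   # [x = i] for each value i
q = counts / N                                        # empirical distribution

lhs = np.sum(np.log(p_theta[data])) / N   # rescaled log-likelihood: (1/N) sum_i log p(x_i|theta)
rhs = np.sum(q * np.log(p_theta))         # E_q[log p_theta]
assert np.isclose(lhs, rhs)
```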

This is quite beautiful!

Cross and Relative Entropy

There is more to this: suppose we could choose $p(\cdot |θ)$ in any way we wanted. In other words, suppose our model is universal (the distribution $p$ itself is the parameter). Then it turns out that the maximum possible value is attained when $p = q$! In equations: \begin{equation} E_q[\log(p)] ≤ E_q[\log(q)] \end{equation}

By rewriting a bit, we obtain

\begin{equation} E_q [\log(q/p)] ≥ 0 \end{equation}

The quantity above is called the relative entropy of $q$ with respect to $p$, and is denoted by $D(q || p)$. The related quantity $-E_q[\log(p)]$ is called the cross entropy of $q$ and $p$; it decomposes as $D(q || p) + H(q)$, where $H(q) := -E_q[\log(q)]$ is the entropy of $q$.
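A quick numerical check of this inequality, with the same made-up distributions as in the sketches above (the helper function below is just for illustration):

```python
import numpy as np

def relative_entropy(q, p):
    """D(q || p) = E_q[log(q / p)], assuming q and p are strictly positive."""
    return np.sum(q * np.log(q / p))

q = np.array([1/6, 1/2, 1/3])    # hypothetical empirical distribution
p = np.array([0.2, 0.5, 0.3])    # hypothetical model distribution

print(relative_entropy(q, p))    # > 0, since p != q
print(relative_entropy(q, q))    # 0.0: the maximum of E_q[log p] is attained at p = q
```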

Perhaps an alternative name would be the surprise of “observing” the empirical distribution $q$ (remember, this is just the data), given a model $p$. Maximising the likelihood is then just minimising the surprise.

Note that the inequality \[ D(q||p) ≥0 \] is fundamental in statistics. Perhaps it should be called the fundamental inequality in statistics.

So, why is it that $D(q||p) ≥ 0$? There are several possible proofs. Here is one. Notice that $\log$ is a concave function, or equivalently that $-\log$ is a convex function. Essentially, the definition of convexity of a function $f\colon V \to \RR$ (where $V$ is a vector space) is that for any probability measure $μ$ on $V$, \[ f\Big(\int y \dd μ(y)\Big) ≤ \int f(y) \dd μ(y) \] This can be nicely rewritten as \[ f(E[Y]) ≤ E[f(Y)] \] for any random variable $Y$ with values in the vector space $V$; this is Jensen's inequality.

Now choose the random variable $Y \colon X \to \RR$ defined by $Y(x) = p(x)/q(x)$, and use the empirical distribution $q$ as the probability on $X$ (assume for simplicity that $q$ and $p$ are everywhere positive). Then $E[Y] = \sum_x (p(x)/q(x))\, q(x) = \sum_x p(x) = 1$, and $E[f(Y)] = \sum_x f(p(x)/q(x))\, q(x) = E_q[f(p/q)]$. Now use the convex function $f(y) = -\log(y)$ to obtain $D(q||p) ≥ 0$.
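Spelled out, the chain of (in)equalities is \[ D(q||p) = E_q\Big[\log \frac{q}{p}\Big] = E_q[f(Y)] ≥ f\big(E_q[Y]\big) = -\log(1) = 0. \]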

Prior and Regularisation

What about the maximum a posteriori estimate? That is, what if we have a prior $π(θ)$ on the parameter $θ$?

There is no big difference: the quantity to maximise is then \begin{equation} E_q[\log(p(\cdot | θ))] + \frac{1}{N}\log(π(θ)) \end{equation} (after rescaling by $N$, as before), where $π(θ)$ is the prior distribution on $θ$. Since $-E_q[\log(p_θ)] = D(q||p_θ) + H(q)$ and $H(q)$ does not depend on $θ$, we can also reformulate this as minimising \[ D(q||p_θ) - \frac{1}{N}\log(π(θ)) \] Here, one minimises the surprise $D(q||p_θ)$, with $-\frac{1}{N}\log(π(θ))$ as a regularisation term.
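Here is a minimal sketch of such a MAP computation, assuming a hypothetical Bernoulli model $p_θ(1) = θ$, a Beta(2, 2)-style prior $π(θ) \propto θ(1-θ)$, and made-up data; the grid search is just for illustration:

```python
import numpy as np

data = np.array([1, 1, 0, 1, 0, 1, 1, 1])   # made-up binary observations
N = len(data)
q1 = data.mean()                             # empirical frequency q(x = 1)

thetas = np.linspace(0.001, 0.999, 999)
log_likelihood = N * (q1 * np.log(thetas) + (1 - q1) * np.log(1 - thetas))
log_prior = np.log(thetas) + np.log(1 - thetas)   # log pi(theta), up to a constant

theta_map = thetas[np.argmax(log_likelihood + log_prior)]
print(theta_map)   # pulled from the MLE q1 = 0.75 towards the prior mean 0.5
```

Dropping the `log_prior` term recovers the maximum likelihood estimate $θ = q_1$.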

What if you want to go full Bayesian? The log of the posterior distribution $p(θ | x_1, \ldots, x_N)$ is again a function of the data only through the empirical distribution: \[ \log(p(θ | x_1, \ldots, x_N)) = -N\big(D(q||p_θ) + H(q)\big) + \log(π(θ)) - A \] where $A$ is a normalising constant. It is simply defined as \[ A := \log \Big( \int \exp\Big(-N\big(D(q||p_θ) + H(q)\big) + \log(π(θ))\Big) \dd θ \Big) \]
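And a sketch of the full posterior in the same hypothetical Bernoulli setting, computed on a grid; note that the data enters only through the empirical frequency $q_1$ and the sample size $N$:

```python
import numpy as np

N, q1 = 8, 0.75                        # the data, summarised by its size and empirical frequency
thetas = np.linspace(0.001, 0.999, 999)
dtheta = thetas[1] - thetas[0]

log_lik = N * (q1 * np.log(thetas) + (1 - q1) * np.log(1 - thetas))   # N * E_q[log p_theta]
log_prior = np.log(thetas) + np.log(1 - thetas)                       # Beta(2, 2) prior, unnormalised

log_post = log_lik + log_prior
log_post -= np.log(np.sum(np.exp(log_post)) * dtheta)   # subtract the normalising constant A
posterior = np.exp(log_post)                             # approximate posterior density over theta
print(thetas[np.argmax(posterior)])                      # posterior mode, i.e. the MAP estimate
```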

Here again, the dependence on the data is only through the empirical distribution $q$ (and the number of data points $N$).

