Let us, for simplicity, assume that the event space $X$ is finite.

Suppose we have a probabilistic model on this space, that is, a parametrised probability distribution, which I denote by $$p_θ(x)$$ or, equivalently, $p(x \mid θ)$.

Suppose now that we observe data: $N$ points $x_1,\ldots,x_N \in X$.

### Likelihood

A natural idea is to try to maximise the likelihood that this data appears. Perhaps a better formulation: we want to minimise the surprise that this data occurs, given our probability model. We'll see that there is a very good definition of the "surprise".

The likelihood of the data is just: $$p(x_1|θ)\, p(x_2|θ) \cdots p(x_N|θ)$$ It is more convenient to consider instead the log-likelihood: $$\log(p(x_1|θ)) + \cdots + \log(p(x_N|θ))$$

A first step is to group together the terms with the same value. For each value $v \in X$, let us write $$[x = v] := \# \{i \mid x_i = v\}$$ for the number of observations equal to $v$ (for instance, $[x = 3]$ counts the data points equal to $3$).

Now the log-likelihood becomes $$\sum_{v \in X} [x = v]\,\log(p(x = v\,|\,θ))$$

Here comes the insight: the really important information here is the empirical distribution, defined as $$q(x = v) := \frac{[x = v]}{N}$$

That is simply the observed frequency of each value.

And now the log-likelihood, rescaled by $1/N$, takes a very interesting form: $$\sum_{v \in X} q(x = v)\,\log(p(x = v\,|\,θ))$$

In other words, the rescaled log-likelihood is just the empirical mean of the log of the model:

$$E_q [\log(p_θ)]$$

This is quite beautiful!
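This identity is easy to check numerically. Here is a minimal sketch, assuming a toy categorical model on three values where the parameter $θ$ is the probability vector itself; it compares the log-likelihood computed term by term with $N$ times the empirical mean $E_q[\log(p_θ)]$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy categorical model p(x | theta) on X = {0, 1, 2}: for this sketch,
# theta is simply the probability vector itself (an illustrative choice).
theta = np.array([0.2, 0.5, 0.3])

# Observed data: N draws from some distribution on X.
data = rng.integers(0, 3, size=1000)
N = len(data)

# Log-likelihood computed directly, term by term.
direct = np.sum(np.log(theta[data]))

# Empirical distribution q(x = v) = [x = v] / N.
counts = np.bincount(data, minlength=3)
q = counts / N

# Rescaled log-likelihood: the empirical mean E_q[log p_theta].
empirical_mean = np.sum(q * np.log(theta))

# The two computations agree up to floating-point error.
assert np.isclose(direct, N * empirical_mean)
```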

### Cross and Relative Entropy

There is more to this: suppose we could choose $p(\cdot |θ)$ in any way we wanted. In other words, suppose our model is universal (the distribution $p$ itself is the parameter). Then, it turns out that the maximum possible value is attained when $p = q$! In equations: $$E_q[\log(p)] ≤ E_q[\log(q)]$$

By rewriting a bit, we obtain

$$E_q [\log(q/p)] ≥ 0$$

The quantity above is called the relative entropy of $q$ with respect to $p$, and is denoted by $D(q || p)$.

Perhaps an alternative name would be the surprise of “observing” the empirical distribution $q$ (remember, this is just the data), given a model $p$. Maximising the likelihood is then just minimising the surprise.

Note that the inequality $D(q||p) ≥0$ is fundamental in statistics. Perhaps it should be called the fundamental inequality in statistics.
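A quick numerical sanity check of this inequality; the helper `relative_entropy` below is a hypothetical name, and the distributions are assumed to be strictly positive probability vectors on the same finite space:

```python
import numpy as np

def relative_entropy(q, p):
    """D(q || p) = E_q[log(q/p)] for strictly positive probability
    vectors q and p on the same finite space (illustrative helper)."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    return float(np.sum(q * np.log(q / p)))

q = np.array([0.1, 0.6, 0.3])
p = np.array([0.3, 0.4, 0.3])

assert relative_entropy(q, p) >= 0            # the fundamental inequality
assert np.isclose(relative_entropy(q, q), 0)  # equality when p = q
```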

So, why is it that $D(q||p) ≥0$? There are several possible proofs. Here is one. Notice that $\log$ is a concave function, so that $-\log$ is a convex function. Convexity of a function $f\colon V \to \RR$ ($V$ a vector space) essentially amounts to Jensen's inequality: for any probability measure $\dd μ$ on $V$, $$f\Big(\int y \dd μ\Big) ≤ \int f(y) \dd μ$$ This can be nicely rewritten as $f(E[Y]) ≤ E[f(Y)]$ for any random variable $Y$ with values in the vector space $V$.

Now choose the random variable $Y \colon X \to \RR$ defined by $Y(x) = p(x)/q(x)$, and use the empirical distribution $q$ as the probability on $X$ (assume for simplicity that $q(x) > 0$ for all $x$). Then $E[Y] = \sum_x \frac{p(x)}{q(x)}\, q(x) = \sum_x p(x) = 1$, and $E[f(Y)] = \sum_x f\big(p(x)/q(x)\big)\, q(x) = E_q[f(p/q)]$. Now use the convex function $f(y) = -\log(y)$ in the inequality $f(E[Y]) ≤ E[f(Y)]$: $$0 = -\log(E[Y]) ≤ E_q[-\log(p/q)] = E_q[\log(q/p)] = D(q||p)$$
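The construction in this proof can also be checked numerically on a small finite space. A sketch, where the vectors `q` and `p` are arbitrary illustrative distributions:

```python
import numpy as np

# Numerical check of the convexity argument for f = -log with Y = p/q,
# using q as the probability on a three-point space.
q = np.array([0.2, 0.5, 0.3])
p = np.array([0.4, 0.4, 0.2])

Y = p / q
E_Y = np.sum(q * Y)             # equals sum(p) = 1
assert np.isclose(E_Y, 1.0)

lhs = -np.log(E_Y)              # f(E[Y]) = -log(1) = 0
rhs = np.sum(q * (-np.log(Y)))  # E[f(Y)] = D(q || p)
assert lhs <= rhs               # Jensen: f(E[Y]) <= E[f(Y)]
```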

### Prior and Regularisation

What about the maximum a posteriori distribution? That is, what if we have a prior $π(θ)$ on the parameter $θ$?

There is no big difference: the quantity to maximise is then $$N\,E_q[\log(p(\cdot \mid θ))] + \log(π(θ))$$ where $π(θ)$ is the prior distribution on $θ$. We can also reformulate this as minimising $$D(q||p_θ) - \frac{1}{N}\log(π(θ))$$ Here, one minimises the surprise $D(q||p_θ)$, with $-\frac{1}{N}\log(π(θ))$ as a regularisation term.
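As a concrete illustration, here is a sketch of maximum a posteriori estimation for a Bernoulli model with a Beta(2, 2) prior; the example is hypothetical, and the grid search simply maximises the log-likelihood plus the log-prior:

```python
import numpy as np

rng = np.random.default_rng(1)

# MAP estimation for a Bernoulli model (illustrative example):
# maximise  N * E_q[log p(. | theta)] + log(pi(theta)).
data = rng.integers(0, 2, size=50)
N, k = len(data), int(data.sum())   # k successes out of N trials

# Grid search over theta in (0, 1).
thetas = np.linspace(0.001, 0.999, 999)
log_lik = k * np.log(thetas) + (N - k) * np.log(1 - thetas)
log_prior = np.log(thetas) + np.log(1 - thetas)  # Beta(2, 2), up to a constant
theta_map = thetas[np.argmax(log_lik + log_prior)]

# Closed form for the Beta(2, 2) MAP: (k + 1) / (N + 2).
assert abs(theta_map - (k + 1) / (N + 2)) < 1e-2
```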

What if you want to go full Bayesian? The log-posterior is again a function of the data only through the empirical distribution: $$\log(p(θ \mid x_1, \ldots, x_N)) = -N\big(D(q||p_θ) + H(q)\big) + \log(π(θ)) - A$$ where $H(q) := -E_q[\log(q)]$ is the entropy of $q$ and $A$ is a normalising constant. It is simply defined as $$A := \log \Big( \int \exp\Big(-N\big(D(q||p_θ) + H(q)\big) + \log(π(θ))\Big) \dd θ \Big)$$

Here again, the dependence with respect to the data is only through the empirical distribution $q$.
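This last point is easy to verify: reordering the data leaves the empirical distribution, and hence the posterior, unchanged. A minimal sketch with a Bernoulli likelihood (the function name and model are illustrative assumptions):

```python
import numpy as np

def log_posterior_unnorm(theta, data):
    # Unnormalised log-posterior for a Bernoulli model with a flat
    # prior: just the log-likelihood (hypothetical example).
    return np.sum(np.log(np.where(data == 1, theta, 1 - theta)))

data = np.array([1, 0, 0, 1, 1])
shuffled = np.array([0, 1, 1, 0, 1])  # same counts, different order

# Same empirical distribution q, so the same (log-)posterior.
assert np.isclose(log_posterior_unnorm(0.7, data),
                  log_posterior_unnorm(0.7, shuffled))
```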

### References

Some references on information theory and Bayesian estimation are