Let us, for simplicity, assume that the event space $X$ is a finite set.

Suppose we have a **probabilistic model** on this space, that is, a parametrised probability distribution, which I denote by
\begin{equation}
p_θ(x)
\end{equation}
or, equivalently, by $p(x|θ)$.

Suppose now that we observe *data*.
That is, points $x_1,\ldots,x_N \in X$.

### Likelihood

A natural idea is to try to *maximise the likelihood* of this data appearing.
Perhaps a better way to formulate this is *we want to minimise the surprise that this data occurs, given our probability model*.
We'll see that there is a very good definition of the "surprise".

The likelihood of the data is just:
\begin{equation}
p(x_1|θ)\, p(x_2|θ) \cdots p(x_N|θ)
\end{equation}
It is more convenient to consider instead the **log-likelihood**:
\begin{equation}
\log(p(x_1|θ)) + \cdots + \log(p(x_N|θ))
\end{equation}
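Beyond the algebraic convenience exploited below, there is a practical reason to prefer the log-likelihood: a product of many probabilities underflows floating-point arithmetic, while the sum of logs stays well-behaved. A minimal sketch in Python (the Bernoulli model and sample size are made up for illustration):

```python
import math
import random

random.seed(0)

# A toy model: Bernoulli with parameter theta.
theta = 0.3
data = [1 if random.random() < theta else 0 for _ in range(10_000)]

def p(x, theta):
    """Model probability p(x | theta) for x in {0, 1}."""
    return theta if x == 1 else 1.0 - theta

# Raw likelihood: a product of 10,000 numbers in (0, 1) underflows to 0.0.
likelihood = 1.0
for x in data:
    likelihood *= p(x, theta)

# Log-likelihood: a sum of logs, numerically stable.
log_likelihood = sum(math.log(p(x, theta)) for x in data)

print(likelihood)      # 0.0 (underflow)
print(log_likelihood)  # a finite negative number
```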

But here comes a fundamental insight: **we can group together the terms corresponding to the same value**.
For each value $k \in X$, let us write
\[
[x = k] := \# \{i \mid x_i = k\}
\]
so that, for instance, $[x = 3]$ counts how many data points equal $3$.

Now the log-likelihood becomes \begin{equation} \sum_{k \in X} [x = k] \log(p(x = k|θ)) \end{equation}

And here is the key point: the really important information here is provided by the **empirical distribution**, defined as
\begin{equation}
q(x = i) := \frac{[x = i]}{N}
\end{equation}

That is simply the observed frequency of each value among the $N$ data points.

And now the log-likelihood (rescaled by $1/N$) takes a very interesting form: \begin{equation} \sum_{k \in X} q(x = k) \log(p(x = k|θ)) \end{equation}

In other words, it so happens that the rescaled log-likelihood is just the **empirical mean** of the log of the model:

\begin{equation} E_q [\log(p_θ)] \end{equation}

This is quite beautiful!
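This identity is easy to check numerically. A minimal sketch in Python (the data and the model table are made up for illustration):

```python
import math
from collections import Counter

# Made-up data on the event space X = {0, 1, 2, 3}.
data = [0, 1, 1, 2, 3, 1, 0, 2, 2, 1]
N = len(data)

# A model distribution p(x | theta); here just a fixed table for illustration.
p = {0: 0.1, 1: 0.4, 2: 0.3, 3: 0.2}

# Direct rescaled log-likelihood: (1/N) * sum_i log p(x_i).
direct = sum(math.log(p[x]) for x in data) / N

# Empirical distribution q(x = k) = [x = k] / N.
counts = Counter(data)
q = {k: counts[k] / N for k in p}

# Empirical mean of the log of the model: E_q[log p].
via_q = sum(q[k] * math.log(p[k]) for k in p)

assert math.isclose(direct, via_q)
```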

### Cross and Relative Entropy

There is more to this: suppose we could choose $p(\cdot |θ)$ in any way we wanted. In other words, suppose our model is universal (the distribution $p$ itself is the parameter). Then, it turns out that the maximum is attained at $p = q$! In equations: \begin{equation} E_q[\log(p)] ≤ E_q[\log(q)] \end{equation}

By rewriting a bit, we obtain

\begin{equation} E_q [\log(q/p)] ≥ 0 \end{equation}

The quantity above is called the **relative entropy** of $q$ with respect to $p$ (also known as the **Kullback–Leibler divergence**), and is denoted by $D(q || p)$. The quantity $-E_q[\log(p)]$ itself is the **cross entropy** of $q$ relative to $p$.

Perhaps an alternative name would be the **surprise** of “observing” the empirical distribution $q$ (remember, this is just the data), given a model $p$.
*Maximising the likelihood is then just minimising the surprise*.

Note that the inequality
\[
D(q||p) ≥0
\]
is **fundamental** in statistics.
Perhaps it should be called the fundamental inequality in statistics.
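A quick numerical sanity check of this inequality, on random distributions over a small finite space (a sketch; tiny offsets avoid vanishing entries, which would need a separate convention):

```python
import math
import random

random.seed(1)

def normalise(w):
    """Rescale nonnegative weights into a probability distribution."""
    s = sum(w)
    return [wi / s for wi in w]

def kl(q, p):
    """Relative entropy D(q || p) = E_q[log(q/p)], assuming p > 0 everywhere."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

for _ in range(1000):
    q = normalise([random.random() + 1e-9 for _ in range(6)])
    p = normalise([random.random() + 1e-9 for _ in range(6)])
    assert kl(q, p) >= 0.0
    # Equality holds exactly when p = q.
    assert math.isclose(kl(q, q), 0.0, abs_tol=1e-12)
```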

So, why is it that $D(q||p) ≥ 0$? There are several possible proofs; here is one.

Notice that $\log$ is a concave function, so that $-\log$ is a convex function. Essentially, the definition of convexity of a function $f\colon V \to \RR$ (where $V$ is a vector space) is that, for any probability measure $\dd μ$ on $V$,
\[
f\Big(\int y \dd μ\Big) ≤ \int f(y) \dd μ
\]
This can be nicely rewritten as
\[
f(E[Y]) ≤ E[f(Y)]
\]
for any random variable $Y$ with values in $V$ (this is **Jensen's inequality**).

Now choose the random variable $Y \colon X \to \RR$ defined by $Y(x) = p(x)/q(x)$, where $X$ is equipped with the probability $q$. Then $E[Y] = \int (p(x)/q(x))\, q(x) \dd x = \int p(x) \dd x = 1$, and $E[f(Y)] = \int f(p(x)/q(x))\, q(x) \dd x = E_q[f(p/q)]$. Applying $f(E[Y]) ≤ E[f(Y)]$ with the convex function $f(y) = -\log(y)$ yields $0 = -\log(1) ≤ E_q[\log(q/p)] = D(q||p)$.

### Prior and Regularisation

What about the maximum *a posteriori* (MAP) estimate? That is, what if we have a prior $π(θ)$ on the parameter $θ$?

There is no big difference: the quantity to maximise (again rescaled by $1/N$) is then
\begin{equation}
E_q[\log(p(\cdot | θ))] + \frac{1}{N}\log(π(θ))
\end{equation}
where $π(θ)$ is the prior distribution on $θ$.
We can also reformulate this as *minimising*
\[
D(q||p_θ) - \frac{1}{N}\log(π(θ))
\]
Here, one minimises the surprise $D(q||p_θ)$, with $-\frac{1}{N}\log(π(θ))$ as a **regularisation term**.
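As a concrete sketch (a coin-flip model with a Beta prior, chosen here purely for illustration), one can check that minimising the surprise plus the prior penalty over a grid of $θ$ recovers the usual closed-form MAP estimate $(h + α - 1)/(N + α + β - 2)$:

```python
import math

# Made-up coin-flip data: h heads out of N tosses.
N, h = 20, 14
q1 = h / N          # empirical distribution: q(heads); q(tails) = 1 - q1

# Beta(alpha, beta) prior on theta.
alpha, beta = 2.0, 2.0

def objective(theta):
    """D(q || p_theta) - (1/N) * log(prior(theta)), up to additive constants."""
    kl = q1 * math.log(q1 / theta) + (1 - q1) * math.log((1 - q1) / (1 - theta))
    log_prior = (alpha - 1) * math.log(theta) + (beta - 1) * math.log(1 - theta)
    return kl - log_prior / N

# Minimise over a fine grid of theta in (0, 1).
grid = [i / 100000 for i in range(1, 100000)]
theta_map = min(grid, key=objective)

closed_form = (h + alpha - 1) / (N + alpha + beta - 2)
assert abs(theta_map - closed_form) < 1e-3
```

The Beta normalising constant is dropped from `log_prior`, since an additive constant does not change the minimiser.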

What if you want to go full Bayesian? The log-posterior is again a function of the data only through the empirical distribution: \[ \log(p(θ | x_1, \ldots, x_N)) = -N\big(D(q||p_θ) + H(q)\big) + \log(π(θ)) - A \] where $H(q) := -E_q[\log(q)]$ is the entropy of $q$, and $A$ is the normalising constant \[ A := \log \Big( \int \exp\big(-N(D(q||p_θ) + H(q)) + \log(π(θ))\big) \dd θ \Big) \] (the term $-N H(q)$ does not depend on $θ$, so it cancels against $A$ and could be dropped from both expressions).

Here again, the dependence on the data is only through the empirical distribution $q$.
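One way to see this concretely: two datasets with the same counts (hence the same $q$) yield exactly the same posterior. A small sketch on a grid of $θ$ (Bernoulli model with a uniform prior, made-up data, all chosen for illustration):

```python
import math

def posterior_on_grid(data, grid):
    """Normalised posterior over a grid of theta values, with a uniform prior."""
    logs = []
    for theta in grid:
        # Log-likelihood of the data; the uniform prior adds only a constant.
        ll = sum(math.log(theta if x == 1 else 1 - theta) for x in data)
        logs.append(ll)
    m = max(logs)                          # subtract the max for stability
    w = [math.exp(l - m) for l in logs]
    s = sum(w)
    return [wi / s for wi in w]

grid = [i / 100 for i in range(1, 100)]

# Two different orderings of the same counts: same empirical distribution q.
data_a = [1, 1, 0, 1, 0, 0, 1, 1]
data_b = [0, 0, 0, 1, 1, 1, 1, 1]

post_a = posterior_on_grid(data_a, grid)
post_b = posterior_on_grid(data_b, grid)
assert all(math.isclose(a, b) for a, b in zip(post_a, post_b))
```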

### References

Some references on information theory and Bayesian estimation are

- On cross/relative entropy: *Elements of Information Theory*, by Cover & Thomas.
- On Bayesian estimation in general: *Machine Learning: A Probabilistic Perspective*, by Murphy.